Comet’s Opik: The Future of Open-Source LLM Evals

 

Trilogy Equity Partners’ Managing Director Yuval Neeman had the chance to sit down with Gideon Mendels, Co-founder and CEO of Comet, to talk about his journey since starting Comet, iterating through the pandemic, and helping shape the rapidly changing GenAI landscape.

The Spark: From Scattered ML Notebooks to Observability

Before founding Comet, Gideon was a software engineer turned machine learning researcher. His “aha moment” came during his time at Google, working on hate-speech detection for YouTube.

“Everything was kind of managed in Google Docs and emails and there was no history of what experiments they’ve attempted. There was no actual knowledge of exactly what’s running in production or what data set it was trained on.”

Despite access to massive datasets, Gideon recalls, the results were disappointing, and that was painfully obvious to anyone scrolling YouTube comments.

“Our models weren’t performing very well. I think one person called it the worst place on the internet.”

That frustration inspired him to call his longtime collaborator, Nimrod Lahav, and the two launched Comet in 2018 to bring reproducibility and observability to the ML world.

From ML to GenAI: Same Needs, New Surface Area

For several years, Comet focused on helping customers train their own models on proprietary data, offering experiment tracking, dataset versioning, and model monitoring. But with the rise of large language models, customers’ needs started to shift. The Comet team realized that customers didn’t want to train their own LLMs but still needed observability and evaluation features, such as tracing, for out-of-the-box models from OpenAI, Anthropic, and others.

“We launched Opik which is our product for teams building on top of LLMs versus training the models from scratch.”

The new paradigm doesn’t make evaluation any easier; if anything, it makes it harder.

“In the GenAI world… there’s just no way you’ll be able to cover everything or think of everything. So, you need to find a way to get confidence in that what you built worked. But it’s not using unit tests… you kind of have to compare using semantics.”

The lesson: while the technology shifted, the need for systematic evaluation didn’t. In fact, with more unpredictability, it became even more important.

Gideon explains, “Software engineering has unit tests. GenAI doesn’t work that way. You can’t compare two outputs character by character. You need a methodology for evaluating whether results are useful, safe, or aligned. That process looks a lot more like machine learning than like writing code.”
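To make the contrast concrete, here is a minimal sketch in Python. It is not Opik’s API: the `overlap_score` helper, the example strings, and the 0.5 threshold are all assumptions for illustration, and simple word overlap stands in for a real semantic metric such as embedding similarity or an LLM judge.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(expected: str, actual: str) -> float:
    """Jaccard word overlap: a crude, hypothetical stand-in for a semantic metric."""
    a, b = tokens(expected), tokens(actual)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


expected = "You can reset your password from the account settings page."
actual = "To reset your password, go to the account settings page."

# A unit-test style comparison fails even though the answer is acceptable.
exact_match = expected == actual

# A similarity score with a threshold (0.5 is an arbitrary choice) is more forgiving.
score = overlap_score(expected, actual)
passed = score >= 0.5

print(f"exact match: {exact_match}, overlap: {score:.2f}, passed: {passed}")
```

The two answers say the same thing in different words, so the character-by-character check rejects a response that the scored check accepts. Choosing the metric and the threshold is itself an empirical exercise, which is why the process resembles machine learning more than writing code.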

The Flywheel: Why Evaluations Matter Most

Gideon argues that evaluations are the secret ingredient that separates prototypes from production-ready GenAI systems.

“If anyone takes one thing from this kind of conversation, this is the flywheel that distinguish between GenAI applications that will never make it to production to those that are waiting at production. You have to make sure that you’re constantly adding things to that data set and testing against it every time you make a change.”

Without this loop, companies risk endlessly tweaking prompts or switching models without knowing whether they’re improving or breaking things. With it, they also build a proprietary dataset that compounds in value over time, which is a key tool for creating moats in the GenAI era and an aspect Trilogy considers when evaluating GenAI startups.
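The loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Opik’s implementation: `EvalCase`, `EvalDataset`, `pass_rate`, and the two stub functions `app_v1` and `app_v2` (standing in for LLM-backed applications) are hypothetical names, and a keyword check stands in for a real evaluation metric.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_mention: list[str]  # phrases a good answer should contain


@dataclass
class EvalDataset:
    cases: list[EvalCase] = field(default_factory=list)

    def add(self, prompt: str, must_mention: list[str]) -> None:
        """Grow the dataset, e.g. with a failure observed in production."""
        self.cases.append(EvalCase(prompt, must_mention))


def pass_rate(app: Callable[[str], str], dataset: EvalDataset) -> float:
    """Fraction of cases whose answer mentions every required phrase."""
    if not dataset.cases:
        return 0.0
    passed = 0
    for case in dataset.cases:
        answer = app(case.prompt).lower()
        if all(phrase.lower() in answer for phrase in case.must_mention):
            passed += 1
    return passed / len(dataset.cases)


def app_v1(prompt: str) -> str:
    # Hypothetical first version: only knows about password resets.
    return "Go to account settings to reset your password."


def app_v2(prompt: str) -> str:
    # Hypothetical revised version: adds handling for refund questions.
    if "refund" in prompt.lower():
        return "Refunds are available within 30 days of purchase."
    return app_v1(prompt)


dataset = EvalDataset()
dataset.add("How do I reset my password?", ["account settings"])
baseline = pass_rate(app_v1, dataset)

# A failure seen in production becomes a permanent test case.
dataset.add("Can I get a refund?", ["refund"])
after_new_case = pass_rate(app_v1, dataset)

# Every change to the application is re-scored against the full dataset.
after_fix = pass_rate(app_v2, dataset)

print(baseline, after_new_case, after_fix)
```

Each production failure added to the dataset lowers the score of the current version until a change fixes it, and the same dataset then guards against regressions in every later change.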

Launching Opik: Open-Source, Community Velocity

With Opik, Comet has codified the evaluation loop for GenAI and turned it into a best practice that any team can adopt.

“We launched Opik… it’s actually been kind of exploding, it’s the fastest growing project in that space in the last year or so.”

And the traction speaks for itself:

“We did zero to twelve and a half thousand [GitHub] stars in about eight or nine months… and we’re getting all these pull requests… it’s been growing extremely fast.”

Comet open-sourced Opik because the team considers it the only way to set a standard in a market moving this fast.

“Everything we ship today is community driven.”

But open-source software brings its own operational quirks:

“The biggest challenge with open-source is you just have no idea who’s using it and what they’re doing with it.”

Monetization also works differently. Gideon points to companies like Databricks and Grafana, which spent years focusing solely on adoption before layering on revenue. Comet has taken a hybrid path, thanks to its existing enterprise customer base, but still invests “95% of its time” in community.

Looking Ahead: A Future of Continuous Optimization

For Gideon, the future of GenAI won’t be static applications calling external APIs; it will be adaptive, continuously improving systems.

“Crazy to think in one fundamental way when we switch from ML to GenAI we actually went backwards. In 95% of traditional ML use cases that we power today with customers, the models get retrained every week, every month, every quarter… [and as] you get more data, you retrain. Then your model gets better, and you avoid things like drift because the world’s changing and so on. But when you think about GenAI if you put an agentic application in production and you’re using let’s say OpenAI or Anthropic, it doesn’t matter if you get one interaction in production or a billion your system will not get better. There’s no mechanism to make the base model better other than that manual tuning that the users are doing. So when I think about it, it’s a huge step back, right? Because that’s really what makes ML successful.”

Through Opik’s optimization frameworks, he envisions a world where every GenAI application has its own evaluation loop, retraining itself just like an ML model.

Closing Thoughts

Gideon reflects on what he’d tell his past self about the founder’s journey with a little humor and grit:

“Don’t do it. I’m joking… probably half joking, I mean you have no idea what you are walking into. So just hold tight.”

As CLOUDBREAK listeners know, that persistence has paid off. Comet today powers AI teams at scale, while Opik is rapidly defining cutting-edge GenAI evaluation.

The next time a founder spins up a GenAI project, Gideon hopes that step one will be ‘import opik’, as automatic as running ‘git init’.

 

You can listen to the full conversation with Gideon on all of your favorite platforms: YouTube, Spotify, Apple Podcasts, and Amazon Music.