Nicholas

How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

What you’ll learn:
- How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
- Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
- The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
- How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
- Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
- How to build a scoring function live and let an agent improve your prompt inside a safe playground
- How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
- Why fixing your CI is the highest-leverage way to speed up engineering velocity

Brought to you by:
Guru: The AI layer of truth
Persona: Trusted identity verification for any use case

In this episode, we cover:
(00:00) Introduction to Ankur Goyal
(03:00) Using AI agents for database optimization
(06:10) Running exhaustive benchmarks with coding agents
(09:03) Why staff engineers are wrong about AI limitations
(11:30) The “agent line” framework for delegation
(14:00) Ankur’s workflow: running 4 to 6 concurrent agents
(17:16) Technical setup: foreground agents, background agents, and cloud environments
(20:32) Spending time with AI tools
(23:06) Demystifying evals
(26:02) Live demo: Building an eval for documentation answers
(30:20) The alternative to evals: vibe checks and whack-a-mole
(32:09) Capturing designer taste in scoring functions
(33:13) Quick recap
(33:44) Managing velocity and throughput
(35:40) Why CI/CD investment is critical for AI-accelerated teams
(37:30) Ankur’s prompting strategy when agents fail
(39:10) Closing thoughts and how to connect

Tools referenced:
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

Other references:
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

Where to find Ankur Goyal:
LinkedIn: https://www.linkedin.com/in/ankrgyl/

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [redacted email].

Published Jun 15, 2026
Uploaded Jun 15, 2026
File type
Podcast

Full transcript

AI-generated transcript with timestamped sections.

0:00-1:44

[00:00] I'm still in, as I say, the year of our cloud, 2026. I still talk to engineers that say AI on our most complicated things cannot do a good job. I so viscerally disagree with that. There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who's using an agent. Everyone should take a hard look in the mirror and reevaluate how they spend their time. [00:30] that you're making. And I think like many of these things to me fit below the agent line. I think the agent line keeps going up. Why do you think this concept is so important to understand? How can you just demystify it for folks who are a little intimidated by it? Now that models are so good at actually writing code, one of the best things that we can do is create really hard evals. And if you create the right tests and success criteria for a model, then it can be really creative and it can work on this stuff in the background and actually try to [01:00] go so far as to turn my own taste or my own skills or my own expertise into a system. I'm functionally just building my own replacement. We're able to have David's palette applied to more things. I think the quality bar that we're able to hit is higher because we're able to get more things to that bar. Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. [01:28] Today I have Ankur Goyal, the CEO of Braintrust. And this is a technical one. So if you're a senior or staff engineer or a VP of engineering or CTO, this is one you're really going to want to pay attention to. And we're going to talk about how coding agents can help you bite off really technical

1:44-3:31

[01:44] architecture and infrastructure work in a way that no other human engineer could before. [01:49] We're also going to demystify evals for folks and just show you exactly how you can use them [01:55] to make your AI products better without having to touch a thing. Let's get to it. [01:59] This episode is brought to you by Guru, the AI layer of truth for your company's knowledge. Here's the problem. Your AI is only as good as the information you feed it. Most companies are getting confident but wrong answers from AI because their underlying knowledge is outdated, incomplete, or just plain incorrect. Bad information doesn't just slow you down. It costs you money and puts you at risk. Guru solves this by adding a verification layer between [02:29] your company's knowledge and AI tools. Instead of just hoping your AI gets it right, [02:34] Guru automatically scores content for accuracy, flags outdated information, and ensures your team gets trustworthy answers every time. It works with the tools you already use, so you don't have to change how you work. Thousands of companies trust Guru to keep their AI accurate and compliant. Ready to stop playing Russian roulette with your company's knowledge? Visit GetGuru.com to learn more. [03:01] Welcome to How I AI. I'm excited to have you here. [03:04] I'm super excited to be here. Thanks for having me. So I'm going to make you laugh, but I recently did an episode about the recent GPT 5.5 model release. And I know you and I use Codex. And one of the funniest comments in that post was, Claire, can you do an entire episode about tech debt? And we were talking before we got on the recording. You're like, how technical and how nerdy is this audience? And I'm like, bring it on. So

3:31-5:02

[03:31] We are going to talk a little bit about how [03:34] you approach engineering and then how you use AI to do things like [03:39] optimize slow queries. So let's hop in. Tell me about your approach to software engineering in the age of AI. You know, I spend a lot of time working on software for doing evals and observability, and that's kind of shaped my own perspective about software engineering. [03:56] Like now that models are so good at actually writing code, [04:00] one of the best things that we can do is create really hard evals. And I'm not talking about like AI evals. I mean things like, why is this query so slow? [04:10] And if you create the right tests and success criteria for a model, then it can be really creative and it can work on this stuff in the background and actually try to improve a bunch of things. So one of the things that I spend a lot of time on right now is making [04:25] the queries that people run in our product faster. And people can just write arbitrary queries like, you know, they can... [04:31] There's an example of someone who's trying to find, like, [04:34] a needle in a haystack of some specific kind of interaction someone had in their product. [04:39] And they're looking at like billions and billions of traces. And they want to find like the 5,000 or something that match. And this is over like... [04:47] a 90-day period or something, like a lot of data. [04:51] And that's one example of a query in like, okay, there are all these things that you can do in database literature, like, [04:58] different indexes you can build and different ways you can prefetch data and blah, blah, blah, all this stuff.

5:02-6:33

[05:02] But how do you try all those things? And how do you run all the experiments required to actually do something like this? [05:10] So what we do and what I've personally spent a lot of time working on [05:13] is trying to figure out, you know, manually is fine, but automatically is even better. Like, what are the [05:19] patterns of queries that people are running [05:22] that are slow, [05:23] and then we will reproduce those things [05:26] and use a coding agent [05:29] to try out a bunch of ideas from database literature. So like download a bunch of data locally, and then maybe try different, in this case right now, I'm trying out different column store formats. So we use an index underneath the scenes called Tantivy, which has a built-in column store. [05:46] But it's not that great. Like the thing overall is great, but their column store is not like that great. [05:51] And so what we're doing right now is like exhaustively trying [05:55] every open source column store format out there, and then exhaustively trying every column store execution engine out there and sort of computing the matrix of this. [06:05] And, you know, it's like it's amazing. [06:08] I completely agree. As somebody who has led engineering organizations for a really long time, [06:13] when you're trying to make [06:15] infrastructure platform core component changes in your application, [06:20] because of both the cost of implementing those being very high [06:26] and then the unknown unknowns being quite risky, teams are actually pretty risk averse in terms of making...

6:33-8:04

[06:33] big platform shifts or changes to their core implementation. It's like the thing that you shipped is the thing that you get stuck with, certainly [06:41] on the engineering side. [06:43] And what I love about AI right now and these coding agents in particular, and then Codex in particular, is it has been the only setup, Codex plus these GPT models has been the only setup [06:57] where I have been able to set up a very similar process, which is the outcome I want is XYZ. [07:04] We need to programmatically test [07:07] against pretty long tail... [07:10] data structures [07:11] to figure out which of these potential solutions are going to get us closer to the outcome we want. [07:17] In your instance, it's database query speed and latency. [07:21] In my instance, I was doing a very, you can appreciate this, very complex data migration of stored, structured and unstructured data generated by AI. So it was all messed up to begin with. And then I had to migrate it to... [07:35] a schema. And so it was like schema to schema migration, [07:39] millions and millions and millions and millions of rows and lots of edge cases. [07:44] And doing that as a human... [07:46] takes forever. [07:48] You know, you could script it and you can like bang some systems against it, but then your human ability to manage those cycles and say, yes, that's right, or no, that's wrong, or this gives us indication that we should go left or that gives us indication we should go right. [08:01] And so I do feel like this combination of like,

8:04-9:36

[08:04] a very precise outcome [08:06] and an agent that's smart enough to bang its head [08:09] against a really, really long tail of problems [08:12] with a guided sense of the technical space, it does really well. And I have not heard this [08:19] on the kind of like data store side, it's really interesting. [08:23] But I just think, hey, engineering leaders out there, I've been in so many debates about [08:28] what we're using for our data store, [08:31] how we optimize performance, what technologies we should bring into the stack versus not. [08:35] And you can run those like very, very iterative loops [08:38] on-- I'm presuming you're using production-like data [08:42] or real representative queries to test that. Is that right? You can actually use production data too. [08:50] But for some subset of things and with the right [08:53] engineering in place, you can just run on production data. [08:56] Yeah. And in many ways, it's a lot safer than having humans test on the production data because no one's looking at it. [09:03] Yeah, and this is where I have so many staff engineers [09:07] be really, really cynical about does AI have a place in... [09:14] their coding tools. I'm still in... [09:17] as I say, the year of our cloud, 2026, we still, I still talk to engineers that say AI on our most complicated things [09:26] cannot do a good job. Oh, I so viscerally disagree with that. Same. Tell me why you disagree. Well, I mean, I think so. I've been working on databases for almost two decades.
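
The "matrix" of column store formats against execution engines described above can be pictured as a small benchmark harness. The following is only a sketch under stated assumptions: the format and engine names, `run_query`, and the workload are hypothetical placeholders, not anything Braintrust or Tantivy actually exposes. In a real experiment each cell would load the reproduced slow query's data in one format and execute it on one engine.

```python
import itertools
import statistics
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical placeholders for real column store formats and execution engines.
FORMATS = ["format_a", "format_b"]
ENGINES = ["engine_x", "engine_y"]


def run_query(fmt: str, engine: str, rows: List[int]) -> int:
    """Stand-in workload: count the rows matching a predicate.

    A real harness would dispatch on `fmt` and `engine`; this one ignores
    them so the sketch stays self-contained.
    """
    return sum(1 for r in rows if r % 1000 == 0)


def benchmark(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time of `fn` over `repeats` runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def run_matrix(rows: List[int]) -> Dict[Tuple[str, str], float]:
    """Benchmark every (format, engine) pair and return the median time per cell."""
    results = {}
    for fmt, engine in itertools.product(FORMATS, ENGINES):
        results[(fmt, engine)] = benchmark(lambda: run_query(fmt, engine, rows))
    return results
```

Calling `run_matrix(list(range(100_000)))` returns one median timing per cell. An agent running the loop would then sort the cells, discard the clearly slow combinations, and rerun the promising ones on larger or more production-like data.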

9:36-11:09

[09:36] There's not many things that staff, whatever, risk averse, blah, blah, blah, all that stuff you could apply to than like literally building a database. [09:44] If you work on a database, [09:45] We recently added this like fancy index thing into Braintrust that uses Bloom filters. [09:51] And by the way, we discovered that that would be a practical solution to the problem after running like a week of continuous experiments with different types of indexes. [10:00] Bloom filters kind of have a bad reputation, but they... [10:03] They worked out to be very effective in this case. [10:05] So if you build something like that, usually what happens is, [10:09] The very best engineers will run a few benchmarks. [10:12] And then you'll send it to your peers and then your peers will [10:15] shit all over it and rip it apart and say, you didn't benchmark this, you didn't benchmark that. [10:19] And what you do is you prioritize the top few benchmarks and then you probably bullshit the rest. [10:24] Like, oh, I didn't benchmark this. However, if you read the code, you'll see it's not N squared. It's log N. And so this is not going to happen. [10:31] And like half the time you're wrong. Now there is no excuse [10:36] to not do those benchmarks. So now, I love it. Like we don't, I'm not spending my time like sitting and like typing the benchmark code. [10:43] But I'm talking to people, we're looking at the code, we're looking at the thing, we're like, okay, well, we benchmarked how much faster it makes the queries. Did we actually do a good job of benchmarking how much slower it makes indexing? Oh, shit. No, we didn't do that. [10:57] And so we actually spent some time doing that and we discovered that [11:00] we were doing a terrible job at indexing it efficiently. And so we spent a lot of time on that. And I could sort of, I don't agree with this. I could

11:09-12:42

[11:09] I can empathize with the argument that [11:12] models aren't good at writing highly concurrent code or they're not good at writing very performance sensitive code. [11:17] But, [11:18] There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who's using an agent. And even that baseline is just incredible. [11:32] I agree. And I think there's this theoretical quality and then there's this practical quality, right? In a theoretical way. [11:39] ideal world in which we [11:42] don't sleep. And every time we sit down at our laptop, we end up writing perfect code. [11:47] And in a theoretical world in which those benchmarks are, all of them are run, not just the BS ones. Like in that theoretical world, you could theoretically say perhaps in some... [11:58] untested case, you get better quality when humans are hands-on. [12:01] But the practical application is you lose. [12:04] context over like the humans lose context on the problem over days. You have a decaying attention span. [12:12] towards [12:13] hard but tedious problems, [12:15] And so I do think the practical quality goes down. And so I tell people like the practical quality of integrating AI into your engineering process on very hard technical problems. [12:27] goes up simply because of how hard you can run at the problem and how long and consistently you can run against the problem. And then what I was going to go back to saying, which is you can bite off [12:39] much more interesting technical challenges

12:42-14:15

[12:42] with AI as your sidecar than you could before. [12:46] again, practically, because your company can support [12:50] the cost of doing so. Right. [12:53] If you're like, I want to sequester all my staff engineers to solve our database indexing problems for the next year. And we're just like really going to go deep in the weeds. And we're going to test these six different open source solutions to this. And we'll come back in a year and we'll tell you if we figured it out, not that we figured it out. [13:10] Business is like, no, you know, you're you're they're a CEO like, no, no, thank you. [13:15] But if you say, hey, we're going to have this thing in the background and we're going to check on it, we're going to make expedient progress. [13:21] And we can ship other stuff while we're at it. I think that's a really easy... [13:25] Yes. [13:26] Absolutely. Yeah, I mean, I think the motto that we have now is there's just no excuse to not have rigor. Like if and there's no excuse to not have performance. If someone complains about something. [13:36] If someone complains about [13:38] a paper cut in the UI, you know, whatever it is, there's just no [13:41] We don't really have a backlog. There's no excuse to just not improve these things. [13:47] Yeah. And for folks looking for we don't have a backlog inspiration, we just interviewed Brian from Intercom, who said their goal is like backlog zero. [13:56] Nothing in the backlog. [13:58] so that everything can get shipped okay [14:00] So we're solving really technical problems. I think this is a great approach. [14:04] How are you? [14:05] engineering with AI, because I love that you're still, you know, you're writing code, you're spending time on this. [14:10] Any tips or tricks for how you're managing your fleet of agents that you think are unique?
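
For readers unfamiliar with the Bloom filters mentioned earlier, here is a minimal textbook version, together with a helper that times both building the filter and querying it. This is a sketch under stated assumptions: the transcript does not describe how Braintrust's index is implemented, and the class, the bit and hash counts, and `time_build_and_lookup` are illustrative names chosen here. The helper mirrors the point in the conversation that indexing cost deserves a benchmark just as much as query speed.

```python
import hashlib
import time
from typing import Iterable, Iterator, Tuple


class BloomFilter:
    """Textbook Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive all positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def time_build_and_lookup(items: Iterable[str], probes: Iterable[str]) -> Tuple[float, float]:
    """Return (seconds spent indexing, seconds spent querying)."""
    bloom = BloomFilter()
    start = time.perf_counter()
    for item in items:
        bloom.add(item)
    build_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for probe in probes:
        bloom.might_contain(probe)
    lookup_seconds = time.perf_counter() - start
    return build_seconds, lookup_seconds
```

Reporting the two numbers side by side makes it harder to ship a change that speeds up queries while quietly slowing down indexing.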

14:15-15:48

[14:15] I think that everyone should take a hard look in the mirror and reevaluate how they spend their time. There's a lot of interactions that you have or direction that you're giving or decisions that you're making. And I think like many of these things... [14:28] to me, fit below the agent line. And to me, the agent line is like, [14:32] if I or whoever would be at the meeting or whatever, like if we equivalently took the information that we're discussing and we just gave it to an agent, would it [14:41] solve the same problem? And I think the agent line keeps going up. [14:45] And also I think the best people... [14:47] are pushing the agent line inside of their company by being smart about what skills they're writing and what integrations they're building and so on. [14:54] So once you do that, you likely... [14:57] have a lot more time than you thought you did. I don't take any meetings after 12. This is the last meeting of the day for me. [15:03] And that means that every day I am able to, in the Paul Graham framework of maker versus manager schedule, every day I'm able to enter the level of focus that's required to be in the maker schedule. And so I personally write a lot of code and I spend a lot of time writing code and I haven't spent as much time writing code in a while and I really love it. [15:24] That's number one, is like make the time. [15:26] My workflow is very simple right now. [15:30] We don't have a great background agent set up yet. I think that we are exploring various things and trying to get there, but I have usually five or six foreground agents running on my computer. [15:43] Each one is a tmux session. Right now I have four things I'm working on. So each one is a tmux session.

15:49-17:22

[15:49] They're named Braintrust 1 through Braintrust 4. And, you know, each of these has, like, some UI running, and it has some services running. There are... [15:58] problems like port collisions, like I can't [16:01] isolate everything as much as I'd like to. And I think that [16:04] there are a lot of solutions for trivial software that do this. There's not a lot of solutions for... [16:09] complicated software yet and I'm excited. I mean, everyone I talk to is building their own thing. I just met a startup that's like two months old and they built their own internal tool for doing [16:18] background agent PRs, [16:20] which is... [16:21] I don't judge them for it. Like, I don't know what else they would do. [16:25] But it's kind of crazy. And then I also have remote ones. So here's one where I'm working on [16:30] trying to improve our column store performance. [16:33] And this is running on not real data, but close to real data. And it's running remotely. And it's running at much more scale. And I mean, if I ran this on my computer, [16:45] it would probably die from just how much [16:49] compute it's using. But I'm able to, in this case, test like, what's the real latency between EC2 and S3 if I'm trying to do like 4,000 concurrent reads? Is it enough? Is it not enough for this workload? Can I interleave things whenever properly... [17:04] And I've been running this experiment for several days, just trying to figure out like what's the best, you know, right now I'm talking to it about what the indexing lifecycle should be because I think we figured out [17:15] how to make the queries fast enough. Some people are going to be listening to this and be like, oh my gosh, this is so technical. I don't have these problems. Let me take a step back for folks.

17:22-18:56

[17:22] and tell you what I think I'm seeing here, which is one, [17:26] you're using Codex, right? [17:28] Yeah. [17:29] Codex for hard problems, people. I'm telling you. I think it's currently the only model [17:35] that will disagree with you regularly. And I think if you're working on hard problems, it's very important. [17:40] And then for you, what I'm also hearing is you're using foreground agents. You basically have a personal concurrency limit of like, let's call it four, which is about what I can do as well. So I think people ask me all the time, [17:54] how do you handle all this context? I'm like, I don't do more than I think I can do at any one time. [17:59] And I also, I have more trivial problems than you. So I think you're right in that the current sort of commercial background agents, I would call them, that you can buy off the shelf [18:09] work very well for, like, standard web apps. I'm very happy with them. If you are not using one of them as an engineering organization, maybe it's like doing classic SaaS, [18:18] highly, highly, highly recommend. [18:20] But I am hearing more and more from teams two things that you called out. [18:25] I'm hearing more and more people are just building their own background agents. [18:29] So it's happening. It's happening in teams very, very big and very, very small. I think the primitives are there to start experimenting with it. And so I don't think it's going to be as surprising to us to hear about people building their own [18:43] internal coding background agents, even if like core infrastructure is something from the big model providers. [18:50] I think [18:50] the second thing that I'm hearing a lot, and we heard this from the Stripe team, is investment in...

18:57-20:31

[18:57] cloud development environments and remote computing again because [19:03] if you were to run some of the stuff, especially the data heavy stuff on your computer, it starts to sound like an airplane taking off. It's no good. [19:10] And then the last thing I heard you say, which is like ports, I joke with everybody, I say worktrees everywhere. [19:16] Ports 3000 through 3009 accounted for, like, [19:20] I am just like every everything. [19:22] And I have to call out Chris Tate at Vercel released a thing called Portless, which just makes managing multiple ports, local host ports on your local machine a little nicer. So for simple things, I would go look that up. We'll link it in the GitHub show notes. But common problems that I think people have running concurrent engineering processes [19:41] on their own machine and then the like meta thing which is just like make time [19:47] to code. [19:48] You need it. Everyone. [19:50] I also don't take meetings after one. Sometimes I'll do podcasts in the early afternoon for folks, but all afternoon I'm just like in my real state, which is hoodie on, [20:00] bad posture. I think that I'm sure you feel this too, but like [20:04] when I was handwriting most of my code, [20:07] I would enter this sort of like euphoric flow state where I, you know, just completely focused on a problem. [20:14] And then when I started doing a lot of agent coding, I lost that for a little bit, but now [20:18] when I'm writing code, you know, Lane 8 just released a new album yesterday. You should listen to it. [20:23] Put on your hood and your headphones. I'm like totally back in that state now, just doing a different workflow.
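
One simple way to avoid the port collisions described above when several agent sessions run the same stack side by side is to give each session its own non-overlapping block of ports. The sketch below is an illustration only: the base port, block size, and service names are assumptions made here, and it is not how Portless or either speaker's setup actually works.

```python
from typing import Dict, List

BASE_PORT = 3000   # assumed first port of the first session's block
BLOCK_SIZE = 10    # assumed number of ports reserved per session


def ports_for_session(index: int, services: List[str]) -> Dict[str, int]:
    """Map each service of session `index` to a port in that session's block.

    Session 0 gets 3000-3009, session 1 gets 3010-3019, and so on, so two
    sessions can never be handed the same port.
    """
    if index < 0:
        raise ValueError("session index must be non-negative")
    if len(services) > BLOCK_SIZE:
        raise ValueError("more services than ports in a block")
    start = BASE_PORT + index * BLOCK_SIZE
    return {service: start + offset for offset, service in enumerate(services)}
```

For example, `ports_for_session(0, ["ui", "api"])` yields ports 3000 and 3001, while session 3 gets 3030 and 3031. The scheme is deterministic, so a session's ports can be recomputed from its index alone, but it does not check whether another program already holds a port.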

20:32-22:03

[20:32] Yeah, and I'll give folks the sort of, you know, AI mom of the internet that I try to be. [20:37] Which is I do feel like a lot of people are, they kind of go into two camps. [20:42] They are having more fun than they've ever had before. And they're back in the flow state of like what got them into software engineering or building or technology or whatever. [20:50] Or they're approaching like, [20:52] Claude anxiety and... [20:54] burnout breakdown because they feel this like productivity... [20:59] anxiety and they're not I think I think what I see is that people feel like if they're in a meeting and they're not kicking off agents, they're doing something wrong or they're talking to somebody and they're not kicking off agents or doing that. [21:09] And I just say, like, I like the idea of chunking your time with AI a little bit more. Yeah. I think it just narrows you on the more productive pieces of it and is also just a more enjoyable way to get stuff done. [21:24] Yeah, I had a phase, which I think I'm over. You know my wife, Alana. [21:29] where we would have dinner together usually pretty much every night. - Yeah. - And so, [21:36] I had a phase where my laptop was not at the table, but open and on the couch. And I think I've progressed beyond that phase now. So now the laptop is closed. And I think it's an important thing. [21:50] I agree. When I was first using OpenClaw, I installed it on an old MacBook. [21:55] And it would like stay open on our kitchen island, which is where all our plugs are. And it would like hover over us at dinner and hover over us at...

22:03-23:51

[22:03] at breakfast and if it got moved i was like where is polly is she alive is she open is she closed so yes close your laptop [22:11] People close your laptop. [22:13] This episode is brought to you by Persona. You're learning to build with AI, but there's an important question you need to ask. Who is actually using your product? Is it a legitimate user, a bot or a fraudster? [22:26] Brex, Figma, Etsy, and Twilio trust Persona to answer that question. With Persona's identity verification platform, you can create branded experiences, automate fraud prevention, and know who is human online. [22:40] That makes it easy to give good users an experience that makes them feel welcome and to stop bad actors from causing damage. And for those of you building in the AI agent space, Persona helps you verify the identities of people, [22:54] businesses and developers behind agents. It's how companies like Lithic and Skyfire are pushing the frontier of agentic commerce. [23:02] Learn more at withpersona.com. [23:06] All right. So, you know, we covered the first half of this episode, which I think is very interesting for technical folks, how to have kind of like long running or just really diligent agents run against technical problems to give you real benchmarks about performance on changing things. [23:22] I love that. Second thing is just your core workflow on how you do coding, both how you dedicate time and then technically just what your workflow looks like. [23:30] Let's talk about... [23:32] evals. [23:33] Because I feel like this is something that's very intimidating to a lot of people. And obviously you build a product that supports this. But taking a step back, why do you think this concept is so important to understand? How can you just demystify it for folks who are a little intimidated by it? Machine learning specifically shifts the task of...

23:51-25:24

[23:51] programming from being about [23:53] the how to being about the what. [23:55] And this is true. [23:57] Like, forget about LLMs, like, you know, it's true [24:00] with... [24:00] Let's say like you're back in like middle school, you're doing like, remember, statistical regression. [24:06] You're not defining the... [24:08] You're computing what the slope and the y-intercept should be. You're not defining it. [24:11] But you give it all the points, which are the what, not the how, which is the slope and the y-intercept. [24:17] And I think that, you know, the cool innovation around [24:20] like transformers and the next token prediction task, which lets you, you know, ablate tokens and do all this cool stuff. It's all about saying like, okay, [24:30] here's the compute substrate. [24:32] And here's the what, which is the outcome. It's predicting the next token. [24:37] Can you go and... [24:38] use a lot of GPUs and figure out how to achieve that? [24:42] And I think that [24:43] if you... [24:45] take that as inspiration for anything you do with AI, then you're able to be more productive. And in [24:51] traditional programming, like what we just talked about, I'm not [24:54] dictating [24:56] exactly the implementation or even the set of algorithms that we're using to solve problems. I'm just trying to define [25:02] very succinctly what the [25:04] problem is and why it is a problem and how to assess the solutions to the problem. [25:10] It also applies to building AI software, and that's what evals are all about. Evals are... [25:15] a methodology for you to say, [25:17] this is what success looks like. In my opinion, evals are actually the modern version of a PRD.
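
The regression analogy can be made concrete with ordinary least squares: you supply the points (the "what") and the slope and y-intercept (the "how") are computed for you. This is a standard closed-form fit written out as a small sketch. The function name and the sample points are chosen here purely for illustration.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through `points`."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Given points that lie exactly on y = 2x + 1, such as (0, 1), (1, 3), (2, 5), and (3, 7), the function recovers a slope of 2 and an intercept of 1 without anyone having specified either value up front.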

25:24-26:55

[25:24] So a PRD, you would say, hey, in prose, this is what success looks like. [25:29] Evals are also often written in prose, [25:33] but you supplement that with examples. So, you know, the best PRDs, [25:39] they have good examples. Maybe someone's made a demo or... [25:43] written out like a user story or something. It's the same thing. The difference with evals is [25:50] you encode those user stories in a way that [25:54] can be quantified to some extent and then you let a model or whatever figure out the how and you are really focused on the what. Give an example of how you use this in product development just to make it a little bit more tangible for folks. Yeah, let's start with something that I think is quite straightforward and then we can venture into the less straightforward stuff as we go. Okay. [26:16] This is our UI, and I'm working on a very simple task here, which is [26:22] I'm trying to create a prompt that will be part of an agent [26:26] that is good at answering questions about Braintrust documentation. So we looked at a few questions that people are asking [26:34] in our docs and we just put them into a data set. You can like [26:37] upload a CSV file, like it doesn't matter. It's just come up with a list of some questions or you can auto-generate them, you know, whatever. Just start somewhere. [26:45] And wrote like a very basic prompt. [26:47] We're gonna use GPT 5.4 mini, and I attached an MCP server. So I attached the Braintrust MCP server,

26:55-28:25

[26:55] We were also playing around with Context7, which indexes docs for you. You could also turn off the MCP and just see what the model already knows about your product. They're getting pretty good at knowing about every product now as well. [27:08] And here I just ran it. And so you can see some of the answers. [27:13] I'm going to be honest, though. I don't really want to read all of these manually. [27:18] And so [27:19] what I would usually do is I just start by saying like, [27:23] hey, can you [27:24] come up [27:26] with a good scoring function [27:29] for these outputs? [27:31] I care about having concise, [27:35] code snippets only using one language, [27:39] and [27:40] let's say [27:42] avoiding em dashes. [27:45] What? [27:45] Always. Yeah, of course. [27:48] And so now in this case, GPT 5.4 is going to go and actually look at all this stuff for me. [27:55] And it's going to look at some of the outputs, and it's going to rerun stuff, and [27:59] it'll kind of do its thing, [28:01] and it's going to come up with a new scoring function. One of the things I think, by the way, that's kind of cool about this workflow in general, and I expect to see this in more products over time, [28:11] is that you'll notice like I have this in the equivalent of like unhinged mode of a coding agent, [28:18] which is sometimes dangerous to run on your machine. [28:21] But this agent is running inside of this playground and it's using like

28:25-29:56

[28:25] data and some prompts and stuff. So the risk of letting it just go and try stuff out is actually very low. And so I think I'm excited just generally about seeing agents in more environments outside of [28:39] my local computer with bash and something that's very dangerous. And it could screw up my life if it goes wrong. [28:46] I'm excited about just having more agents that sort of run in these types of environments and do whatever they want. Like, I don't even know what this is doing right now, but we'll find out in a few minutes. I'm really excited about this. And just for people that are not watching or need just another set of context, basically what you did is you took these questions [29:02] that people are asking in your docs site or search or whatever chatbot about how the product worked. [29:08] You built a little prompt to answer those questions, and then right now you're building, [29:14] you're having AI build a score that tells you how well these questions are getting answered based on like a very loose definition of what [29:21] you want it to do. [29:23] And then is that scoring mechanism applied across all of these so you can actually rank it? Yes. Yeah, yeah. I think it's going a little bit awry, actually. So I'm going to switch to this one, [29:34] which is a little bit better. We do it live. We love a live demo. I know. And let's use Claude and give it a shot. [29:42] So this one is a little bit cleaner, and it actually wrote a prompt. Well, let's use a smarter model. It didn't pick the smartest model. [29:50] It wrote a prompt which takes the input and the output, and then it evaluates it on these criteria. Yeah.
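To make the scoring step more tangible, here is a minimal sketch of what a scorer for the three stated criteria could look like. It is an editorial illustration under several assumptions: in the demo the scorer is a model-written judge prompt, whereas this is a deterministic stand-in that does not use the Braintrust SDK; the names `score_answer` and `MAX_WORDS`, the 150-word budget, and the reading of "concise" as its own criterion are all hypothetical.

```python
import re

FENCE = "`" * 3  # built this way so the sketch can sit inside a fenced block
# Matches an opening fence that names a language, e.g. a line of FENCE + "python".
FENCE_LANG = re.compile(r"^" + FENCE + r"([A-Za-z0-9_+-]+)[ \t]*$", re.MULTILINE)
MAX_WORDS = 150  # assumed conciseness budget, not a number from the episode
EM_DASH = "\u2014"


def score_answer(output: str) -> dict:
    """Score one answer from 0 to 1 on each criterion, plus an overall mean."""
    words = len(output.split())
    concise = 1.0 if words <= MAX_WORDS else MAX_WORDS / words
    # Collect the distinct languages named on opening code fences.
    languages = {m.group(1).lower() for m in FENCE_LANG.finditer(output)}
    one_language = 1.0 if len(languages) <= 1 else 0.0
    no_em_dash = 0.0 if EM_DASH in output else 1.0
    scores = {
        "concise": concise,
        "one_language": one_language,
        "no_em_dash": no_em_dash,
    }
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores


# Two made-up answers: one follows the criteria, one breaks two of them.
good = "Use the SDK:\n" + FENCE + "python\nprint('hi')\n" + FENCE
bad = (
    "Use either " + EM_DASH + " your choice:\n"
    + FENCE + "python\nx = 1\n" + FENCE + "\n"
    + FENCE + "typescript\nconst x = 1\n" + FENCE
)

# Apply the scorer across the whole set, then rank and aggregate.
outputs = {"good": good, "bad": bad}
ranked = sorted(
    ((name, score_answer(text)["overall"]) for name, text in outputs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
average = sum(score for _, score in ranked) / len(ranked)
print(ranked, round(average, 3))
```

Because every output gets the same numeric treatment, the results can be ranked or averaged across the dataset instead of being read one by one, which is the property the question about ranking is getting at.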

29:56-31:29

[29:56] It is a pain in the ass to write these criteria out by hand, so it's really nice [30:00] to just let a model do it for you. Yep. And what we can do is run it and it will quantify [30:06] how well the model does on these criteria. And then we can look at it like one by one, or [30:16] what I actually tend to do nowadays is look at it in aggregate. And so the scores will start coming in here. What do you see people doing as an alternative to this that you think is less effective? One is just not doing it. I know a lot of people. [30:30] Yeah, I mean, I think that a lot of people, and I fall into this trap [30:33] myself, [30:34] despite working on this product, so there's no judgment for doing this. [30:39] But I think what a lot of people do is they just try stuff out on one or two examples and they try to generalize from that. [30:45] And [30:46] frankly, I don't think that's a bad idea. I think that vibe checks are extremely important. But what happens is that if you do this, you end up playing kind of like a whack-a-mole game. So you might make it really good at one or two things, then you ship it, and then it's not good at something else. And what we do, we have this designer named David. [31:05] And David is really cool. Like he dresses well. He's into the latest music. He likes music [31:14] before other people do. He told me that when he was a kid, [31:17] he played soccer and everyone had black shoes. [31:20] And he wanted the orange ones. And then the next year, everyone wanted the orange shoes. So he's like that kind of person, right? Yeah. And we have a lot of AI stuff going on.

31:29-33:02

[31:29] It's not practical for David, who's the ultimate Braintrust tastemaker, to look at everything manually. [31:37] And what I actually do is I run a shit ton of evals to try to quantitatively improve things. [31:44] And then when I feel like the evals are good and my own [31:48] less sophisticated palate thinks that the results are good, I will go to David and ask him for a vibe check. And I probably do that like once every few days. [31:56] And then David gives me the vibe check. And like half the time, he just completely destroys everything that I've said. Like, hey, you know, you think it's good, but it's actually not very good. And then what happens is I will go back and try to capture what David said. And I'll say like, [32:10] you know, hey, David actually thinks it's okay to show both languages as long as, you know, blah, blah, blah, blah, blah. And then I will, so I'll try to sort of capture David and then [32:22] improve the scores and then attempt to quantify David. And then the next time I go to him, I don't like repeat the same mistake, but I still get his vibe check. Well, and I just have to call out the meta thing here in this David story, which I love, which is I have a lot of people saying, wow, if I go so far as to turn my own taste or my own skills or my own expertise into a system, whether that system is like the David eval, the David-in-a-loop judge, or something else, [32:50] I'm functionally just building my own replacement. And I am presuming, because I do, and it sounds like you do too, you value David more in this system. Oh, yeah, yeah, yeah.

33:02-34:38

[33:02] We're able to have David's palate applied to more things. Like, [33:06] I think the quality bar that we're able to hit is higher because we're able to get more things to that bar. I love it. Okay, so this has been a powerhouse episode, one of my favorites. We've talked a lot about solving really technical problems with AI. [33:22] We've demystified evals a little bit for folks and shown how, [33:26] in a safe space, you can actually let AI, I think that's one of the meta themes of this, is in a safe space, you can let AI run [33:34] with a lot of autonomy and you'll throw a lot of data at it and you can get higher quality outcomes, much more so than if you were to manually fix things or even manually evaluate things. [33:45] I'm going to do a quick lightning round. [33:47] And then we'll get you back to, I mean, it's almost noon. So back. It's time to code. Time to code. [33:54] Code. [33:55] One, I have a question. When you say there is no excuse, there's no excuse for bugs, there's no excuse for [34:01] little design nits. There's no excuse for that. How do you feel like you practically... I maybe have two questions that you can answer. They'll be our two lightning round. [34:09] How do you practically manage the velocity to customers, [34:13] which is, do you ever get customers being like, wait, what's this? Wait, what's that? Like, [34:17] too many features just to consume as a customer. [34:20] And then two, how do you technically manage the throughput into the system? [34:25] Product building and code writing [34:27] now looks like carving rather than constructing. [34:31] So it's very fast to create something that has too many features and too many buttons and too much code.

34:38-36:09

[34:38] And you need to spend a lot of time removing stuff. [34:41] And so, [34:42] we actually, [34:43] I would say 90% of the time someone complains about something, we remove [34:48] the thing that was causing confusion and just make the system work better. Because we understand, [34:53] now that the person complained, [34:55] their point of view, and we're able to build a product that doesn't even need the complexity that led them to the confusion in the first place. I'll give you an example. [35:04] If you load a trace [35:05] and you imagine hitting Command-F, [35:08] you might in your brain think that that's just searching what's on the page. [35:11] But what's on the page might be [35:13] hundreds of megabytes of text, and it's virtualized, and it's across spans, and there's also a table. [35:19] So we had a very powerful search implementation that would search across the spans and rank everything and, you know, blah, blah, blah, all this cool stuff. [35:27] And then a lot of people complained and they were just like, why is this? You know, I just hit Command-F. I just wanted to show the thing. [35:34] And we've really simplified it over time. So I think we try to carve. And then in terms of technically managing it, [35:42] we spend a lot more time working on CI than we used to. [35:45] And so I think that a lot of platform effort has shifted so that if we are really good at CI, [35:53] then we're able to move faster. [35:55] And if we feel like we're constrained, [35:57] then instead of shipping a bunch of crappy stuff, we're like, okay, let's pause and improve CI so that we earn the ability to move faster. [36:05] Okay, again for the VP of Engineering in the back.

36:09-37:39

[36:09] Invest in CI. I've told everybody, they're like, how do I accelerate my engineering velocity with AI? I was like, fix your CI. [36:16] Yeah, yeah. Start there. [36:18] Every engineer is now building a platform. [36:22] And upon the platform, agents are doing the work that the engineers were doing manually. Right. And I think that applies to evals. Like, [36:30] if you're an engineering team and you're building an AI product, the number one job for you is to build a feedback loop. [36:37] Meaning you have a pipeline that allows you to summon from the ether of real-world data [36:43] and turn that into evals. And as an engineering team, that is your number one job. It is not prompt engineering. [36:49] It's not picking an agent framework. [36:51] It's not rewriting your database, whatever. It's creating that [36:55] pipeline. And the same is true. CI is that same [36:59] idea but applied to software engineering. Well, and I'll give one other tip, which is you think that those evals, people are always like, oh yeah, for my AI product I need that. [37:08] I have seen, again, I think the Intercom team [37:11] has run a bunch of evals on their internal use of Claude Code [37:15] to figure out [37:17] where engineers are hitting pain points, where people are giving up, where the [37:21] agents are asking for permissions that have to be escalated. And I think that sort of analysis on your team is very, very important and ultimately gets you to [37:31] these better outcomes. Okay, last question. You seem like a very reasoned person, so I'm presuming I'm going to get a very reasonable answer. [37:37] But I ask everybody,

37:40-39:12

[37:40] when one of your four tabs is not [37:44] doing what you want, [37:45] when the evals are failing the David test, [37:48] what is the [37:50] in-your-back-pocket prompting strategy that you rely on? Do you yell? [37:55] Do you bribe? Close the session, [37:58] and then I improve the evals, and then I try from scratch again. Yeah, yeah. This is a man who is on message. [38:07] Yeah, I mean, I'll give you like an example. We have this open source use case, sorry, a use case where we run open source models, and we're running like millions of tokens per second. It's very, very high scale. So every cent matters and every bit of optimization matters. [38:23] We are trying to change right now from Model A to Model B. [38:26] And I, again, I'm someone who [38:29] builds software to write evals. [38:30] I vibe coded [38:33] an eval script and it just was getting stuck. [38:37] And then I read the code and it's like 3,000 lines of [38:40] complete trash. And it had like all these scoring functions and all this crap and it was getting confused. [38:46] And so [38:47] on Saturday, I hand wrote, [38:50] like, no, [38:52] no co-pilot, no autocomplete. [38:55] I've been... [38:56] Partly to improve my own understanding of the problem, I hand wrote the eval. And then by the end of Sunday, the problem was solved. So you shut the session, you do it yourself. [39:05] Yeah. Just for the eval. Just for the eval. [39:08] Great. [39:08] This has been so great. Where can we find you and how can we be helpful?

39:12-40:09

[39:12] If you are interested in evals or you're trying to solve AI observability problems inside your company, [39:19] please check out Braintrust. We're at braintrust.dev, [39:23] at Braintrust on X, or I'm at A-N-K-R-G-Y-L. I'm very happy to chat. We're also hiring. If you like working on these problems and you like [39:31] maybe pushing the boundaries of rigor and stuff, and found this kind of stuff interesting, [39:37] we'd love to work with you. Well, thank you so much for joining. This was great. [39:41] It was a lot of fun. [39:51] You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.

Want to learn more?