For all those who missed out on London, see you in Miami next week! Notion, the knowledge work decacorn, has been building AI tooling since before ChatGPT, with many hits from Q&A in 2023 and unified AI in 2024 and Meeting Notes in 2025. At the end of their last Make user conference, Ryan Nystrom teased Notion 3.0’s Custom Agents - and they are finally embracing the Agent Lab playbook! Sarah Sachs and Simon Last of Notion join us for a deep dive into how Notion built Custom Agents, why it took years and multiple rebuilds to get right, and what it means to turn a productivity tool into an agent-native system of record for enterprise work. We go inside the product, engineering, evals, pricing, and org design decisions behind one of the most ambitious AI product efforts in software today — from early failed tool-calling experiments in 2022 to agent harnesses, progressive tool disclosure, meeting notes as data capture, and the long-term vision for software factories and agentic work. We discuss: * Sarah and Simon’s path to launching Notion Custom Agents, and why the feature was rebuilt four or five times before it was ready for production * Why early agent attempts failed: no tool-calling standard, short context windows, unreliable models, and too much complexity exposed to the model * The “Agent Lab” thesis: not just wrapping a model, but understanding how people collaborate and building the right product system around frontier capabilities * How Notion thinks about roadmap timing: not swimming upstream against model limitations, but also building early enough that the product is ready when the models are * Why coding agents feel like the kernel of AGI, and how Notion is thinking about “software factories” made up of agents that spec, code, test, debug, review, and maintain codebases together * How Sarah runs AI engineering at Notion (“notes from Token Town”): objective-setting over idea ownership, low-ego teams comfortable deleting their own work, and a culture designed to swarm around fast-changing opportunities * The “Simon Vortex,” company hackathons, and why security gets pulled in early rather than late * How Notion organizes AI: core AI capabilities and infrastructure, product packaging teams, and a broader company mandate that every product surface must increasingly work for both humans and agents * Why prototypes have become much easier to build internally, and how “demos over memos” changes product development inside a tool the whole company already uses every day * Notion’s eval philosophy: regression tests, launch-quality evals, and “frontier/headroom” evals that intentionally only pass ~30% of the time so the company can see where model capabilities are going * What a “Model Behavior Engineer” is, and why Notion treats eval writing, failure analysis, and model understanding as a distinct function rather than just software engineering * The changing role of software engineers in the age of coding agents, and why the new job looks less like typing code and more like supervising a rigorous outer system of agents, PRs, and verification loops * How the “software factory” should work: specs, self-verification, bug flows, subagents, and minimizing human intervention while preserving the invariants that matter * A live walkthrough of a Notion Custom Agent handling coworking space tenant applications by triaging email, enriching applicants with web search, and writing structured data into a Notion database * How agents compose inside Notion: shared databases as primitives, agents invoking 
other agents, “manager agents” supervising dozens of specialized agents, and memory implemented simply as pages and databases * Notion’s take on MCP vs CLI: why Simon is bullish on CLI’s self-debugging nature, where MCP still makes sense, and how Sarah thinks about capability, determinism, permissioning, and pricing alignment * The evolution of Notion’s internal agent harness: from early JavaScript coding agents, to custom XML, to Markdown and SQL-like abstractions, to tool definitions, progressive disclosure, and a much shorter system prompt * Why Notion cares about teaching “the top of the class,” building for sophisticated operators rather than abstracting away too much capability for everyone * How agent setup works today: agents that can configure themselves, inspect their own failures, and edit their own instructions — with guardrails around permissions * How Notion prices Custom Agents: credits as an abstraction over tokens, model type, serving tier, web search, and future sandbox costs; why usage-based pricing was necessary; and how “auto” tries to match the right model to the right task * Why Notion is not eager to train a foundation model, where they do fine-tune and optimize today, and why retrieval/ranking is one of the most important investment areas as more searches come from agents rather than humans * Why Meeting Notes became one of Notion’s strongest growth loops: not just as transcription, but as high-signal data capture that powers search, custom agents, follow-up workflows, and the broader system of record for company collaboration * Why Notion is more interested in being the place where collaboration data lives than in building hardware themselves — and how wearables or other capture devices may eventually feed into that system Sarah SachsLinkedIn: https://www.linkedin.com/in/sarahmsachsX: https://x.com/sarahmsachs Simon LastLinkedIn: https://www.linkedin.com/in/simon-last-41404140X: https://x.com/simonlast Full Video Episode Timestamps * 00:00:00 Introduction and launching Notion Custom Agents * 00:01:17 Why Notion rebuilt agents four or five times * 00:03:35 Building for where models are going, not just where they are * 00:05:32 The Agent Lab thesis, wrappers, and product intuition * 00:08:07 User journeys, leadership, and low-ego AI teams * 00:13:16 The Simon Vortex, hackathons, and bringing security in early * 00:16:39 Team structure, demos over memos, and building for agents * 00:20:25 Evals, Notion’s Last Exam, and the Model Behavior Engineer role * 00:27:37 Evals as an agent harness and the changing role of software engineers * 00:30:42 The software factory: specs, verification, and agent workflows * 00:32:18 Live demo: a custom agent for coworking space applications * 00:35:08 Composing agents, manager agents, and memory as pages * 00:38:15 Notion Mail, Gmail, native integrations, and tools * 00:39:43 MCP vs CLI and the cost of capability * 00:44:13 When Notion uses MCP vs building its own integrations * 00:47:43 The history of Notion’s agent harness rebuilds * 00:55:35 Power users, public tools, and the setup agent * 00:58:01 Self-fixing agents, permissions, and “flippy” * 01:01:13 Pricing, credits, and choosing the right model automatically * 01:09:01 Why Notion isn’t training its own frontier model * 01:14:07 Retrieval, ranking, and search built for agents * 01:17:27 Meeting Notes as data capture and workflow automation * 01:21:18 Wearables, hardware, and Notion as the system of record * 01:23:45 Outro Transcript [00:00:00] Alessio: Hey everyone. 
Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I’m joined by swyx, editor of Latent Space. [00:00:11] swyx: Hello. Hello. We’re back in the beautiful studio that, uh, Alessio has set up for us with Simon and Sarah from Notion. Welcome. [00:00:18] Sarah Sachs: Thanks for having us. [00:00:19] Alessio: Thanks for having us. Yeah. [00:00:20] swyx: Congrats on the recent launch of Custom Agents, it’s finally here. How’s it feel? [00:00:26] Sarah Sachs: We ship things slowly. So it had been in alpha for a little bit, and at the point at which it’s in alpha, um, there’s a group of people making sure it’s ready for prod, and then there’s a group of people working on the next thing. So sometimes some of these launches are a bit of delayed satisfaction, so it’s quite nice to remind yourself of all the work you did, because we do have a habit of, like, being two or three milestones ahead. Uh, just ‘cause you have to be, you know, you can’t get complacent. Um, but it’s been great that people understood how this is helpful. And I think that’s just easier in general building AI tools today than it was two, three years ago. People kind of get it, and so that user education, um, there’s just, it was our most successful launch in terms of free trials and converting people and things like that. It was really successful, so yeah. But there’s a lot to build. [00:01:12] swyx: Making it free for three months helps. [00:01:16] Sarah Sachs: Yep. [00:01:17] Simon Last: It was definitely super exciting for me because it’s probably the fourth or fifth time that we rebuilt that. [00:01:22] swyx: Yes. [00:01:23] Simon Last: And I mean, [00:01:24] swyx: you’ve been building this since, like, 2022. [00:01:26] Simon Last: Yeah, I mean, like, it was even right when we got access to GPT-4 in late 2022, one of the first ideas we had was like, oh, okay, let’s make an agent, though we used the word assistant at the time, there wasn’t really the word agent yet, but, oh, we’ll give it access to all the tools that Notion can do, and then it’ll run in the background and, like, do work for us. And we just tried that many times, and it just was too early. Um, [00:01:48] swyx: I need to force you to, like, double click on that. What was too early? What didn’t work? [00:01:52] Sarah Sachs: Before function calling came out, we were trying to fine-tune, with the frontier labs and with Fireworks, a function-calling model on Notion functions. This is right when I joined. I joined because, um, we needed a manager, as Simon needed to be able to go on vacation. So, uh, that’s, that’s around when I joined, so you can speak much more to it. [00:02:11] Simon Last: Yeah, we did partnerships wit
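Of the harness details mentioned in the show notes above, progressive tool disclosure is concrete enough to sketch. Below is a minimal illustration of the idea as we understand it from the conversation: the harness advertises a small core toolset plus a discovery tool, and pulls more tool definitions into context only when the model asks. All names and schemas here are hypothetical, not Notion's actual harness.

```python
# A minimal sketch of progressive tool disclosure. Illustrative only:
# tool names and schemas are invented, not Notion's real definitions.
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    schema: dict

# The small core set that is always in context, keeping the system prompt short
CORE_TOOLS = {
    "search": Tool("search", "Search the workspace", {"query": "string"}),
    "list_tools": Tool("list_tools", "Discover more tools by topic", {"topic": "string"}),
}

# Extended capability, loaded into context only on demand
EXTENDED_TOOLS = {
    "database": [Tool("db_insert", "Insert a row into a database", {"db_id": "string", "row": "object"})],
    "email": [Tool("email_triage", "Triage an inbox", {"inbox_id": "string"})],
}

def visible_tools(disclosed: set[str]) -> list[Tool]:
    """Tools the model can currently see: the core set plus anything disclosed."""
    tools = list(CORE_TOOLS.values())
    for topic in disclosed:
        tools += EXTENDED_TOOLS.get(topic, [])
    return tools

def handle_tool_call(name: str, args: dict, disclosed: set[str]) -> str:
    # The discovery tool mutates what the model sees on the next turn,
    # so capability arrives only when it is actually needed.
    if name == "list_tools":
        disclosed.add(args["topic"])
        unlocked = [t.name for t in EXTENDED_TOOLS.get(args["topic"], [])]
        return f"Unlocked tools: {unlocked}"
    return f"(dispatch {name} with {args})"
```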
We’re proud to release this ahead of Ryan’s keynote at AIE Europe. Hit the bell to get notified when it is live! Attendees: come prepped for Ryan’s AMA with Vibhu after. Move over, context engineering. Now it’s time for harness engineering and the age of the token billionaires. Ryan Lopopolo of OpenAI is leading that charge, recently publishing a lengthy essay on Harness Eng that has become the talk of the town. In it, Ryan peeled back the curtain on how the recently announced OpenAI Frontier team have become OpenAI’s top Codex users, running a >1M LOC codebase with zero human-written code and, crucially for the Dark Factory fans, no human REVIEWED code before merge. Ryan is admirably evangelical about this, calling it borderline “negligent” if you aren’t using >1B tokens a day (roughly $2-3k/day in token spend based on market rates and caching assumptions): Over the past five months, they ran an extreme experiment: building and shipping an internal beta product with zero manually written code. Through the experiment, they adopted a different model of engineering work: when the agent failed, instead of prompting it better or telling it to “try harder,” the team would ask “what capability, context, or structure is missing?” The result was Symphony, “a ghost library” and reference Elixir implementation (by Alex Kotliarskyi) that sets up a massive system of Codex agents, all extensively prompted with the specificity of a proper PRD spec, but without full implementation. The future starts taking shape as one where coding agents stop being copilots and start becoming real teammates anyone can use, and Codex is doubling down on that mission with their Super Bowl messaging of “you can just build things”. Across Codex, internal observability stacks, and the multi-agent orchestration system his team calls Symphony, Ryan has been pushing what happens when you optimize an entire codebase, workflow, and organization around agent legibility instead of human habit. We sat down with Ryan to dig into how OpenAI’s internal teams actually use Codex, why the real bottleneck in AI-native software development is now human attention rather than tokens, how fast build loops, observability, specs, and skills let agents operate autonomously, why software increasingly needs to be written for the model as much as for the engineer, and how Frontier points toward a future where agents can safely do economically valuable work across the enterprise.
We discuss: * Ryan’s background from Snowflake, Brex, Stripe, and Citadel to OpenAI Frontier Product Exploration, where he works on new product development for deploying agents safely at enterprise scale * The origin of “harness engineering” and the constraint that kicked off the whole experiment: Ryan deliberately refused to write code himself so the agent had to do the job end to end * Building an internal product over five months with zero lines of human-written code, more than a million lines in the repo, and thousands of PRs across multiple Codex model generations * Why early Codex was painfully slow, and how the team learned to decompose tasks, build better primitives, and gradually turn the agent into a much faster engineer than any individual human * The obsession with fast build times: why one minute became the upper bound for the inner loop, and how the team repeatedly retooled the build system to keep agents productive * Why humans became the bottleneck, and how Ryan’s team shifted from reviewing code directly to building systems, observability, and context that let agents review, fix, and merge work autonomously * Skills, docs, tests, markdown trackers, and quality scores as ways of encoding engineering taste and non-functional requirements directly into context the agent can use * The shift from predefined scaffolds to reasoning-model-led workflows, where the harness becomes the box and the model chooses how to proceed * Symphony, OpenAI’s internal Elixir-based orchestration layer for spinning up, supervising, reworking, and coordinating large numbers of coding agents across tickets and repos * Why code is increasingly disposable, why worktrees and merge conflicts matter less when agents can resolve them, and what it really means to fully delegate the PR lifecycle * “Ghost libraries”, spec-driven software, and the idea that a coding agent can reproduce complex systems from a high-fidelity specification rather than shared source code * The broader future of Frontier: safely deploying observable, governable agents into enterprises, and building the collaboration, security, and control layers needed for real-world agentic work Ryan Lopopolo * X: https://x.com/_lopopolo * LinkedIn: https://www.linkedin.com/in/ryanlopopolo/ * Website: https://hyperbo.la/contact/ Timestamps * 00:00:00 Introduction: Harness Engineering and OpenAI Frontier * 00:02:20 Ryan’s background and the “no human-written code” experiment * 00:08:48 Humans as the bottleneck: systems thinking, observability, and agent workflows * 00:12:24 Skills, scaffolds, and encoding engineering taste into context * 00:17:17 What humans still do, what agents already own, and why software must be agent-legible * 00:24:27 Delegating the PR lifecycle: worktrees, merge conflicts, and non-functional requirements * 00:31:57 Spec-driven software, “ghost libraries,” and the path to Symphony * 00:35:20 Symphony: orchestrating large numbers of coding agents * 00:43:42 Skill distillation, self-improving workflows, and team-wide learning * 00:50:04 CLI design, policy layers, and building token-efficient tools for agents * 00:59:43 What current models still struggle with: zero-to-one products and gnarly refactors * 01:02:05 Frontier’s vision for enterprise AI deployment * 01:08:15 Culture, humor, and teaching agents how the company works * 01:12:29 Harness vs.
training, Codex model progress, and “you can just do things” * 01:15:09 Bellevue, hiring, and OpenAI’s expansion beyond San Francisco Transcript Ryan Lopopolo: I do think that there is an interesting space to explore here with Codex, the harness, as part of building AI products, right? There’s a ton of momentum around getting the models to be good at coding. We’ve seen big leaps in, like, the task complexity with each incremental model release, where if you can figure out how to collapse a product that you’re trying to build, a user journey that you’re trying to solve, into code, it’s pretty natural to use the Codex harness to solve that problem for you. It’s done all the wiring and lets you just communicate in prompts. To let the model cook, you have to step back, right? Like, you need to take a systems thinking mindset to things and constantly be asking: where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I’m putting in place, so I have solved this part of the SDLC. swyx: [00:01:00] All right. [00:01:03] Meet Ryan swyx: We’re in the studio with Ryan from OpenAI. Welcome. Ryan Lopopolo: Hi. swyx: Thanks for visiting San Francisco and thanks for spending some time with us. Ryan Lopopolo: Yeah, thank you. I’m super excited to be here. swyx: You wrote a blockbuster article on harness engineering. It’s probably going to be the defining piece of this emerging discipline, huh? Ryan Lopopolo: Thank you. It’s been fun to feel like we’ve defined the discourse in some sense. swyx: Let’s contextualize a little bit, this is the first podcast you’ve ever done. Yes. And thank you for spending time with us. Where is this coming from? What team are you in, all that jazz? Ryan Lopopolo: Sure, sure. Ryan Lopopolo: I work on Frontier Product Exploration, new product development in the space of OpenAI Frontier, which is our enterprise platform for deploying agents safely at scale, with good governance, in any business. And the role of my team has been to figure out novel ways to deploy our models into packaged products that we can sell as solutions to enterprises. swyx: And you have a background, I’ll just squeeze it in there. Snowflake, Brex, [00:02:00] Stripe, Citadel. Ryan Lopopolo: Yes. Yes. swyx: The same kind of customer your entire life. Yes. The exact kind of customer that you want to, Vibhu: So I’ll say, I actually didn’t expect the background when I looked at your Twitter, I’m seeing the opposite. Stuff like this. So you’ve got the mindset of, like, full-send AI, coding, stuff about slop, like buckling in your laptop in your Waymos. Yes. And then I look at your profile, I’m like, oh, you’re on the other end too. Oh, perfect. Makes perfect sense. Ryan Lopopolo: It’s quite fun to be an AI maximalist, and if you’re gonna live that persona, OpenAI is the place to do it. swyx: Token billionaire, as you say. Ryan Lopopolo: Yeah. Certainly helps that we have no rate limits internally. And I can go, like you said, full send at this stage. swyx: Yeah. Yeah. So, Frontier, and you’re a special team within OpenAI Frontier. Ryan Lopopolo: We had been given some space to cook, which has been super, super exciting. [00:02:47] Zero Code Experiment Ryan Lopopolo: And this is why I started with kind of an out-there constraint to not write any of the code myself.
I was figuring, if we’re trying to make agents that can be deployed into enterprises, they should be [00:03:00] able to do all the things that I do. And having worked with these coding models, these coding harnesses over 6, 7, 8 months, I do feel like the models are there enough, the harnesses are there enough, where they’re isomorphic to me in capability and the ability to do the job. So starting with this constraint that I can’t write the code meant that the only way I could do my job was to get the agent to do my job. Vibhu: And, like, just a bit of background before that. This is basically the article. So what you guys did is five months of working on an in
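The one-minute inner-loop budget from the bullets above is a discipline any team can borrow. Here is a toy sketch, not OpenAI's actual tooling, of what enforcing it could look like: treat a build that blows the budget as a failure, because a slow build quietly degrades every agent iteration downstream.

```python
# A toy guard (not OpenAI tooling) for a "one-minute inner loop" discipline.
import subprocess
import sys
import time

BUDGET_SECONDS = 60  # the upper bound the episode describes for the inner loop

def timed_build(cmd: list[str]) -> None:
    start = time.monotonic()
    result = subprocess.run(cmd)  # run the real build command
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        sys.exit(result.returncode)
    if elapsed > BUDGET_SECONDS:
        # Fail loudly so the build system gets retooled, not quietly tolerated
        print(f"build took {elapsed:.1f}s, over the {BUDGET_SECONDS}s budget")
        sys.exit(1)

if __name__ == "__main__":
    timed_build(sys.argv[1:] or ["make", "build"])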
Fresh off raising a monster $15B, Marc Andreessen has lived through multiple computing platform shifts firsthand, from Mosaic and Netscape to cofounding A16z. In this episode, Marc joins swyx and Alessio in a16z’s legendary Sand Hill Road office to argue that AI is not just another hype cycle, but the payoff of an “80-year overnight success”: from neural nets and expert systems to transformers, reasoning models, coding, agents, and recursive self-improvement. He lays out why he thinks this moment is different, why AI is finally escaping the old boom-bust pattern, and why the real bottleneck may be less about models than about the messy institutions, incentives, and social systems that struggle to absorb technological change. This episode was a dream come true for us, and many thanks to Erik Torenberg for the assist in setting this up. Full episode on YouTube! We discuss: * Marc’s long view on AI: from the 1980s AI boom and expert systems to AlexNet, transformers, and why he sees today’s moment as the culmination of decades of compounding technical progress * Why “this time is different”: the jump from LLMs to reasoning, coding, agents, and recursive self-improvement, and why Marc thinks these breakthroughs make AI real in a way prior cycles were not * AI winters vs. “80-year overnight success”: why the field repeatedly swings between utopianism and doom, and why Marc thinks the underlying researchers were mostly right even when the timelines were wrong * Scaling laws, Moore’s Law, and what to build: why he believes AI scaling laws will continue, why the outside world is messier than lab purists assume, and how startups can still create durable value on top of rapidly improving models * The dot-com crash and AI infrastructure risk: Marc’s comparison between today’s AI capex boom and the fiber/data-center overbuild of 2000, plus why he thinks this cycle is different because the buyers are huge cash-rich incumbents and demand is already here * Why old NVIDIA chips may be getting more valuable: the pace of software progress, chronic capacity shortages, and the idea that even current models are “sandbagged” by supply constraints * Open source, edge inference, and the chip bottleneck: why Marc thinks local models, Apple Silicon, privacy, trust, and economics all point toward a major role for edge AI * American vs. 
Chinese open source AI: DeepSeek as a “gift to the world,” why open models matter not just because they’re free but because they teach the world how things work, and how open source strategies may shift as the market consolidates * Why Pi and OpenClaw matter so much: Marc’s claim that the combination of LLM + shell + filesystem + markdown + cron loop is one of the biggest software architecture breakthroughs in decades * Agents as the new “Unix”: how agent state living in files allows portability across models and runtimes, and why self-modifying agents that can extend themselves may redefine what software even is * The future of coding and programming languages: why Marc thinks software becomes abundant, why bots may translate freely across languages, and why “programming language” itself may stop being a salient concept * Browsers, protocols, and human readability: lessons from Mosaic and the web, why text protocols and “view source” mattered, and how similar principles may shape AI-native systems * Real-world OpenClaw use: health dashboards, sleep monitoring, smart homes, rewriting firmware on robot dogs, and why the most aggressive users are discovering both the power and danger of agents first * Proof of human vs. proof of bot: why Marc thinks the internet’s bot problem is now unsolvable via detection alone, and why biometric + cryptographic proof of human becomes necessary Timestamps * 00:00 Marc on AI’s “80-Year Overnight Success” * 00:01 A Quick Message From swyx * 01:44 Inside a16z With Marc Andreessen * 02:13 The Truth About a16z’s AI Pivot * 03:29 Why This AI Boom Is Not Like 2016 * 06:33 Marc on AI Winters, Hype Cycles, and What’s Different Now * 10:09 Reasoning, Coding, Agents, and the New AI Breakthroughs * 12:13 What Founders Should Build as Models Keep Improving * 16:33 AI Capex, GPU Shortages, and the Dot-Com Crash Analogy * 24:54 Open Source AI, Edge Inference, and Why It Matters * 33:03 Why OpenClaw and PI Could Change Software Forever * 41:37 Agents, the End of Interfaces, and Software for Bots * 46:47 Do Programming Languages Even Have a Future? * 54:19 AI Agents Need Money: Payments, Crypto, and Stablecoins * 56:59 Proof of Human, Internet Bots, and the Drone Problem * 01:06:12 AI, Management, and the Return of Founder-Led Companies * 01:12:23 Why the Real Economy May Resist AI Longer Than Expected * 01:15:53 Closing Thoughts Transcript Marc: Something about AI that causes the people in the field, I would say, to become both excessively utopian and excessively apocalyptic. Having said that, I think what’s actually happened is an enormous amount of technical progress that built up over time. And, for example, we now know that neural networks are the correct architecture. And I, I will tell you, there was a 60-year run, or even 70 years, where that was controversial.
And so, so the way I think about what’s happening is basically, the, the period we’re in right now, I call it the 80-year overnight success, right? Which is like, it’s an overnight success ‘cause it’s like bam, you know, ChatGPT hits, and then o1 hits, and then, you know, OpenClaw hits, and like, you know, these are, these are like radical, overnight transformative successes, but they’re drawing on an 80-year sort of wellspring backlog, you know, of ideas and thinking. It’s not just that it’s all brand new, it’s that it’s an unlock of all of these decades of, like, very serious, hardcore research. If I were 18, this is what I would be spending all of my time on. This is like such an incredible conceptual breakthrough. swyx: Before we get into today’s episode, I just have a small message for listeners. Thank you. We will not be able to bring you the AI, engineering, science, and entertainment content that you so clearly want if you didn’t choose to also click in and tune into our content. We’ve been approached by sponsors on an almost daily basis, but fortunately enough of you actually subscribed to us to keep all this sustainable without ads, and we wanna keep it that way. But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It’s the only thing I’ll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week. If you do it, I promise we will never stop working to make the show even better. Now, let’s get into it. Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I’m joined by swyx, editor of Latent Space. swyx: Hello. And we’re at a16z with, uh, Marc Andreessen. Welcome. Marc: Yes, yes. A and what, half of 16? Something like that. A one. Exactly, swyx: exactly. Uh, apparently this is the, the final few days in your, your current office. You’re moving across the road. Marc: Uh, we’re, yeah. We have some projects underway, but yeah, this is actually, oh, this is the original. We’re actually in the original office. We’re in the, we’re in the whole thing. swyx: It’s beautiful. Yeah. Great. Marc: Thank you. swyx: So I have to come out, uh, you know, I wanted to pick a spicy start. In October 2022, I just made friends with Roon and, uh, I wanted to give him something to sort of be spicy about. And I said, uh, it’ll never not be funny that a16z was constantly going “the future is where the smart people choose to spend their time” and then going deep into crypto and not into AI. And that was on October 22nd, 2022. And Roon says there was an internal meeting at a16z to reorient around gen AI. Obviously you have, but was there a meeting? What, what was that? Marc: I mean, I don’t, look, I’ve been doing AI since the late eighties. swyx: Yeah. Marc: So I, I don’t know, like, as far as I’m concerned, this stuff is all johnny-come-lately. Yeah. You, I mean, look, we’ve been doing AI our entire existence. I mean, we’ve been doing AI, machine learning, deep learning, you know, deeply. We’ve been doing this stuff from the beginning. Obviously AI is just core to computer science. I, I actually view them as, like, quite, uh, quite continuous.
Um, you know, Ben and I both have computer science degrees. Um, you know, we, we both, Ben and I are actually both old enough to remember the actual AI boom in the 1980s. Yeah. There was like a, there was a big AI boom at the time. Um, and there were names like expert systems. Um, and there was, like, Lisp and Lisp machines. Uh, I, I coded in Lisp. I was coding in Lisp in 1989, when that was the, the language of the AI future. Um, yeah. So this is something that we’re, like, completely comfortable with, have been doing the whole time, and are very enthusiastic about. swyx: Is there a strong, like, “this time is different”? Because, uh, my closest analog was 2016-17. It was an AI boom. Mm-hmm. And it petered out very, very quickly. Um, it just, it just, in terms of investing, Marc: Sort of, sort of. swyx: Yeah. Investment, investment excitement. Marc: Although that’s really when the, the, the Nvidia phenomenon really, it was, I would say it was in that period, at the time the vocabulary was more machine learning, but it, it was very clear at that time that machine learning was hitting some sort of takeoff p
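Marc's "LLM + shell + filesystem + markdown + cron loop" claim from the show notes above is worth making concrete. Here is a toy sketch of that architecture under our own assumptions: `call_llm` is a hypothetical stub standing in for any chat-completion API, and the prompt protocol is invented for illustration. The point it demonstrates is the portability argument: the agent's entire state is a plain markdown file that any model or runtime could pick up and resume.

```python
# A toy sketch (not Pi/OpenClaw code) of the architecture Marc describes:
# an LLM wired to a shell and a filesystem, with agent state kept in a
# markdown file and the whole loop driven by cron.
import subprocess
from pathlib import Path

STATE = Path("agent_state.md")  # agent memory lives in a plain markdown file

def call_llm(prompt: str) -> str:
    """Hypothetical stub: swap in any chat-completion API call here."""
    return "SHELL: echo hello from the agent\n\n# Agent state\nSaid hello.\n"

def run_shell(cmd: str) -> str:
    # The model's only actuator: run a shell command, capture the output
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
    return out.stdout + out.stderr

def tick() -> None:
    """One cron tick: read state, let the model act once, persist new state."""
    state = STATE.read_text() if STATE.exists() else "# Agent state\n(empty)\n"
    reply = call_llm(
        "You are an agent. Your memory is the markdown below. "
        "Respond with SHELL: <command> to act, then an updated markdown memory.\n\n"
        + state
    )
    if "SHELL:" in reply:
        cmd = reply.split("SHELL:", 1)[1].splitlines()[0].strip()
        reply += "\n\n## Last command output\n" + run_shell(cmd)
    STATE.write_text(reply)  # state is a file: any model/runtime can resume it

# Driven by cron, e.g.:  */15 * * * *  python agent.py
if __name__ == "__main__":
    tick()
```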
We’ve been on a bit of a mini World Models series over the last quarter: from introducing the topic with Yi Tay, to exploring Marble with World Labs’ Fei-Fei Li and Justin Johnson, to previewing World Models learned from massive gaming datasets with General Intuition’s Pim de Witte (who has now written down their approach to World Models with Not Boring), to discussing the Cosmos World Model with Andrew White of Edison Scientific on our new Science pod, to writing up our own theses on Adversarial World Models. Meanwhile Nvidia, Waymo and Tesla have published their own approaches, Google has released Genie 3, and Yann LeCun has raised $1B for AMI and published LeWorldModel. Today’s guests have a radically different approach to World Modeling from every player we just mentioned — while Genie 3 is impressive, its many flaws demonstrate the issues with that approach: terrain clipping, non-interactivity (single player, no physics, no objects other than the player move), and a maximum of 60 seconds of immersion. Moonlake AI (inspired by the Dreamworks logo) is the diametric opposite - immediately multiplayer, incredibly interactive, indefinite lifetime, capable of MANY different kinds of world models by simulating environments, predicting outcomes, and planning over long horizons. This is enabled by bootstrapping from game engines and training custom agents: In Towards Efficient World Models, Chris Manning and Ian Goodfellow join Fan-Yun in explaining why their approach of efficiency through structure and causality, instead of just blind scaling, is sorely needed: SOTA models still show physical or spatial understanding glitches, such as solid objects floating in mid-air or moving “inside” other solid objects. If the goal is to plan for the next action, how often is a high-resolution pixel view necessary for modeling the world? Our bet is that there is a disproportionately large share of economically valuable tasks where such detail is not required. After all, humans with a wide variety of sensory limitations have little difficulty doing almost everything in the world. Furthermore, for a large number of purposes, describing a scene or a situation in a few words of language (“the car’s tires squealed as it cornered sharply”) is sufficient for understanding and planning. Experiments also show that humans only partially process visual input in a top-down, task-directed way, often making use of abstracted object-level modeling. In almost all cases, partial representations combined with semantic understanding are sufficient. …If the goal is to facilitate the understanding of causality in multimodal environments, then the world model—whether it is used in the virtual world or the physical world—must prioritize properties such as spatial and physical state consistency maintained over long time periods, and an ability to evolve the world that accurately reflects the consequences of actions. That’s what Moonlake is building. Game engines are the right starting-point abstraction for efficiently extracting causal relationships, and Moonlake is building the interfaces and community (including their new $30,000 Creator Cup) to kickstart the flywheel of actions-to-observations. We were fortunate enough to attend their sessions at GDC 2026 (the Mecca of Game Devs), and were impressed by the huge variety and flexibility of the worlds people were building with Moonlake’s tools already! Live videos on the pod. Full Video Pod on YouTube!
Timestamps * 00:00 Benchmarking Gets Hard * 00:47 Meet Moonlake Founders * 01:26 Why Build World Models * 03:12 Structure Not Just Scale * 05:37 Defining Action Conditioned Worlds * 07:32 Abstraction Versus Bitter Lesson * 14:39 Language Versus JEPA Debate * 20:27 Reasoning Traces And Rendering Layer * 37:00 Gameplay Over Graphics * 38:02 Fiction Rules And World Tweaks * 39:15 Code Engines Beat Learned Priors * 41:10 Diffusion Scaling Limits * 43:23 Symbolic Versus Diffusion Boundary * 46:14 Platform Vision Beyond Games * 50:24 Spatial Audio And Multimodal Latents * 54:23 NLP Roots, Hiring, And Moonlake Name Transcript [00:00:00] Cold Open [00:00:00] Chris Manning: I think this whole space is extremely difficult as things are emerging now. And I mean, it’s not only for world models, I think it’s for everything, including text-based models, right? ‘Cause in the early days it seemed very easy to have good benchmarks, ‘cause we could do things like question answering benchmarks. [00:00:20] But these days so much of what people are wanting to do is nothing like that, right? You’re wanting to get some recommendations about which backpack would be best for you for your trip in Europe next month. It’s not so easy to come up with a benchmark, and it’s the same problem with these world models. [00:00:41] Meet the Founders [00:00:41] swyx: Okay. We’re back in the studio with Moonlake’s two leads, I guess there are other founders as well, but: Fan-Yun Sun and Chris Manning. Welcome to the studio. [00:00:54] Fan-Yun Sun: Thanks. Thanks, Chris. Thanks for having us. [00:00:56] swyx: You guys have come burst onto the scene with a really refreshing [00:01:00] new take on world models. [00:01:01] I just want to, I guess, ask how the two of you came together. Chris, you’re a legend in NLP and just AI in general. You’re, you’re his grad student, I guess [00:01:10] Fan-Yun Sun: Actually my co-founder. [00:01:11] swyx: Oh, yeah. [00:01:12] Fan-Yun Sun: I should give a lot of credit to my co-founder, Sharon. Yeah. She was, she was actually working with Professor Fei-Fei Li, and then she ended up working with Ron and Chris Manning here. [00:01:22] And then, so I got connected to Chris initially, actually, through my co-founder. [00:01:26] What is Moonlake? [00:01:26] swyx: What is Moonlake? Actually, I’m also very curious about the name, but why go into world models? [00:01:33] Fan-Yun Sun: So I was working a lot, actually, with Nvidia research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. [00:01:44] And then there were two observations, one in academia and one in industry. In industry, folks at Nvidia are actually paying a lot of dollars to purchase these types of interactive worlds, whether it’s for the sake of evaluation or training the robots, or policies, or models. And [00:02:00] then, in academia, the same thing is happening. [00:02:02] And more specifically, when I was actually working with Nvidia on the synthetic data foundation model training project, we were generating a lot of this synthetic data and showing that, hey, this synthetic data is actually as useful as real-world data when it comes to multimodal pre-training. [00:02:16] But then, like I said, there’s a lot of dollars being paid out to, like, external vendors or other folks to manually curate these types of data.
It was very clear to us that, okay, on our way to, let’s call it, embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data, and the demand for those types of data is growing exponentially. [00:02:38] But everybody’s sort of thinking about it from a pure, say, video generation perspective or something else. But we feel like the true opportunity is actually building reasoning models that can do these things, like how humans do these things today. So that’s a little bit on the genesis of Moonlake, and I think the reason I got into world models was partly [00:02:59] a philosophical [00:03:00] take on the world, where I, like, believe the simulation theory and stuff like that. But on the other hand, it’s really just like, oh, there’s an opportunity there, and I feel like nobody’s doing it the way I think it should be done. [00:03:10] Structure, Not Scale: The Vision [00:03:10] Chris Manning: I can say a little bit about that. [00:03:12] Yeah. So the overall goal is the pursuit of artificial intelligence, and most of my career has been doing that in the language space, and that’s been just extremely productive. As we all know the story of the last few years, I don’t have to tell you about how much we’ve achieved with large language models. But, uh, [00:03:31] although they have been extremely effective for ramping up language and general intelligence, it’s clearly not the whole world. There’s this multimodal world of vision, sound, taste that you’d like to be dealing with, more than just language. And then the question is how to do it. And despite a huge investment in the computer vision space, right, as a research field computer [00:04:00] vision has been, for decades, far, far larger than the language space, actually, [00:04:05] I think it’s fair to say that vision understanding sort of stalled out, right? You got to object recognition, and then progress just wasn’t being made, right? If you look at any of these vision language models, it’s the language that’s doing 90% of the work and the vision barely works. And so there’s really an interesting research question as to why that is, and at heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection to a more symbolic layer of abstracted understanding of visual domains, which isn’t in the mainstream vision models, which are still trying to operate on the surface level of pixels. [00:04:50] swyx: I think in one of your blog posts, you put it as structure, not scale. Is that a general thesis? [00:04:57] Chris Manning: Yeah. Well, scale is good too. [00:04:58] swyx: Yeah. Scale is good too. [00:04:59] Chris Manning: [00:05:00] Lots of data is good as well, and scale, but nevertheless, you want the structure, yeah, to be able to much more efficiently learn. [00:05:07] swyx: Yeah. The other thing I really liked also is you put
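The "action-conditioned world" framing that runs through this episode reduces to a small interface. Below is a purely illustrative Python sketch under our own assumptions (Moonlake's actual system bootstraps these dynamics from game engines): evolve an abstract, not-necessarily-pixel state by an action, and plan by querying consequences.

```python
# A minimal sketch of an action-conditioned world model interface.
# Illustrative only; all names are ours, not Moonlake's.
from typing import Any, Callable, Iterable, Protocol

class WorldModel(Protocol):
    def step(self, state: Any, action: Any) -> Any:
        """Predict the next state: the consequences of an action."""
        ...
    def render(self, state: Any) -> Any:
        """Optional high-fidelity view; planning may never need to call this."""
        ...

def plan(model: WorldModel, state: Any, actions: Iterable[Any],
         score: Callable[[Any], float]) -> Any:
    # Choose the action whose predicted next state scores best. Planning here
    # touches only state transitions, never pixels, echoing the essay's
    # question of how often a high-resolution view is actually necessary.
    return max(actions, key=lambda a: score(model.step(state, a)))
```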
Mistral has been on an absolute tear - with frequent successful model launches it is easy to forget that they raised the largest European AI round in history last year. We were long overdue for a Mistral episode, and we were very fortunate to work with Sophia and Howard to catch up with Pavan (Voxtral lead) and Guillaume (Chief Scientist, Co-founder) on the occasion of this week’s Voxtral TTS launch: Mistral can’t directly say it, but the benchmarks do imply that this is basically an open-weights ElevenLabs-level TTS model (technically, it is a 4B Ministral-based multilingual low-latency open-weights TTS model with a 68.4% win rate vs ElevenLabs Flash v2.5). The contributions are not just in the open weights but also in open research: we also spend a decent amount of the pod talking about their architecture, which combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens (typically only applied in the image generation space, as seen in the Flow Matching NeurIPS workshop from the principal authors that we reference in the pod). You can catch up on the paper here, and the full episode is live on YouTube! Timestamps * 00:00 Welcome and Guests * 00:22 Announcing Voxtral TTS * 01:41 Architecture and Codec * 02:53 Understanding vs Generation * 05:39 Flow Matching for Audio * 07:27 Real Time Voice Agents * 13:40 Efficiency and Model Strategy * 14:53 Voice Agents Vision * 17:56 Enterprise Deployment and Privacy * 23:39 Fine Tuning and Personalization * 25:22 Enterprise Voice Personalization * 26:09 Long-Form Speech Models * 26:58 Real-Time Encoder Advances * 27:45 Scaling Context for TTS * 28:53 What Makes Small Models * 30:37 Merging Modalities Tradeoffs * 33:05 Open Source Mission * 35:51 Lean and Formal Proofs * 38:40 Reasoning Transfer and Agents * 40:25 Next Frontiers in Training * 42:20 Hiring and AI for Science * 44:19 Forward Deployed Engineering * 46:22 Customer Feedback Loop * 48:29 Wrap Up and Thanks Transcript Announcing Voxtral TTS swyx Host (00:05) Okay. (00:05) Welcome to Latent Space. (00:06) We’re here in the studio with trusty co-host Vibhu. (00:09) Welcome. Vibhu Host (00:11) Very excited for this one. swyx Host (00:12) As well as Guillaume and Pavan from Mistral. (00:15) Welcome. (00:16) Excited to be here. (00:17) Thank you for having us. (00:18) Pavan, you are leading audio research at Mistral and Guillaume, you’re the Chief Scientist. (00:23) What are we announcing today? We’re coordinating this release with you guys. Guillaume Guest (00:26) Yeah, so we are releasing Voxtral TTS. So it’s our first audio model that generates speech. It’s not our first audio model. We had a couple of releases before. (00:35) We had one in the summer that was Voxtral, our first audio model, but it was a transcription model, ASR. A few months later, we released some updates on top of this, supporting more languages, and also a lot of table-stakes features for our customers: context biasing, precision timestamping in transcription. We also have a real-time model that can transcribe not just at the file level. (00:56) You don’t need to feed in your entire audio file; it can also come in real-time. And here, this is a natural extension in audio: basically, speech generation.
So yeah, we support nine languages, and this is a pretty small model, a 3B model, so very fast, and also state of the art. It performs at the same level as the best models, but it’s much more efficient; in terms of cost, it’s also much cheaper, only a fraction of the cost of our competitors. (01:22) And we are also releasing the work behind this model. swyx: What’s the decision factor? Guillaume: It’s a good question. swyx: There will be more. Yeah, Pavan, any sort of research notes to add on? Architecture and Codec Pavan: It’s a novel architecture that we developed in-house. We iterated on several internal architectures and ended up with an autoregressive flow-matching architecture, and we also have a new in-house neural audio codec, which converts audio into 12.5 Hz latent [00:02:00] tokens, semantic and acoustic tokens. And yeah, that’s the new part about this model, and we’re pretty excited that it came out with such good quality. As Guillaume was mentioning, yeah, it’s a 3B model. It’s based off of the Ministral model that we actually released just a few months back, and it’s mainly meant for, like, the TTS stuff, but the text capabilities are also there. Yeah. swyx: So there’s a lot to cover. I always love anything to do with novel encodings and all those things, because obviously it creates a lot of efficiency, but also maybe bugs that sometimes happen. You were previously at Gemini and you worked on post-training for language models, and maybe a lot of people will have less experience with audio models in general compared to pure language. What did you find that you had to revisit from scratch as you joined Mistral and started doing this? Understanding vs Generation Pavan: When it comes to this, I think there are two buckets, I guess: audio understanding and audio [00:03:00] generation. Audio understanding, like the Voxtral models that Guillaume was mentioning that we released earlier, the Voxtral chat that we released I think July last year, and the follow-up transcription-only family of models that we released in January, that would be one bucket, and generation is another bucket, I think. You can also treat them as a unified set of models, but currently the approaches are a little different between these two. To your question on how audio is fed to the model: in the understanding model, it’s very similar to the Pixtral models that we also released, swyx: yes. Pavan: That’s swyx: amazing. Pavan: It was pretty, I, that was the first project I worked on after I joined Mistral. It was pretty, pretty nice. And Voxtral was very similar in spirit, I guess. So we feed audio through an audio encoder, similar to images through a vision encoder, and it produces continuous embeddings, which are fed as tokens to the main decoder transformer model. Yeah. And the model output is just text. So on the output side, there is nothing that needs to be done in these kinds of models. I [00:04:00] guess the interesting part is the generation stuff: the output now has to produce audio. And the approach that we have is this neural audio codec, which converts audio into these latent tokens. There is a lot of existing literature and a lot of models which are based off of this kind of approach, and we took slightly different design decisions around this. But at the end of the day, the neural audio codec converts audio into a 12.5 Hz set of latents.
And each latent has a semantic token and a set of acoustic tokens. And the idea is that you take these discrete tokens and then feed them on the input side. There are several ways to use this at each frame, but we just sum the embeddings. So it’s like having K different vocabularies: combine all of them, because they all correspond to one audio frame on the input side. The output side is the interesting part. On the output side, it’s not the, I don’t know if it’s the most popular, but one popular technique is to have a depth transformer, [00:05:00] because you have K tokens at each time step; with text, you just have one token at each time step. So you just predict the token from the vocabulary and, yeah, you get a probability. swyx: That’s very straightforward for text. Pavan: Very straightforward. swyx: Yeah. Pavan: But if you have K tokens, then the naive thing would be to predict all of them in parallel. That doesn’t work, at least not that well, because audio has more entropy. And one of the techniques people use is this depth transformer, where you almost have a small transformer, or it can be an LSTM or RNN as well, but people use transformers, and you predict the K tokens in an autoregressive fashion within that. So you have two autoregressive things going on. Flow Matching for Audio Pavan: So the thing we did differently is, instead of having this autoregressive K-step prediction, we have a flow matching model. Instead of modeling this as a discrete token set, we trained the codec to be both discrete and continuous, to have this flexibility. So we did try the discrete stuff too, and it works well, but the continuous stuff works just better. So yeah, we took this flow matching approach. It’s a flow [00:06:00] matching head, which takes the latent from the main transformer and, like in diffusion, it’s denoising, but in flow matching it’s a velocity estimate. So you go from noise all the way to the audio latent, which corresponds to the 80 millisecond audio, and that is then sent through the vocoder to get back the 80 millisecond audio frame. swyx: Yeah. Is this the first application of flow matching in audio? Because usually I come across this in images. Pavan: In some sense there are flow matching models in audio, but I think this specific combination, I could be wrong, there could be some work, but I haven’t seen much work on this, so I think it’s novel. Images are just a way bigger community, so I think they pioneered a lot of this diffusion and flow matching work, and it’s interesting to adopt some of the ideas there into audio. swyx: Yeah. Pavan: Yeah, personally, that’s the part I’m most excited about. One more meta point is, unlike text, even
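The codec discussion above is dense, so here is a minimal PyTorch sketch of our reading of it, with all shapes, sizes, and module names invented for illustration (this is not Mistral's code). Input side: the K codec tokens of each 80 ms frame are embedded via K separate vocabularies and summed into one frame embedding. Output side: in place of a depth transformer's K autoregressive predictions, a flow-matching head regresses a velocity field and Euler-integrates from noise to the continuous frame latent, which a vocoder would turn back into audio.

```python
# A hedged sketch of the described autoregressive + flow-matching TTS decoding.
import torch
import torch.nn as nn

K, VOCAB, D_MODEL, D_LATENT = 8, 2048, 512, 64  # illustrative sizes only

class FrameEmbedder(nn.Module):
    """Sum K codebook embeddings per frame ("K different vocabularies")."""
    def __init__(self):
        super().__init__()
        self.books = nn.ModuleList(nn.Embedding(VOCAB, D_MODEL) for _ in range(K))

    def forward(self, codes: torch.Tensor) -> torch.Tensor:  # codes: (batch, time, K)
        # One embedding per codebook, summed because they all describe one frame
        return sum(book(codes[..., k]) for k, book in enumerate(self.books))

class FlowMatchingHead(nn.Module):
    """Predict velocity v(x_t, t | h), then integrate from noise to the latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_LATENT + D_MODEL + 1, 512), nn.GELU(),
            nn.Linear(512, D_LATENT),
        )

    def forward(self, x_t, h, t):
        return self.net(torch.cat([x_t, h, t], dim=-1))

    @torch.no_grad()
    def sample(self, h: torch.Tensor, steps: int = 8) -> torch.Tensor:
        x = torch.randn(h.shape[0], D_LATENT)  # start from pure noise
        for i in range(steps):  # Euler steps along dx/dt = v(x, t)
            t = torch.full((h.shape[0], 1), i / steps)
            x = x + self.forward(x, h, t) / steps
        return x  # one 80 ms acoustic latent per row, ready for a vocoder

h = torch.randn(2, D_MODEL)             # stand-in backbone states for 2 frames
latents = FlowMatchingHead().sample(h)  # (2, D_LATENT) continuous frame latents
```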
Materials science is the unsung hero of the science world. Behind every physical product you interact with are decades of research into getting the properties of materials just right. Your gym clothes contain synthetic fibers developed over decades. The glass screen, diodes, and chip substrate technology needed to read this blog post were only viable due to many teams of material scientists. Our guest Prof. Heather Kulik was one of the first material scientists to realize that there was alpha in combining computational tools with data-driven modeling — she did AI for science before it was cool. She has a hard-fought perspective on how to succeed in this field. Yes, she believes the wins are real. To get there you must work hard to deeply integrate domain expertise with AI techniques, and also maintain a discriminating mind. Ultimately what matters is that you succeed in the lab, and nature doesn’t care about how hyped a model is. These lessons personally resonated with the Latent.Space Science team and our own experience. This episode is a must watch for all aspiring AI for science practitioners. A few highlights: Designing new polymers with AI: Heather’s group recently used AI to design new polymers that are significantly stronger. These materials were created and tested in the lab, and the scientists who built them were surprised by the designs. The AI had figured out that certain building blocks could break in a novel way. The AI discovered a purely quantum mechanical effect, and after convincing their lab collaborators to actually synthesize it, the material turned out to be four times tougher! The twenty-two-atom ligand challenge: When asked about the role and need of human scientists, Heather points out that AI has a strong understanding of academic chemistry, but is still lacking intuition. Every time an LLM is updated, Heather asks it to design a ligand that contains exactly twenty-two heavy atoms. She has yet to find one that can succeed at this seemingly simple task that any expert could do in a second! Is this the chemistry counterpart to counting ‘r’s in strawberry? Side note: Heather joked that this comment would date itself immediately, so we decided to see if this was still true three months after recording. We found some interesting results! We asked both Claude and ChatGPT to design a 22-atom ligand for both a metal-organic framework (MOF) and a kinase protein. * For the kinase, both models got it right: Claude pulled out RDKit in a Python script and iterated on several designs, whereas ChatGPT just one-shotted it. * For MOFs, both models got it wrong, generating ligands with 21, 23, or 24 atoms, yet stubbornly not getting 22 atoms. Is there something different about how LLMs reason in the materials and bio domains? Materials vs biology: The two biggest domains of AI in science have been biology and materials. We asked Heather if there could be an AlphaFold moment for materials. Her answer reframes how we should think about the field: * First, the datasets in material science are woefully lacking in comparison to the bio world. The closest to ground truth in most cases are noisy DFT datasets. These are just approximations to the real world! The datasets that are accurate are all boring, as Heather quipped: “We have really good datasets for really boring chemistry.” Furthermore, good experimental structures are hard to come by and require interpretation. So generating high-quality, novel datasets at scale would really drive the field forward.
* More philosophically, AlphaFold is making predictions in a fairly limited space: there are just twenty amino acids. Sure, even here AlphaFold doesn’t get everything right, but it seems plausible that one could learn the entire design space. For materials, each element is a new set of interactions and chemistry, with little to no transferability. This is a massive open problem in material science that we hope some of the smartest AI scientists will want to work on! The difficulties of trusting the literature: Heather’s team has spent the last few years using NLP and later LLMs to extract data from the literature. Even a few thousand data points from these papers can be valuable for guiding her group’s work. One surprising result: sometimes the reported values for a property (say, temperature) do not match up with the graphs in the papers! So there’s lots of potential in using LLMs to mine data from the literature, just do it with care. The role of academia in an ever-changing world: One theme that has been running through many of our conversations has been the changing role of the academic — and the scientist — in science. When startups are raising $100s of millions and hyperscalers and Big Pharma are all ramping up AI-for-science efforts, the academic researcher needs both resources and judgement about which problems to chase more than ever. Resources include data that is organized for machine learning, access to high throughput experimentation labs, and compute resources. These are all things that academics can build together. More importantly, Heather emphasizes curiosity about problems that haven’t hit the radar of the heavily capitalized AI companies. After so many years on the forefront of AI for Science, Heather’s judgement that Chemical Engineering and Material Science still need curious people asking questions with no clear path to money is a welcome beacon in the AI fog. The full video podcast is on YouTube!
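The heavy-atom check behind the twenty-two-atom ligand challenge above is itself mechanical: a few lines of RDKit (the library Claude reached for in our test) suffice to verify a candidate. The SMILES string below is an arbitrary example of ours, not one of the model-generated ligands; the hard part, as Heather's test shows, is designing a sensible ligand that lands on exactly 22.

```python
# Verifying a candidate against the 22-heavy-atom constraint with RDKit.
from rdkit import Chem

def heavy_atom_count(smiles: str) -> int:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return mol.GetNumHeavyAtoms()  # counts non-hydrogen atoms only

candidate = "c1ccc(cc1)C(=O)Nc1ccncc1"  # an example amide-linked pyridine ligand
print(heavy_atom_count(candidate) == 22)  # the check is trivial; the design is not
```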
Mar 23 update for Latent Spacenauts: this episode was recorded before the Dreamer team announced they were joining Meta Superintelligence Labs, and it turned out to be the last interview they did before the news became public. Consider this a snapshot from just before the transition! In 2024, David Singleton left Stripe and joined forces with Hugo Barra for a buzzy stealth startup named /dev/agents. This month they emerged as Dreamer, a consumer-first platform to discover, build, and use AI agents and agentic apps, centered on a personal “Sidekick” that helps users customize experiences via natural language. Sidekick is nothing less than an “agent that builds agents”, with all the complexity that that entails: You’ve seen many, many website builder, app builder, and even agent builder startups by now, but our favorite detail is the sheer amount of work that has gone into the “full stack” nature of the platform, including shipping their own SDK, logging, database, prompt management, serverless functions, and so on. Most platforms restrict the tech stack you can use just to get off the ground — Dreamer does it “right” by letting you push whatever arbitrary code you want to their VMs. Paying the Builders Of course, former leaders of Stripe and Android would not stop at just building the tools; they are also building the ecosystem. Dreamer is deeply aware of the four-sided network effect it has going on and is ready to fund all of it. It’s time to Dream! Full Video Episode on YouTube. Transcript [00:00:00] Meet Dreamer Purple [00:00:00] swyx: Okay, we’re here in the studio with David Singleton. Welcome. [00:00:08] David Singleton: Hey, swyx. It’s great to be here. [00:00:09] swyx: It’s great to have you. Uh, it’s very simpatico that your company color is the same as Latent Space’s color. [00:00:15] David Singleton: That’s right. Dreamer Purple. [00:00:17] swyx: It used to be /dev/agents, which I thought was very cool. It’s like a callback to /dev/payments. [00:00:22] David Singleton: Yeah. [00:00:22] swyx: And you were obviously CTO of Stripe. And talk to me about just the origin or thinking process behind Dreamer. Yeah. And maybe, maybe start with, like, what, what is Dreamer? [00:00:31] David Singleton: Yeah. [00:00:31] What Is Dreamer [00:00:31] David Singleton: So Dreamer is a new product, uh, which everyone can come and play with today. Um, it’s a place where everyone, literally everyone, can discover, build, and enjoy and use AI agents and agentic apps. [00:00:45] And we really did design it for consumers, for folks who don’t necessarily, uh, have any kind of technical background. It’s really aimed at everyone. I think often of my sister: she’s very smart, she’s not in the slightest bit technical, and she has lots of problems in her life that [00:01:00] she would like to be able to have great software and intelligent software to solve. [00:01:04] But you know, even with the rise of tools like Claude Code and so forth, she’s got no way to get started. And Dreamer is a place where she can come in, grab some intelligent apps that other people in the community have built, start using them right away, and solve real problems in her life. [00:01:19] Sidekick And Waitlist [00:01:19] David Singleton: And at the core, we have a personal agent called the Sidekick. [00:01:24] Um, you can give your Sidekick a name, you can give it its own personality, and it really helps you across your entire day, your life. It helps you use all of the agents on the platform, and it also helps you build anything you want.
And we’ve been working on this for a little while. We recently launched in beta. [00:01:41] So anyone can go to dreamer.com, join the wait list. Um, and we have many, many, many people in the community now who are building really fun, really powerful, really useful agents and agentic apps for themselves. [00:01:54] swyx: I think we’re gonna go right into a demo. Yeah. I just wanna make an observation that, uh, you, you, [00:02:00] you put discover first before build. [00:02:02] Mm-hmm. But actually, at least for the engineers in the audience. ‘cause we are primarily engineers and you’re primarily targeting consumers, right? [00:02:08] David Singleton: Yeah. [00:02:08] swyx: For engineers. Like, there’s a huge full stack of stuff, which we’re gonna dive into. Right? It’s so impressive. I’m like, holy s**t, this, this is what I’ve always wanted. [00:02:16] Cool. Uh, so, so I think that’s really good and I’ve, in some ways, I think given your background, given, uh, Hugo’s, is it Hugo? Hugo. [00:02:24] David Singleton: Hugo. Hugo Barra. Yeah. [00:02:25] swyx: Hugo, it’s not surprising that you can basically kind of build an app store Yeah. For agents. [00:02:30] David Singleton: Yeah. So Hugo was my co-founder. Yeah. Um, Hugo and I met with our other co-founder Nicholas Jitkoff in the very early days of Android at Google, where we were building Google’s first mobile apps. [00:02:41] Uh, we then contributed to very core pieces of Android itself. And you’re right, we were really excited about building two things. One, solving a bunch of problems that this breakthrough technology (here I’m talking about mobile) needed to have solved in order to make it work for real people at scale. And then secondly, building this ecosystem, um, [00:03:00] of third party developers using the Play Store, um, and able to deliver way more value on the platform than we could have delivered on our own. [00:03:08] And we think about Dreamer in exactly the same way. So I was working at Stripe, as you mentioned, and we had the opportunity to put some of the very first AI agent systems in the world into production. And from the moment we did the first of those, I was just struck with a strong sense of conviction that this is breakthrough technology that’s gonna change how all of us work with computers and phones and so forth, all of the, the technology in our lives. But [00:03:34] there’s a lot of problems to be solved for real people to be able to make this approachable. Um, and it really is kind of a direct analog for what we were solving back in the early days of mobile apps at Google and, and Android. So it’s, it’s been fun to bring that to life. [00:03:47] swyx: Yeah. Uh, let’s look at it. [00:03:48] David Singleton: Yeah, let’s take a look. [00:03:49] Dashboard And Daily Briefing [00:03:49] David Singleton: So, uh, dreamer.com, this is our homepage. This is where you can come and, uh, watch some videos about what is here and sign up for the wait list. Once you… [00:03:57] swyx: I, I just wanna say for those listening, ‘cause we have a lot, you [00:04:00] know, switch to YouTube, look at the animations. So much care. [00:04:03] David Singleton: We, we really care about, uh, this product being fun. [00:04:07] Uh, and, and interesting to use. Obviously a lot of people are using it to do real important stuff. You can do real work, uh, here, uh, but also you can build fun things too. Once you get off of our wait list, you’ll come into the product.
The first thing that happens is you’ll have a conversation with your Sidekick, which is this little friendly, uh, character here. [00:04:27] And Sidekick will seek to get to know you and understand you. What do you care about? And it will help you discover and build your first AI agents or agentic apps. After that, you’re, you’re gonna have a dashboard. This is my dashboard. Everyone’s is different. Um, you can see I have a few things here. I have a feed. [00:04:42] So a lot of our agents do things in the background when you’re not looking, and the feed is how they let you know what they’ve been up to. I have, uh, some widgets, uh, from apps that I have built. Uh, this one is called Calendar Hero. Uh, this is something that I installed from the gallery. Uh, so built by someone in our community. [00:04:59] It’s a [00:05:00] really powerful calendar app because for each of my meetings, if it’s with someone I don’t already know well, it’ll actually go off and research them, um, and give me both a history of my interactions with those people and also a bunch of, you know, public useful information to, to get started. One of the things I love about this particular app is that every day it generates a podcast, um, a daily briefing. [00:05:24] And one of the things that we’ve done with the platform is we’ve made it possible for all the things that agents do to show up in places that you care about. So if you look over here, this is the screen of my phone, and if I go ahead and open my Apple Podcasts, you can see right here: your daily briefing podcast is ready. [00:05:39] This was produced by an agent running in my Dreamer account, and it was very easy, by scanning a QR code, to connect it to my Apple Podcasts. That’s what I listen to in the car now every morning. Yeah. On my way to work. [00:05:50] swyx: It, it [00:05:50] David Singleton: preps me for, for my day. [00:05:52] swyx: So one additional bit of context. I asked you immediately after seeing this, like, what about, I wanna talk back to my agent? And you said you actually started with voice and then you went to [00:06:00] podcasts, [00:06:00] ‘cause it’s nice to have it pre-downloaded. [00:06:02] David Singleton: That’s right. Um, yeah, you, you can talk to your Sidekick. So, you know, on mobile we have, uh, a Dreamer app and you can talk to the Sidekick right here. Um, but we’ve actually found that making things, uh, show up in the other apps that you already use in your life is incredibly powerful. [00:06:19] So let’s take a look at what’s kind of under the hood here. [00:06:21] Gallery Tools And Payouts [00:06:21] David Singleton: So I already mentioned that we have a gallery, so this is where you’ll find a lot of agents from our community. Uh, there’s many at this point, hundreds. And they ar
Claude Cowork came out of an accident. Felix and the Anthropic team noticed something interesting with Claude Code: many users were using it primarily for all kinds of messy knowledge work instead of coding. Even technical builders would use it for lots of non-technical work. Even more shocking, Claude Cowork wrote itself: with a team of humans simply orchestrating multiple Claude Code instances, the tool was ready after a brief week and a half. This isn’t Felix’s first rodeo with impactful and playful desktop apps. He helped ship the Slack desktop app and is a core maintainer of Electron, the open-source software framework used for building cross-platform desktop applications, and has even put Windows 95 into an Electron app that runs on macOS, Windows, and Linux. In this episode, Felix joins us to unpack why execution has suddenly become cheap enough that teams can “just build all the candidates” and why the real frontier in AI products is no longer better chat, but trusted task execution. He also shares why Anthropic is betting on local-first agent workflows, why skills may matter more than most people realize, and how the hardest questions ahead are about autonomy, safety, portability, and the changing shape of knowledge work itself. We discuss * Felix’s path: Slack desktop app, Electron, Windows 95 in JavaScript, and now building Claude Cowork at Anthropic * What Claude Cowork actually is: a more user-friendly, VM-based version of Claude Code designed to bring agentic workflows to non-terminal-native users * Why “user-friendly” does not mean “less powerful”: Cowork as a superset product, much like how VS Code initially looked simpler than Visual Studio but became more hackable and extensible * Anthropic’s prototype-first culture: why Cowork was built in 10 days using many pre-existing internal pieces, and how internal prototypes shaped the final product * Why execution is getting cheap: the shift from long memos, specs, and debate toward rapidly building multiple candidates and choosing based on reality instead of theory * The local debate: why Felix thinks Silicon Valley is undervaluing the local computer, and why putting Claude “where you work” is often more powerful * Why Claude gets its own computer: the VM as both a safety boundary and a capability unlock, letting Claude install tools, run scripts, and work more independently without constant approval * Safety through sandboxing: why “approve every command” is not a real long-term UX, and how virtual machines create a middle ground between uselessly safe and dangerously autonomous * How Cowork differs from Claude Code: coding evals vs. knowledge-work evals, different system-prompt tradeoffs, longer planning horizons, and heavier use of planning and clarification tools * Why skills matter: simple markdown-based instructions as a lightweight abstraction layer for reusable workflows, personalized automation, and portable agent behavior * Skills vs.
MCPs: why Felix is increasingly interested in file-based, text-native interfaces that tell the model what to do, rather than forcing everything through rigid tool schemas * The portability problem: why personal skills should move across agent products, and the unresolved tension between public reusable workflows and private user-specific context * Real use cases already happening today: uploading videos, organizing files, handling taxes, managing calendars, debugging internal crashes, analyzing finances, and automating repetitive browser workflows * Why AI products should work with your existing stack: Anthropic’s bias toward integrating with Chrome, Office, and existing workflows instead of rebuilding every app from scratch * Computer use one year later: how much better it has gotten, why vision plus browser context is such a superpower, and why letting Claude see the thing it is working on changes everything * Why many “AI verticals” may get compressed: specialized wrappers may matter in the short term, but better general models and stronger primitives could absorb a lot of narrow use cases * The future of junior work: Felix’s concerns about entry-level roles, labor-market disruption, and whether AI can compress early-career learning into denser simulated experience * Why Waterloo grads stand out: internships, shipping experience, and learning how real teams build products versus purely theoretical academic preparation * The agentic future of the desktop: what it means for Claude to have its own computer, whether AI should act on your machine or a remote one, and how intimacy with personal data changes the product design space * Why Electron still mattered: shipping Chromium as a controlled rendering stack, the limits of OS-native webviews, and why browser engines remain one of the great software abstractions * Anthropic’s Labs mentality: wild internal experiments, half-broken future-looking prototypes, and the broader effort to move users from asking questions to delegating increasingly long and valuable tasks * Why the endgame is not just more capability, but more independence: teaching users to trust AI with bigger scopes of work, for longer durations, with fewer interventions Felix Rieseberg * X: https://x.com/felixrieseberg * LinkedIn: https://www.linkedin.com/in/felixrieseberg * Website: https://felixrieseberg.com/ Anthropic * Website: http://anthropic.com Full Video Pod Timestamps
00:00 — Cheap execution and building all the candidates
00:44 — Intro in the new Kernel studio
02:47 — What Claude Cowork is
04:18 — Why user-friendly can be more powerful
05:33 — How Anthropic built Cowork
07:09 — Prototype-first product development
08:00 — Why local computers still matter
09:20 — Skills, primitives, and platform leverage
12:13 — Cowork’s architecture: VM + Chrome + system prompt
15:38 — Felix’s own bug-fixing Cowork workflows
17:38 — Local-first agents
20:16 — Evals, planning, and knowledge-work optimization
23:14 — What Anthropic means by evals
24:21 — Scaffolding, tools, and why skills matter
27:44 — Demo: YouTube uploads and self-generated skills
31:03 — Calendar automation and cleaning your desktop
34:47 — Browser context and why DOM access matters
37:47 — Skills portability and plugins
44:36 — Which AI categories survive?
46:19 — Junior jobs, simulated work, and labor disruption
52:00 — Gradual takeoff vs big-bang takeoff
53:42 — Finance, taxes, and enterprise verticals
56:24 — Vision and the improvement in computer use
57:31 — Why Claude writes its own scripts
58:06 — Should Claude have its own computer?
1:01:26 — Windows 95 in JavaScript
1:03:19 — VM tradeoffs and sandbox design
1:07:23 — Approval fatigue and safe delegation
1:11:18 — The future of Cowork
1:12:27 — What comes next for agentic knowledge work
1:15:13 — Electron, Chromium, and desktop software lessons
1:22:16 — Multiplayer agents and coworker-to-coworker workflows
1:26:05 — Anthropic Labs and closing thoughts
Transcript Alessio: Hey everyone. Welcome to the Latent Space Podcast, our first one in the new studio. This is Alessio, founder of Kernel Labs, and I’m joined by swyx, editor of Latent Space. swyx: Yeah, so nice to be here. Thanks to, uh, TJ, Alessio, Allen for helping to set everything up. It looks beautiful. We even have the logo outside. Yeah, kind of. Felix: It’s like really nice, right? When you walk in here as a guest, you’re like, ah, this is a serious production. You like, feel it immediately. swyx: Yeah. Felix, you’ve been, you’re, you’re currently a product manager of Cowork or, Felix: uh, really technically eng. swyx: Yeah. The, the identities are kind of vague, member of technical staff. Felix: I know, member of technical staff is like the official title we’ll carry around forever. swyx: Yeah. I basically kind of wanted, like we’ve been kinda obsessed. I, I’ve been using it a lot, even for managing Latent Space. Like, uh, Cowork helps me upload videos and like title things and like edit and everything. It’s, it’s like really amazing. Alessio: Cool. He’s said multiple times Cowork has achieved AGI in the group chat. swyx: Yeah, yeah, yeah. So, so we have a second, uh, we have a second channel, uh, for Latent Space TV. Uh, and I, uh, and uh, we basically, this is our Discord meetup. Um, and I, I, we have like Claude Cowork, it might be AGI, I don’t know if we, we have, uh, uploaded it yet, but one of the sessions was like a, like a Claude Cowork thing. Felix: You have to send it, I would love to see it. Like, I’m so curious, like one of the most fun parts of my job is like constantly seeing the weird things people use Cowork for, because it’s obviously like very hard for us to actually design for specific use cases we do. But like every single person who’s like most amazed is usually amazed about a thing that I didn’t even expect Cowork would be good at. Um, we have a new designer, and as one of the first small tasks, I was like, hey, we need like a new emoji for Cowork for our internal Slack. It’s like a pretty small thing. I was like, can you please do it? And he drew an SVG and just gave it to Cowork and was like, can you animate this emoji? And now it has like this beautiful loopy animation. Um, and I mean, I think obviously this goes down to like, it turns out you can do more things with code than you expected, but it, it’s like that kind of stuff that is really fun to me. So, long story short, I would love to see like, the kind of things you’re doing. swyx: I’ll pull it up. I’ll pull it up. Felix: Yeah. Yeah. swyx: Uh, but before we get into it, I, I think always wanna start with like a top level: what is Claude Cowork, for people who haven’t heard of it, haven’t tried it out? Felix: Okay. Uh, real quick, Claude Cowork is a user-friendly version of Claude Code. So the way it basically works is we have Claude Code, a for us fairly impressive agent harness, and over December we noticed more and more people are using either, eve
Turbopuffer came out of a reading app. In 2022, Simon was helping his friends at Readwise scale their infra for a highly requested feature: article recommendations and semantic search. Readwise was paying ~$5k/month for their relational database, and vector search would cost ~$20k/month, making the feature too expensive to ship. In 2023, after mulling over the problem from Readwise, Simon decided he wanted to “build a search engine”, which became turbopuffer. We discuss:
• Simon’s path: Denmark → Shopify infra for nearly a decade → “angel engineering” across startups like Readwise, Replicate, and Causal → turbopuffer almost accidentally becoming a company
• The Readwise origin story: building an early recommendation engine right after the ChatGPT moment, seeing it work, then realizing it would cost ~$30k/month for a company spending ~$5k/month total on infra, and getting obsessed with fixing that cost structure
• Why turbopuffer is “a search engine for unstructured data”: Simon’s belief that models can learn to reason, but can’t compress the world’s knowledge into a few terabytes of weights, so they need to connect to systems that hold truth in full fidelity
• The three ingredients for building a great database company: a new workload, a new storage architecture, and the ability to eventually support every query plan customers will want on their data
• The architecture bet behind turbopuffer: going all in on object storage and NVMe, avoiding a traditional consensus layer, and building around the cloud primitives that only became possible in the last few years
• Why Simon hated operating Elasticsearch at Shopify: years of painful on-call experience shaped his obsession with simplicity, performance, and eliminating state spread across multiple systems
• The Cursor story: launching turbopuffer as a scrappy side project, getting an email from Cursor the next day, flying out after a 4am call, and helping cut Cursor’s costs by 95% while fixing their per-user economics
• The Notion story: buying dark fiber, tuning TCP windows, and eating cross-cloud costs because Simon refused to compromise on architecture just to close a deal faster
• Why AI changes the build-vs-buy equation: it’s less about whether a company can build search infra internally, and more about whether they have time, especially if an external team can feel like an extension of their own
• Why RAG isn’t dead: coding companies still rely heavily on search, and Simon sees hybrid retrieval (semantic, text, regex, SQL-style patterns) becoming more important, not less
• How agentic workloads are changing search: the old pattern was one retrieval call up front; the new pattern is one agent firing many parallel queries at once, turning search into a highly concurrent tool call (see the sketch after these notes)
• Why turbopuffer is reducing query pricing: agentic systems are dramatically increasing query volume, and Simon expects retrieval infra to adapt to huge bursts of concurrent search rather than a small number of carefully chosen calls
• The philosophy of “playing with open cards”: Simon’s habit of being radically honest with investors, including telling Lachy Groom he’d return the money if turbopuffer didn’t hit PMF by year-end
• The “P99 engineer”: Simon’s framework for building a talent-dense company, rejecting by default unless someone on the team feels strongly enough to fight for the candidate
Simon Hørup Eskildsen
• LinkedIn: https://www.linkedin.com/in/sirupsen
• X: https://x.com/Sirupsen
• Website: https://sirupsen.com/about
turbopuffer
• https://turbopuffer.com/
Full Video Pod
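To make the concurrency shift concrete, here is a minimal sketch of the agentic retrieval pattern described above, with a hypothetical `search` coroutine standing in for any retrieval backend (this is an illustration of the workload shape, not turbopuffer’s actual client API):

```python
# Sketch: the old RAG pattern was one retrieval call up front; an agent
# instead fans out many queries in one step and merges the results.
# `search` is a hypothetical stand-in for any retrieval backend.
import asyncio

async def search(query: str, top_k: int = 10) -> list[str]:
    await asyncio.sleep(0.05)  # placeholder for a network round trip
    return [f"doc for {query!r} #{i}" for i in range(top_k)]

async def agent_retrieval_step(queries: list[str]) -> list[str]:
    # Fire all queries concurrently; retrieval becomes a burst of
    # parallel tool calls rather than a single carefully chosen one.
    results = await asyncio.gather(*(search(q) for q in queries))
    seen, merged = set(), []
    for docs in results:
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

if __name__ == "__main__":
    hits = asyncio.run(agent_retrieval_step(
        ["error handling in uploader", "retry backoff", "S3 consistency"]))
    print(len(hits), "candidate documents")
```

The design point is that the backend now sees bursts of many queries per agent step rather than one query per user request, which is exactly the pricing and capacity shift Simon describes.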
Timestamps
00:00:00 The PMF promise to Lachy Groom
00:00:25 Intro and Simon's background
00:02:19 What turbopuffer actually is
00:06:26 Shopify, Elasticsearch, and the pain behind the company
00:10:07 The Readwise experiment that sparked turbopuffer
00:12:00 The insight Simon couldn’t stop thinking about
00:17:00 S3 consistency, NVMe, and the architecture bet
00:20:12 The Notion story: latency, dark fiber, and conviction
00:25:03 Build vs. buy in the age of AI
00:26:00 The Cursor story: early launch to breakout customer
00:29:00 Why code search still matters
00:32:00 Search in the age of agents
00:34:22 Pricing turbopuffer in the AI era
00:38:17 Why Simon chose Lachy Groom
00:41:28 Becoming a founder on purpose
00:44:00 The “P99 engineer” philosophy
00:49:30 Bending software to your will
00:51:13 The future of turbopuffer
00:57:05 Simon’s tea obsession
00:59:03 Tea kits, X Live, and P99 Live
Transcript Simon Hørup Eskildsen: I don’t think I’ve said this publicly before, but I just called Lachy and was like, look, Lachy, if this doesn’t have PMF by the end of the year, like we’ll just like return all the money to you. But it’s just like, I don’t really, we, Justine and I don’t wanna work on this unless it’s really working. So we want to give it the best shot this year and like we’re really gonna go for it. We’re gonna hire a bunch of people. We’re just gonna be honest with everyone. Like when I don’t know how to play a game, I just play with open cards. Lachy was the only person that didn’t, that didn’t freak out. He was like, I’ve never heard anyone say that before. Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I’m joined by swyx, editor of Latent Space. swyx: Hello. Hello, uh, we’re still, uh, recording in the Kernel studio for the first time. Very excited. And today we are joined by Simon Eskildsen of turbopuffer. Welcome. Simon Hørup Eskildsen: Thank you so much for having me. swyx: turbopuffer has like really gone on a huge tear, and I, I do have to mention that like you’re now my newest member of the Danish Aarhus Mafia, where like there’s a lot of legendary programmers that have come out of it, like, uh, Bjarne Stroustrup, Rasmus Lerdorf, Lars Bak and the V8 team, and, and the Google Maps team. Uh, you’re mostly a Canadian now, but isn’t that interesting? There’s so many, so much like strong Danish presence. Simon Hørup Eskildsen: Yeah, I was writing a post, um, not that long ago about sort of the influences. So I grew up in Denmark, right? I left, I left when, when I was 18 to go to Canada to, to work at Shopify. Um, and so I, like, I’ve, I would still say that I feel more Danish than, than Canadian. This is also the weird accent. I can’t say “th” because, it, this is like, I don’t, you know, my wife is also Canadian, um, and I think. I think like one of the things in, in Denmark is just like, there’s just such a ruthless pragmatism and there’s also a big focus on just aesthetics. Like, they’re like very, people really care about like where, what things look like. Um, and like Canada has a lot of attributes, the US has, has a lot of attributes, but I think there’s been lots of the great things to carry over. I don’t know what’s in the water in Aarhus though. Um, and I don’t know that I could be considered part of the mafia quite yet, uh, compared to the phenomenal individuals we just mentioned. Rasmus Lerdorf is also, uh, Danish-Canadian. Okay. Yeah. I don’t know where he lives now, but, and he’s the PHP guy. swyx: Yeah.
And obviously Tobi, German, but moved to Canada as well. Yes. Like this is like, uh, that, that is an interesting, um, talent move. Alessio: I think I would love to get from you a definition of turbopuffer, because I think you could be a vector DB, which is maybe a bad word now in some circles, you could be a search engine. It’s like, let, let’s just start there and then we’ll maybe run through the history of how you got to this point. Simon Hørup Eskildsen: For sure. Yeah. So turbopuffer is, at this point in time, a search engine, right? We do full text search and we do vector search, and that’s really what we’re specialized in. If you’re trying to do much more than that, like then this might not be the right place yet, but turbopuffer is all about search. The other way that I think about it is that we can take all of the world’s knowledge, all of the exabytes and exabytes of data that there is, and we can use those tokens to train a model, but we can’t compress all of that into a few terabytes of weights, right? We can compress into a few terabytes of weights how to reason with the world, how to make sense of the knowledge. But we have to somehow connect it to something externally that actually holds that, like, in full fidelity and truth. Um, and that’s the thing that we intend to become. Right? That’s like a very holier-than-thou kind of phrasing, right? But being the search engine for unstructured, unstructured data is the focus of turbopuffer at this point in time. Alessio: And let’s break down. So people might say, well, didn’t Elasticsearch already do this? And then some other people might say, is this search on my data, is this like closer to RAG than to, like, an Exa, like a public search thing? Like how, how do you segment like the different types of search? Simon Hørup Eskildsen: The way that I generally think about this is like, there’s a lot of database companies, and I think if you wanna build a really big database company, sort of, you need a couple of ingredients to be in the air, which only happens roughly every 15 years. You need a new workload. You basically need the ambition that every single company on earth is gonna have data in your database, multiple times. You look at a company like Oracle, right? Like, I don’t think you can find a company on earth with a digital presence that doesn’t somehow have some data in an Oracle database. Right? And I think at this point, that’s also true for Snowflake and Databricks, right? 15 years later, or even more than that, there’s not a company on earth that isn’t indirectly or directly consuming Snowflake or, or Databricks or any of the big analytics databases. Um, and I think we’
Join Kyle, Nader, Vibhu, and swyx live at NVIDIA GTC next week! Now that AIE Europe tix are ~sold out, our attention turns to Miami and World’s Fair! The definitive AI accelerator chip company has more than 10xed this AI Summer: and is now a $4.4 trillion megacorp… that is somehow still moving like a startup. We are blessed to have a unique relationship with our first ever NVIDIA guests: Kyle Kranen, who gave a great inference keynote at the first World’s Fair and is one of the leading architects of NVIDIA Dynamo (a datacenter-scale inference framework supporting SGLang, TRT-LLM, and vLLM), and Nader Khalil, a friend of swyx from our days in Celo in The Arena, who has been drawing developers at GTC since before they were even a glimmer in the eye of NVIDIA. Nader discusses how NVIDIA Brev has drastically reduced the barriers to entry for developers to get a top-of-the-line GPU up and running, and Kyle explains NVIDIA Dynamo as a datacenter-scale inference engine that optimizes serving by scaling out, leveraging techniques like prefill/decode disaggregation, scheduling, and Kubernetes-based orchestration, framed around cost, latency, and quality tradeoffs. We also dive into Jensen’s “SOL” (Speed of Light) first-principles urgency concept, long-context limits and model/hardware co-design, internal model APIs (https://build.nvidia.com), and upcoming Dynamo and agent sessions at GTC. Full Video Pod on YouTube Timestamps
00:00 Agent Security Basics
00:39 Podcast Welcome and Guests
07:19 Acquisition and DevEx Shift
13:48 SOL Culture and Dynamo Setup
27:38 Why Scale Out Wins
29:02 Scale Up Limits Explained
30:24 From Laptop to Multi Node
33:07 Cost Quality Latency Tradeoffs
38:42 Disaggregation Prefill vs Decode
41:05 Kubernetes Scaling with Grove
43:20 Context Length and Co Design
57:34 Security Meets Agents
58:01 Agent Permissions Model
59:10 Build Nvidia Inference Gateway
01:01:52 Hackathons And Autonomy Dreams
01:10:26 Local GPUs And Scaling Inference
01:15:31 Long Running Agents And SF Reflections
Transcript Agent Security Basics Nader: Agents can do three things. They can access your files, they can access the internet, and then now they can write custom code and execute it. You literally only let an agent do two of those three things. If it can access your files and it can write custom code, you don’t want internet access, because that’s when you see full vulnerability, right? If you have access to internet and your file system, you should know the full scope of what that agent’s capable of doing. Otherwise, now we can get injected or something like that can happen. And so that’s a lot of what we’ve been thinking about is like, you know, how do we both enable this, because it’s clearly the future, but then also, you know, what, what are these enforcement points that we can start to like protect? swyx: All right. Podcast Welcome and Guests swyx: Welcome to the Latent Space podcast in the Chroma studio. Welcome to all the guests here. Uh, we are back with our guest host Vibhu. Welcome. Good to have you back. And our friends, uh, Nader and Kyle from NVIDIA. Welcome. Kyle: Yeah, thanks for having us. swyx: Yeah, thank you. Actually, I don’t even know your titles. Uh, I know you’re like architect something of Dynamo. Kyle: Yeah. I, I’m one of the engineering leaders [00:01:00] and an architect of Dynamo. swyx: And you’re director of something and developers, developer tech. Nader: Yeah.
swyx: You’re the developers, developers, developers guy at NVIDIA. Nader: Open source, agent marketing, Brev, swyx: and like Nader: DevRel tools and stuff. swyx: Yeah. Been Nader: the focus. swyx: And we’re, we’re kind of recording this ahead of NVIDIA GTC, which is coming to town, uh, again, uh, or taking over town, uh, which, uh, which we’ll all be at. Um, and we’ll talk a little bit about your sessions and stuff. Yeah. Nader: We’re super excited for it. GTC Booth Stunt Stories swyx: One of my favorite memories for Nader, like you always do like marketing stunts, and like while you were at Brev, you like had this surfboard that you like, went down to GTC with, and like, NVIDIA apparently like liked it so much that they bought you. Like what, what was that like? What was that? Nader: Yeah. Yeah, we, we, um, our logo was a shaka. We, we, uh, we were always just kind of like trying to keep true to who we were. I think, you know, some stuff, startups, you’re like trying to pretend that you’re a bigger, more mature company than you are. And it was actually Evan Conrad from SF Compute who was just like, you guys are like previous swyx: guest. Yeah. Nader: Amazing. Oh, really? Amazing. Yeah. He was just like, guys, you’re two dudes in the room. Why are you [00:02:00] pretending that you’re not? Uh, and so then we were like, okay, let’s make the logo a shaka. We brought surfboards to our booth at GTC and the energy was great. Yeah. Some palm trees too. They, Kyle: they actually poked out over like the, the walls so you could, you could see the Brev booth. Oh, that’s so funny. And Nader: no one else, Kyle: just from very far away. Nader: Oh, so you remember it back Kyle: then? Yeah, I remember it pre-acquisition. I was like, oh, those guys look cool, Nader: dude. That makes sense. ‘cause uh, we, so we signed up really last minute, and so we had the last booth. It was all the way in the corner. And so I was, I was worried that no one was gonna come. So that’s why we had like the palm trees. We really came in with the surfboards. We even had one of our investors bring her dog, and then she was just like walking the dog around to try to like, bring energy towards our booth. Yeah. swyx: Steph. Kyle: Yeah. Yeah, she’s the best. swyx: You know, as a conference organizer, I love that. Right? Like, it’s like everyone who sponsors a conference comes, does their booth. They’re like, we are changing the future of AI or something, some generic b******t, and like, no, like actually try to stand out, make it fun, right? And people still remember it after three years. Nader: Yeah. Yeah. You know what’s so funny? I’ll, I’ll send, I’ll give you this clip if you wanna, if you wanna add it [00:03:00] in, but, uh, my wife, at the time fiancée, she was in medical school and she came to help us. ‘cause it was like a big moment for us. And so we, we bought this Cricut, it’s like a vinyl, like a vinyl, uh, printer. ‘cause like, how else are we gonna label the surfboard? So, we got a surfboard, luckily was able to purchase that on the company card. We got a Cricut, and it was just like “fine tuning for enterprises” or something like that, that we put on the, on the surfboard, and it’s 1:00 AM the day before we go to GTC. She’s helping me put these like vinyl stickers on. And she goes, you son of, she’s like, if you pull this off, you son of a b***h. And so, uh, right, pretty much after the acquisition, I stitched that clip with the acquisition announcement. I sent it to our family group chat. Oh swyx: Yeah.
No, well, she, she made a good choice there. Was that like basically the origin story for Launchable? Is that we, it was, and maybe we should explain what Brev is and Nader: Yeah. Yeah. Uh, I mean, Brev is just, it’s a developer tool that makes it really easy to get a GPU. So we connect a bunch of different GPU sources. So the basics of it is like, how quickly can we SSH you into a G, into a GPU. And whenever we would talk to users, they wanted a GPU. They wanted an A100. And if you go to like any cloud [00:04:00] provisioning page, usually it’s like three pages of forms, or in the forms somewhere there’s a dropdown, and in the dropdown there’s some weird code that you know to translate to an A100. And I remember just thinking like, every time someone says they want an A100, like the piece of text that they’re telling me that they want is like, stuffed away in the corner. Yeah. And so we were like, what if the biggest piece of text was what the user’s asking for? And so when you go to Brev, it’s just big GPU chips with the type that you want, with swyx: beautiful animations that you worked on pre, like pre… you can, like, now you can just prompt it. But back in the day. Yeah. Yeah. Those were handcraft, handcrafted artisanal code. Nader: Yeah. I was actually really proud of that because, uh, it was an, I, I made it in Figma. Yeah. And then I found, I was like really struggling to figure out how to turn it from like Figma to React. So what it actually is, is just an SVG, and I, I have all the styles, and so when you change the chip, whether it’s like active or not, it changes the SVG code, and that somehow like renders like, looks like it’s animating, but it, we just had the transition slow. But it’s just like the, a JavaScript function to change the like underlying SVG. Yeah. And that was how I ended up like figuring out how to move it from, from Figma. But yeah, that’s Art, Artisan. [00:05:00] Kyle: Speaking of marketing stunts though, he actually used those SVGs, or kind of used those SVGs, to make these cards. Nader: Oh yeah. Like Kyle: a GPU gift card. Yes. That he handed out everywhere. That was actually my first impression of that Nader: one. Yeah, swyx: yeah, yeah. Nader: Yeah. swyx: I think I still have one of them. Nader: They look great. Kyle: Yeah. Nader: I have a ton of them still actually in our garage, which just, they don’t have labels. We should honestly like bring, bring them back. But, um, I found this old printing press here, actually just around the corner on Van Ness. And it’s a third-generation San Francisco shop. And so I come in as an excited startup founder trying to like, and they just have this crazy old machinery and I’m in awe. ‘cause the, the whole building is so physical. Like you’re seeing these machines, they have like pedals to like move these saws and whatever. I don’t know what this machinery is, but I saw all three generations. Like there’s like the grandpa
All speakers are announced at AIE EU, schedule coming soon. Join us there or in Miami with the renowned organizers of React Miami! Singapore CFP also open! We’ve called this out a few times over in AINews, but the overwhelming consensus in the Valley is that “the IDE is Dead”. In November it was just a gut feeling, but now we actually have data: even at the canonical “VSCode Fork” company, people are officially using more agents than tab autocomplete (the first wave of AI coding). Cursor launched cloud agents a few months ago, and this specific launch is around Computer Use, which has come a long way since we first talked with Anthropic about it in 2024, and which Jonas productized as Autotab. We also take the opportunity to do a live demo, talk about slash commands and subagents, and the future of continual learning and personalized coding models, something that Sam previously worked on at New Computer. (The fact that both of these folks are top-tier CEOs of their own startups that have now joined the insane talent density gathering at Cursor should also not be overlooked.) Full Episode on YouTube! Please like and subscribe! Timestamps
00:00 Agentic Code Experiments
00:53 Why Cloud Agents Matter
02:08 Testing First Pillar
03:36 Video Reviews Second Pillar
04:29 Remote Control Third Pillar
06:17 Meta Demos and Bug Repro
13:36 Slash Commands and MCPs
18:19 From Tab to Team Workflow
31:41 Minimal Web UI Philosophy
32:40 Why No File Editor
34:38 Full Stack Cursor Debate
36:34 Model Choice and Auto Routing
38:34 Parallel Agents and Best Of N
41:41 Subagents and Context Management
44:48 Grind Mode and Throughput Future
01:00:24 Cloud Agent Onboarding and Memory
Transcript EP 77 - CURSOR - Audio version [00:00:00] Agentic Code Experiments Samantha: This is another experiment that we ran last year and didn’t decide to ship at that time, but it may come back: an LLM judge, but one that was also agentic and could write code. So it wasn’t just picking, but also taking the learnings from two models, or n models, that it was looking at and writing a new diff. And what we found was that there were strengths to using models from different model providers as the base level of this process. Basically you could get almost like a synergistic output that was better than having a very unified, like, bottom model tier. Jonas: We think that over the coming months, the big unlock is not going to be one person with a model getting more done. Like, the water flowing faster… we’ll be making the pipe much wider, and so parallelizing more, whether that’s swarms of agents or parallel agents. Both of those are things that contribute to getting much more done in the same amount of time. Why Cloud Agents Matter swyx: This week, one of the biggest launches that Cursor’s ever done is cloud agents. I think you, you had [00:01:00] cloud agents before, but this was like, you give Cursor a computer, right? Yeah. So it’s just basically they bought Autotab and then they repackaged it. Is that what’s going on, or, Jonas: that’s a big part of it. Yeah. Cloud agents already ran in their own computers, but they were sort of sight-reading code. Yeah. And those computers were not, they were like blank VMs typically that were not set up for the dev ex of whatever repo the agent’s working on. One of the things that we talk about is if you put yourself in the model’s shoes, and you were seeing tokens stream by, and all you could do was sight-read code and spit out tokens and hope that you had done the right thing, swyx: no chance Jonas: I’d be so bad.
Like, obviously you need to run the code. And so that, I think, also is probably not that contrarian of a take, but no one has done that yet. And so giving the model the tools to onboard itself and then use full computer use, end-to-end pixels in, coordinates out, and have the cloud computer with different apps in it, is the big unlock that we’ve seen internally, in terms of usage of this going from, oh, we use it for little copy changes, [00:02:00] to no, we’re really like driving new features with this kind of new type of agentic workflow. Alright, let’s see it. Cool. Live Demo Tour Jonas: So this is what it looks like in cursor.com/agents. So this is one I kicked off a while ago. So on the left hand side is the chat. Very classic sort of agentic thing. The big new thing here is that the agent will test its changes. So you can see here it worked for half an hour. That is because it not only took time to write the tokens of code, it also took time to test them end to end. So it started dev servers, iterated when needed. And so that’s one part of it: the model works for longer and doesn’t come back with an “I tried some things” PR, but an “I tested it” PR that’s ready for your review. One of the other intuition pumps we use there is, if a human gave you a PR, asked you to review it, and they hadn’t tested it, you’d also be annoyed, because you’d be like, only ask me for a review once it’s actually ready. So that’s what we’ve done with testing. Testing Defaults and Controls swyx: Simple question I wanted to get out front: some PRs are way smaller, [00:03:00] like just a copy change. Does it always do the video or is it sometimes, Jonas: Sometimes. swyx: Okay. So what’s the judgment? Jonas: The model does it. So we, we do some default prompting with, sort of, what types of changes to test. There’s a slash command that people can do called slash no test, where if you do that, the model will not test. swyx: But the default is test. Jonas: The default is to be calibrated. So we tell it: don’t test very simple copy changes, but test like more complex things. And then users can also write their agents.md and specify, like, this type of… if you’re editing this subpart of my mono repo, never test it ‘cause that won’t work, or whatever. Videos and Remote Control Jonas: So pillar one is the model actually testing. Pillar two is the model coming back with a video of what it did. We have found that in this new world where agents can, end-to-end, write much more code, reviewing the code is one of these new bottlenecks that crop up. And so reviewing a video is not a substitute for reviewing code, but it is an entry point that is much, much easier to start with than glancing at [00:04:00] some giant diff. And so typically you kick one off, it’s done, you come back, and the first thing that you would do is watch this video. So this is a video of it. In this case I wanted a tooltip over this button. And so it went and showed me what that looks like in, in this video. And I think here, it actually used a gallery. So sometimes it will build Storybook-type galleries where you can see like that component in action. And so that’s pillar two: these demo videos of what it built. And then pillar number three is, I have full remote control access to this VM. So I can go in here, I can hover things, I can type, I have full control. And same thing for the terminal. I have full access. And so that is also really useful, because sometimes the video is like all you need to see.
And oftentimes, by the way, the video’s not perfect. The video will show you, is this worth either merging immediately, or oftentimes, is this worth iterating with to get it to that final stage where I am ready to merge in. So I can go through some other examples where the first video [00:05:00] wasn’t perfect, but it gave me confidence that we were on the right track, and two or three follow-ups later, it was good to go. And then I also have full access here, where some things you just wanna play around with. You wanna get a feel for what is this, and there’s no substitute for a live preview. And the VNC kind of VM remote access gives you that. swyx: Amazing. What, sorry? What is VNC? Jonas: Just the remote desktop. Remote desktop. Yeah. swyx: Sam, any other details that you always wanna call out? Samantha: Yeah, for me the videos have been super helpful. I would say, especially in cases where a common problem for me with agents and cloud agents beforehand was almost like under-specification in my requests, where our plan mode and going really back and forth and getting a detailed implementation spec is a way to reduce the risk of under-specification. But then, similar to how human communication breaks down over time, I feel like you have this risk where it’s, okay, when I go through the trouble of pulling down and, like, running this branch locally, I’m gonna see that, like I said, this should be a toggle and you have a checkbox, and like, why didn’t you get that detail? And having the video up front just [00:06:00] makes that alignment, like you’re talking about a shared artifact with the agent, very clear, which has been just super helpful for me. Jonas: I can quickly run through some other Yes. Examples. Meta Agents and More Demos Jonas: So this is a very front-end-heavy one. So one question I was swyx: gonna say, is this only for front Jonas: end? Exactly. One question you might have is, is this only for front end? So this is another example where the thing I wanted it to implement was a better error message for saving secrets. So the cloud agents support adding secrets, that’s part of what it needs to access certain systems. Part of onboarding that is giving access. This is cloud agents working on swyx: cloud agents. Yes. Jonas: So this is a fun thing is Samantha: it can get super meta. It Jonas: can get super meta. It can start its own cloud agents, it can talk to its own cloud agents. Sometimes it’s hard to wrap your mind around that. We have disabled its cloud agents starting more cloud agents. So we currently disallow that. Someday you might. Someday we might. Someday we might. So this actually was mostly a backend change in terms of the error handling here, where if the [00:07:00] secret is far too large, it would… oh, this is actually really cool. Wow. That’s the dev tools. That
The reception to our recent post on Code Reviews has been strong. Catch up! Amid a maelstrom of discussion on whether or not AI is killing SaaS, one of the top publicly listed SaaS companies in the world has just reported record revenues, clearing well over $1.1B in ARR for the first time with a 28% margin. As we comment on the pod, Aaron Levie is the rare public company CEO equally at home in both worlds of Silicon Valley and Wall Street/Main Street, by day helping 70% of the Fortune 500 with their Enterprise Advanced Suite, and yet by night often found in the basements of early startups and tweeting viral insights about the future of agents. Now that Cursor, Cloudflare, Perplexity, Anthropic and more have made Filesystems and Sandboxes and various forms of “Just Give the Agent a Box” cool (not just cool; it is now one of the single hottest areas in AI infrastructure, growing 100% MoM), we find it a delightfully appropriate time to do the episode with the OG CEO who has been giving humans and computers Boxes since he was a college dropout pitching VCs at a Michael Arrington house party. Enjoy our special pod, with fan favorite returning guest/guest cohost Jeff Huber! Note: We didn’t directly discuss the AI vs SaaS debate - Aaron has done many, many, many other podcasts on that, and you should read his definitive essay on it. Most commentators do not understand SaaS businesses because they have never scaled one themselves, nor deeply reflected on what the true value proposition of SaaS is. We also discuss Your Company is a Filesystem: We also shout out CTO Ben Kus and the AI team, who talked about the technical architecture and will return for AIE WF 2026. Full Video Episode Timestamps * 00:00 Adapting Work for Agents * 01:29 Why Every Agent Needs a Box * 04:38 Agent Governance and Identity * 11:28 Why Coding Agents Took Off First * 21:42 Context Engineering and Search Limits * 31:29 Inside Agent Evals * 33:23 Industries and Datasets * 35:22 Building the Agent Team * 38:50 Read Write Agent Workflows * 41:54 Docs Graphs and Founder Mode * 55:38 Token FOMO Culture * 56:31 Production Function Secrets * 01:01:08 Film Roots to Box * 01:03:38 AI Future of Movies * 01:06:47 Media DevRel and Engineering Transcript Adapting Work for Agents Aaron Levie: Like, you don’t write code, you talk to an agent and it goes and does it for you, and you maybe, at best, review it. That’s even probably, like, largely not even what you’re doing. What’s happening is we are changing our work to make the agents effective. In that model, the agent didn’t really adapt to how we work. We basically adapted to how the agent works. All of the economy has to go through that exact same evolution. Right now, it’s a huge asset and an advantage for the teams that do it early and that are kinda wired into doing this, ‘cause you’ll see compounding returns. But that’s just gonna take a while for most companies to actually go and get this deployed. swyx: Welcome to the Latent Space Pod. We’re back in the Chroma studio with, uh, Chroma CEO Jeff Huber. Welcome, returning guest, now guest host. Aaron Levie: It’s a pleasure. Wow. How’d you get upgraded to, uh, to that? swyx: Because he’s like the perfect guy to be guest host for you. Aaron Levie: That makes sense actually. We love context. We, we both really love context, we really do. We really do. swyx: Uh, and we’re here with, uh, Aaron Levie. Welcome. Aaron Levie: Thank you. Good to, uh, good to be [00:01:00] here. swyx: Uh, yeah.
So we’ve all met offline and like chatted a little bit, but like, it’s always nice to get these things in person and conversation. Yeah. You just started off with so much energy. You’re, you’re super excited about agents. I love Aaron Levie: agents. swyx: Yeah. OpenClaw just got by, got bought by OpenAI. No, not bought, but you know, you know what I mean? Aaron Levie: Some, some, you know, acquihire. Executive swyx: hire. Aaron Levie: Executive hire. Okay. Executive hire. Say, swyx: hey, that’s my term. Okay. Um, what are you pounding the table on, on agents? You have so many insightful tweets. Why Every Agent Needs a Box Aaron Levie: Well, the thing that, that we get super excited by, that I think is probably, you know, should be relatively obvious, is we’ve, we’ve built a platform to help enterprises manage their files and their, their corporate files and the permissions of who has access to those files and the sharing and collaboration of those files. All of those files contain really, really important information for the enterprise. It might have your contracts, it might have your research materials, it might have marketing information, it might have your memos. All that data obviously has, you know, predominantly been used by humans. [00:02:00] But there’s been one really interesting problem, which is that, you know, humans only really work with their files during an active engagement with them, and they kind of go away and you don’t really see them for a long time. And all of a sudden, uh, with the power of AI and AI agents, all of that data becomes extremely relevant as this ongoing source of, of answers to new questions, of data that will transform into, into something else that, that produces value in your organization. It, it contains the answer for the new employee that’s onboarding, that needs to ramp up on a project. Um, it contains the answer to the right thing to sell a customer when you’re having a conversation with them. It contains the roadmap information that’s gonna produce the next feature. So all that data, that previously we’ve been just sort of storing and, and you know, occasionally forgetting about, ‘cause we’re only working on the new active stuff, all of that information becomes valuable to the enterprise. And it’s gonna become extremely valuable to end users, because now they can have agents go find what they’re looking for and produce new, new [00:03:00] value and new data on that information. And it’s gonna become incredibly valuable to agents, because agents can roam around and do a bunch of work, and they’re gonna need access to that data as well. And um, and you know, sometimes that will be an agent that is sort of working on behalf of, of, of you, effectively as you, and, and they are kind of accessing all of the same information that you have access to and, and operating as you in the system. And then sometimes there’s gonna be agents that are just, effectively, autonomous and kind of run on their own, and, and you’re gonna collaborate and work with them kind of like you did another person. OpenClaw being the most recent, and maybe first real, sort of, you know, kind of, you know, updating-everybody’s-views-of-this-landscape version of, of what that could look like. Which is, okay, I have an agent. It’s on its own system, it’s on its own computer, it has access to its own tools. I probably don’t give it access to my entire life.
I probably communicate with it like I would an assistant or a colleague, and then it, it sort of has this sandbox environment. So all of that has massive implications for a platform that manages that [00:04:00] enterprise data. We think it’s gonna just transform how we work with all of the enterprise content that we work with, and we just have to make sure we’re building the right platform to support that. swyx: The sort of shorthand I put on it is: as people build agents, everybody’s just realizing that every agent needs a box. Yes. And it’s nice to be called Box and just give everyone a box. Aaron Levie: Hey, I, if I, you know, if we can make that go viral, uh, like I, I think that that terminology, I, that’s the swyx: tagline. Every agent Aaron Levie: needs a box. Every agent needs a box. If we can make that the headline of this, I’m fine with this. And that’s the billboard I wanna, like… Yeah, exactly. Every agent needs a box. Um, I like it. Can we ship this? Like, swyx: okay, let’s do it. Yeah. Aaron Levie: Uh, my work here is done and I got the value I needed outta this podcast. Drinks. swyx: Yeah. Agent Governance and Identity Aaron Levie: But, but, um, but, but, you know, so the thing that we, we kind of think about is, um, is, you know, whether you think the number is 10x or a hundred x or whatever the number is, we’re gonna have some order of magnitude more agents than people. That’s inevitable. It has to happen. So then the question is, what is the infrastructure that’s needed to make all those agents effective in the enterprise? Make sure that they are well governed. Make sure they’re only doing [00:05:00] safe things on your information. Make sure that they’re not getting exposed to data that they shouldn’t have access to. There’s gonna be just incredibly, spectacularly crazy security incidents that will happen with agents, because you’ll prompt-inject an agent and sort of find your way through the CRM system and pull out data that you shouldn’t have access to. Oh, we Jeff Huber: have God, Aaron Levie: right? I mean, that’s just gonna happen all over the place, right? So, so then the thing is, is how do you make sure you have the right security, the permissions, the access controls, the data governance. Um, we actually don’t yet exactly know in many cases how we’re gonna regulate some of these agents, right? If you think about an agent in financial services, does it have the exact same financial sort of, uh, requirements that a human did? Or is the risk fully on the human that was interacting with or created the agent? All open questions. But no matter what, there’s gonna need to be a layer that manages the, the data they have access to, the workflows that they’re involved in, pulling up data from multiple systems. This is the new infrastructure opportunity in the era of agents. swyx: You have a piece on agent identities, [00:06:00] which I think was
This is a free preview of a paid episode. To hear more, visit www.latent.space AIE Europe CFP and AIE World’s Fair paper submissions for CAIS peer review are due TODAY - do not delay! Last call ever. We’re excited to welcome METR for their first LS Pod, hopefully the first of many: METR are the keepers of what is currently the single most infamous chart in AI: But every Latent Space reader should be sophisticated enough to know that the details matter and that hype and hyperbole go hand in hand in AI social media, because the millions of impressions that chart got, by people who don’t understand or care about the nuances, disclaimers, and error bars, far outreach the 69k views on the corrections by the people who actually made the chart: There’s a lot of nuance both in making benchmarks (as we discovered with OpenAI on our SWE-Bench Verified podcast) and in extrapolating results from them, especially where exponentials and sigmoids are concerned. METR’s Long Horizons work itself has known biases that the authors have responsibly disclosed, but these go far too underappreciated in the pursuit of doomer chart porn. If you’re interested in a short, sharable TED talk version of this pod, over at AIE CODE we were blessed to feature Joel twice, as a stage talk and with a longer form small workshop with Q&A: We also make sure to cover some of METR’s lesser known work on Threat Evaluation, but also Developer Productivity, where 2x friend of the pod and now Zyphra founder Quentin Anthony was the ONLY productive participant! Finally, if you’re the sort to read these show notes to the end, then you definitely deserve some pictures of Joel shredding the guitar at Love Band Karaoke, which we mention at the end: Full Video Pod Timestamps
00:00 What METR Means
00:39 Podcast Intro With Joel
01:39 ME vs TR
03:33 Time Horizon Origin Story
04:56 Picking Tasks And Biases
09:13 Time Horizon Misconceptions
11:37 Opus 4.5 And Trendlines
14:27 Productivity Studies And Explosions
29:50 Compute Slows Progress
30:47 Algorithms Need Compute
32:45 Industry Spend and Data
34:57 Clusters and Shipping Timelines
36:44 Prediction Markets for Models
38:10 Manifold Alpha Story
43:04 Beyond Benchmarks Evals
51:39 METR Roadmap and Farewell
Transcript
Swyx joined SAIL! Thank you SAIL Media, Prof. Tom Yeh, 8Lee, Hamid Bagheri, c9n, and many others for tuning into SAIL Live #6 with Nathan Lambert and Sebastian Raschka, PhD. Sharing here for the LS paid subscribers. We covered: This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Editor's note: CuspAI raised a $100m Series A in September and is rumored to have reached a unicorn valuation. They have all-star advisors from Geoff Hinton to Yann LeCun and a team of deep domain experts to tackle this next frontier in AI applications.

In this episode, Max Welling traces the thread connecting quantum gravity, equivariant neural networks, diffusion models, and climate-focused materials discovery (yes, there is one!!!). We begin with a provocative framing: experiments as computation. Welling describes the idea of a "physics processing unit"—a world in which digital models and physical experiments work together, with nature itself acting as a kind of processor. It's a grounded but ambitious vision of AI for science: not replacing chemists, but accelerating them.

Along the way, we discuss:

* Why symmetry and equivariance matter in deep learning
* The tradeoff between scale and inductive bias
* The deep mathematical links between diffusion models and stochastic thermodynamics
* Why materials—not software—may be the real bottleneck for AI and the energy transition
* What it actually takes to build an AI-driven materials platform

Max reflects on moving from curiosity-driven theoretical physics (including work with Gerard 't Hooft) toward impact-driven research in climate and energy. The result is a conversation about convergence: physics and machine learning, digital models and laboratory experiments, long-term ambition and incremental progress.

Full Video Episode

Timestamps

* 00:00:00 – The Physics Processing Unit (PPU): Nature as the Ultimate Computer
* Max introduces the idea of a Physics Processing Unit — using real-world experiments as computation.
* 00:00:44 – From Quantum Gravity to AI for Materials
* Brandon frames Max's career arc: VAE pioneer → equivariant GNNs → materials startup founder.
* 00:01:34 – Curiosity vs Impact: How His Motivation Evolved
* Max explains the shift from pure theoretical curiosity to climate-driven impact.
* 00:02:43 – Why CuspAI Exists: Technology as Climate Strategy
* Politics struggles; technology scales. Why materials innovation became the focus.
* 00:03:39 – The Thread: Physics → Symmetry → Machine Learning
* How gauge symmetry, group theory, and relativity informed equivariant neural networks.
* 00:06:52 – AI for Science Is Exploding (Not Emerging)
* The funding surge and why AI-for-Science feels like a new industrial era.
* 00:07:53 – Why Now? The Two Catalysts Behind AI for Science
* Protein folding, ML force fields, and the tipping point moment.
* 00:10:12 – How Engineers Can Enter AI for Science
* Practical pathways: curriculum, workshops, cross-disciplinary training.
* 00:11:28 – Why Materials Matter More Than Software
* The argument that everything—LLMs included—rests on materials innovation.
* 00:13:02 – Materials as a Search Engine
* The vision: automated exploration of chemical space like querying Google.
* 00:14:48 – Inside CuspAI: The Platform Architecture
* Generative models + multi-scale digital twin + experiment loop.
* 00:21:17 – Automating Chemistry: Human-in-the-Loop First
* Start manual → modular tools → agents → increasing autonomy.
* 00:25:04 – Moonshots vs Incremental Wins
* Balancing lighthouse materials with paid partnerships.
* 00:26:22 – Why Breakthroughs Will Still Require Humans
* Automation is vertical-specific and iterative.
* 00:29:01 – What Is Equivariance (In Plain English)?
* Symmetry in neural networks explained with the bottle example.
* 00:30:01 – Why Not Just Use Data Augmentation?
* The optimization trade-off between inductive bias and data scale.
* 00:31:55 – Generative AI Meets Stochastic Thermodynamics
* His upcoming book and the unification of diffusion models and physics.
* 00:33:44 – When the Book Drops (ICLR?)

Transcript

Max: I want to think of it as what I would call a physics processing unit, like a PPU, right? Which is, you have digital processing units, and then you have physics processing units. So it's basically nature doing computations for you. It's the fastest computer known, maybe the fastest possible. It's a bit hard to program, because you have to do all these experiments; those are quite bulky, it's a very large thing you have to do. But in a way it is a computation, and that's the way I want to see it. You can do computations in a data center, and then you can ask nature to do some computations. Your interface with nature is a bit more complicated. But then these things will have to seamlessly work together to get to a new material that you're interested in.

[01:00:44:14 - 01:01:34:08]

Brandon: Yeah, it's a pleasure to have Max Welling as a guest today. Max has done so much over his career that I've been so excited about. If you're in the deep learning community, you probably know Max for his work on variational autoencoders, which have literally stood the test of time. If you're a scientist, you probably know him for his pioneering work on graph neural networks and equivariance. And if you're in materials science, you probably know him for his new startup, CuspAI. Max has a long history of working on lots of cool problems. You started in quantum gravity, which is, I think, very different from all of these other things you worked on. The first question, for AI engineers and for scientists: what is the thread in how you think about problems? What is the thread in the type of things which excite you? And how do you decide what is the next big thing you want to work on?

[01:01:34:08 - 01:02:41:13]

Max: So it has actually evolved a lot. In my young days, let's say, I would just follow what I would find super interesting. I have kind of this sensor, which I think many people have but maybe don't really use very much, which is, you get this feeling of being very excited about some problem. It could be: what's inside of a black hole, or what's at the boundary of the universe, or what is quantum mechanics actually all about. And so I followed that basically throughout my career. But I have to say that as you get older, this changes a little bit, in the sense that there's a new dimension coming into it, and that's impact. Going into two-dimensional quantum gravity, you're pretty much guaranteed there's going to be no impact from what you do: maybe a few papers, but not in this world, at this energy scale. As I get closer to retirement, which is fortunately still 10 years away or so, I do want to kind of make a positive impact in the world. And I got pretty worried about climate change.

[01:02:43:15 - 01:03:19:11]

Max: I think politics seems to have a hard time solving it, especially these days. And so I thought: better work on it from the technology side. And that's why we started CuspAI. But there are also a lot of really interesting science problems in materials science. And so it's kind of combining both the impact you can make with it as well as the interesting science. So it's sort of these two dimensions: working on things where you feel, well, there's something very deep going on here,
and on the other hand, trying to build tools that can actually make a real impact in the world.

[01:03:19:11 - 01:03:39:23]

RJ: So, on the thread: when I look back at the different things that you worked on, some of them seem pretty connected, like the physics to equivariance and, uh, graph neural networks, maybe. And that seems to be somewhat related to CuspAI. Do you have a thread through there?

[01:03:39:23 - 01:06:52:16]

Max: Yeah. So physics is the thread. Having spent a lot of time in theoretical physics, I think there are, first, very fundamental and exciting questions, things that haven't actually been figured out in quantum gravity. So that is really the frontier. There are also a lot of mathematical tools that you can use, right? For instance, in particle physics, but also in general relativity, symmetries play an enormously important role. And this goes all the way to gauge symmetries as well. And so applying these kinds of symmetries to machine learning was something I thought of as a very deep and interesting mathematical problem. I did this with Taco Cohen, and Taco was the main driver behind this; we went all the way from simple rotational symmetries to gauge symmetries on spheres and stuff like that. And Maurice Weiler, who was a very good PhD student with me, wrote an entire book, which I can really recommend, about the role of symmetries in AI and machine learning. So I find this a very deep and interesting problem. More recently, I've taken a somewhat different path, which is the relationship between diffusion models and a field called stochastic thermodynamics. This is basically thermodynamics, which is a theory of equilibrium, but then formulated for out-of-equilibrium systems. And it turns out that the mathematics we use for diffusion models, but also for reinforcement learning, for Schrödinger bridges, for MCMC sampling, is the same mathematics as this physical theory of non-equilibrium systems. And that got me very excited. Actually, when I taught a course in Muizenberg, in South Africa, close to Cape Town, at the African Institute for Mathematical Sciences (AIMS), I turned that into a book. Two years later, the book was finished; I've sent it to the publisher. And this is about the deep relationship between free energy, diffusion models, basically generative AI, and stochastic thermodynamics. So it's always some kind of, I don't know, I find physics very deep. I also think a lot about quantum mechanics, and it's a completely weird theory that actually nobody really understands. And th
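Editor's note: since equivariance is the recurring concept here (the "bottle example" at 00:29:01), a minimal numerical sketch of the property itself may help: a map is equivariant if transforming the input and then applying the map gives the same result as applying the map and then transforming the output. The toy function and data below are invented for illustration; this is not CuspAI code.

```python
# Toy demonstration of rotation equivariance (invented example).
import numpy as np

def centroid_offsets(points: np.ndarray) -> np.ndarray:
    """A toy equivariant map: offsets of each point from the centroid.
    Rotating the input rotates the output the same way."""
    return points - points.mean(axis=0)

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))  # 5 points in 3D

# A random orthogonal (rotation/reflection) matrix via the QR trick.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# Equivariance: f(R x) == R f(x). Here R acts on row vectors as x @ Q.T.
lhs = centroid_offsets(points @ Q.T)
rhs = centroid_offsets(points) @ Q.T
print("equivariant:", np.allclose(lhs, rhs))  # True

# Invariance is the special case where the output doesn't change at all,
# e.g. pairwise distances between points.
d0 = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
p2 = points @ Q.T
d1 = np.linalg.norm(p2[:, None] - p2[None, :], axis=-1)
print("invariant:", np.allclose(d0, d1))  # True
```

Baking this constraint into the architecture, rather than hoping the network learns it from rotated copies of the data, is exactly the inductive-bias-versus-data-augmentation trade-off discussed at 00:30:01.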
This is a free preview of a paid episode. To hear more, visit www.latent.space

First speakers for AIE Europe and AIEi Miami have been announced. If you're in Asia/Aus, come by Singapore and Melbourne. AI Engineering is going global!

One year ago today, Anthropic launched Claude Code, to not much fanfare: The word of mouth was incredibly strong, however, and so we were glad to be one of the first podcasts to invite Boris and Cat on in early May: As we discussed on the pod, all CC usage was API-based, and therefore it was ridiculously expensive to do anything. This was then fixed by the team including Claude Code in the Claude Pro plan in early June, and then the virality caused us to make a rare trend call in late June: Now, 6 months on, Doug has just calculated that around 4% of GitHub is written by Claude Code: We talk about how Doug uses Claude Code to do SemiAnalysis work.

Memory Mania

In the second part of this episode, we also check in on Memory Mania, which is going to affect you (yes, you) at home if it hasn't already:

Full Episode on YouTube

Timestamps

00:00 AI as Junior Analyst
00:59 Meet Swyx and Doug
03:30 From Value Mule to Semis
06:28 Moore's Law Ends Thesis
12:02 Claude Code Awakening
32:02 Agent Swarms Reality Check
32:53 Kimi Swarm Benchmarks
37:31 Bots vs Zapier Automation
39:44 Claude Code Workflow Setup
57:54 AGI Metrics and GDP
01:04:48 Railroad CapEx Analogy
01:06:00 Funding Bubbles and Demand
01:08:11 Agents Replace Work Tools
01:13:56 Codex vs Claude Race
01:21:15 Microsoft and TPU Strategy
01:34:13 TPU Window vs Nvidia
01:36:30 HBM Supply Chain Squeeze
01:39:41 Memory Shock and CXL
01:45:20 Context Rationing Future
01:54:37 Writing and Trail Lessons

Transcript

[00:00:00] AI as Junior Analyst

[00:00:00] Doug: This crap makes mistakes all the time. All the time. It is still just, like... I think of it, once again, as like a junior analyst, right? The analyst goes and gathers all this really pain-in-the-ass information, and you bring it all together to make a good decision at the top. Historically, what happens is that junior analyst, who I once was, went and gathered all that information, and after doing this enough times, there's a meta-level thinking that's happening, where it's like: okay, here's what I really understand about this type of analysis; I'm an expert in it, actually, I'm very good at it, I consistently have a hit rate.

[00:00:28] Now I'm the expert, right? I don't think that meta-level learning is there yet. We'll see if LLMs do it, right? Everyone who's spending one quadrillion dollars in the world thinks it will. It better, it better happen, because if you're spending, you know, a trillion dollars and there's not meta-level learning...

[00:00:44] But for me, in our firm, that massively amplifies everyone who is an expert. 'Cause, like, you still have to do something; you can't just, like, slop it up. It's very obvious to me what is slop.

[00:00:59] Meet Swyx and Doug
Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) arguing that SWE-Bench Verified—long treated as a key "North Star" coding benchmark—has become saturated and highly contaminated, making it less useful for measuring real coding progress.

SWE-Bench Verified originated as a major OpenAI-led cleanup of the original Princeton SWE-Bench benchmark, including a large human review effort with nearly 100 software engineers and multiple independent reviews to curate ~500 higher-quality tasks. But recent findings show that many remaining failures can reflect unfair or overly narrow tests (e.g., requiring specific naming or unspecified implementation details) rather than true model inability (see the toy illustration below), and cite examples suggesting contamination, such as models recalling repository-specific implementation details or task identifiers.

From now on, OpenAI plans to stop reporting SWE-Bench Verified and instead focus on SWE-Bench Pro (from Scale), which is harder, more diverse (more repos and languages), includes longer tasks (1–4 hours and 4+ hours), and shows substantially less evidence of contamination under their "contamination auditor agent" analysis.

We also discuss what future coding/agent benchmarks should measure beyond pass/fail tests—longer-horizon tasks, open-ended design decisions, code quality/maintainability, and real-world product-building—along with the tradeoffs between fast automated grading and human-intensive evaluation.

00:00 Meet the Frontier Evals Team
00:56 Why SWE Bench Stalled
01:47 How Verified Was Built
04:32 Contamination In The Wild
06:16 Unfair Tests And Narrow Specs
08:40 When Benchmarks Saturate
10:28 Switching To SWE Bench Pro
12:31 What Great Coding Evals Measure
18:17 Beyond Tests Dollars And Autonomy
21:49 Preparedness And Future Directions

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
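Editor's note: to make the "unfair or overly narrow tests" point concrete, here is an invented example in the style of a SWE-Bench task's hidden tests, not an actual benchmark task. The issue only asked that invalid input raise an error, but the hidden tests also pin an unspecified error message and an internal helper name (the module `mypkg.rates` and helper `_validate_rate` are hypothetical), so a functionally correct patch still fails.

```python
# Invented illustration of an overly narrow hidden test (not a real
# SWE-Bench task). The issue only asked that invalid input raise an error.

# A functionally correct patch might look like this:
def parse_rate(value: str) -> float:
    rate = float(value)
    if rate < 0:
        raise ValueError(f"rate must be non-negative, got {rate}")
    return rate

# ...but hidden tests like these also pin an unspecified message and an
# internal helper name, so the correct patch above fails anyway:
import pytest

def test_negative_rate():
    # Fails: the exception is raised, but the message doesn't match.
    with pytest.raises(ValueError, match="negative rates are not allowed"):
        parse_rate("-1")

def test_helper_exists():
    # Fails: the issue never specified this internal name.
    from mypkg.rates import _validate_rate
    assert callable(_validate_rate)
```

Grading against tests like these measures memorization of one specific repo history, not the ability to fix the bug, which is exactly the failure mode the episode describes.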
Tickets for AIEi Miami and AIE Europe are live, with first wave speakers announced!

From pioneering software-defined networking to backing many of the most aggressive AI model companies of this cycle, Martin Casado and Sarah Wang sit at the center of the capital, compute, and talent arms race reshaping the tech industry. As partners at a16z investing across infrastructure and growth, they've watched venture and growth blur, model labs turn dollars into capability at unprecedented speed, and startups raise nine-figure rounds before monetization.

Martin and Sarah join us to unpack the new financing playbook for AI: why today's rounds are really compute contracts in disguise, how the "raise → train → ship → raise bigger" flywheel works, and whether foundation model companies can outspend the entire app ecosystem built on top of them. They also share what's underhyped (boring enterprise software), what's overheated (talent wars and compensation spirals), and the two radically different futures they see for AI's market structure.

We discuss:

* Martin's "two futures" fork: infinite fragmentation and new software categories vs. a small oligopoly of general models that consume everything above them
* The capital flywheel: how model labs translate funding directly into capability gains, then into revenue growth measured in weeks, not years
* Why venture and growth have merged: $100M–$1B hybrid rounds, strategic investors, compute negotiations, and complex deal structures
* The AGI vs. product tension: allocating scarce GPUs between long-term research and near-term revenue flywheels
* Whether frontier labs can out-raise and outspend the entire app ecosystem built on top of their APIs
* Why today's talent wars ($10M+ comp packages, $B acqui-hires) are breaking early-stage founder math
* Cursor as a case study: building up from the app layer while training down into your own models
* Why "boring" enterprise software may be the most underinvested opportunity in the AI mania
* Hardware and robotics: why the ChatGPT moment hasn't yet arrived for robots and what would need to change
* World Labs and generative 3D: bringing the marginal cost of 3D scene creation down by orders of magnitude
* Why public AI discourse is often wildly disconnected from boardroom reality and how founders should navigate the noise

Show Notes:

* "Where Value Will Accrue in AI: Martin Casado & Sarah Wang" - a16z show
* "Jack Altman & Martin Casado on the Future of Venture Capital"
* World Labs

Martin Casado
• LinkedIn: https://www.linkedin.com/in/martincasado/
• X: https://x.com/martin_casado

Sarah Wang
• LinkedIn: https://www.linkedin.com/in/sarah-wang-59b96a7
• X: https://x.com/sarahdingwang

a16z
• https://a16z.com/

Timestamps

00:00:00 – Intro: Live from a16z
00:01:20 – The New AI Funding Model: Venture + Growth Collide
00:03:19 – Circular Funding, Demand & "No Dark GPUs"
00:05:24 – Infrastructure vs Apps: The Lines Blur
00:06:24 – The Capital Flywheel: Raise → Train → Ship → Raise Bigger
00:09:39 – Can Frontier Labs Outspend the Entire App Ecosystem?
00:11:24 – Character AI & The AGI vs Product Dilemma
00:14:39 – Talent Wars, $10M Engineers & Founder Anxiety
00:17:33 – What's Underinvested? The Case for "Boring" Software
00:19:29 – Robotics, Hardware & Why It's Hard to Win
00:22:42 – Custom ASICs & The $1B Training Run Economics
00:24:23 – American Dynamism, Geography & AI Power Centers
00:26:48 – How AI Is Changing the Investor Workflow (Claude Cowork)
00:29:12 – Two Futures of AI: Infinite Expansion or Oligopoly?
00:32:48 – If You Can Raise More Than Your Ecosystem, You Win
00:34:27 – Are All Tasks AGI-Complete? Coding as the Test Case
00:38:55 – Cursor & The Power of the App Layer
00:44:05 – World Labs, Spatial Intelligence & 3D Foundation Models
00:47:20 – Thinking Machines, Founder Drama & Media Narratives
00:52:30 – Where Long-Term Power Accrues in the AI Stack

Transcript

Latent.Space - Inside AI's $10B+ Capital Flywheel — Martin Casado & Sarah Wang of a16z

[00:00:00] Welcome to Latent Space (Live from a16z) + Meet the Guests

[00:00:00] Alessio: Hey everyone. Welcome to the Latent Space podcast, live from a16z. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.

[00:00:08] swyx: Hey, hey, hey. And we're so glad to be on with you guys, also a top AI podcast. Uh, Martin Casado and Sarah Wang, welcome.

[00:00:16] Martin Casado: Very happy to be here, and welcome.

[00:00:17] swyx: Yes, uh, we love this office. We love what you've done with the place. The new logo is everywhere now; it still takes a while to get used to, but it reminds me of sort of a callback to a more ambitious age, which I think is kind of...

[00:00:31] Martin Casado: It definitely makes a statement.

[00:00:33] swyx: Yeah.

[00:00:34] Martin Casado: Not quite sure what that statement is, but it makes a statement.

[00:00:37] swyx: Uh, Martin, I go back with you to Netlify.

[00:00:40] Martin Casado: Yep.

[00:00:40] swyx: And, uh, you know, you created software-defined networking and all that stuff; people can read up on your background. Sarah, I'm newer to you. You two sort of started working together on AI infrastructure stuff.

[00:00:51] Sarah Wang: That's right. Yeah. Seven, seven years ago now.

[00:00:53] Martin Casado: Best growth investor in the entire industry.

[00:00:55] swyx: Oh, say more.

[00:00:56] Martin Casado: Hands down, there is. [00:01:00] I mean, when it comes to AI companies, Sarah, I think, has done the most kind of aggressive investment thesis around AI models, right? So: Noam, Mira, Fei-Fei, just these frontier, kind of like, large AI models. I think Sarah's been the broadest investor. Is that fair?

[00:01:20] Venture vs. Growth in the Frontier Model Era

[00:01:20] Sarah Wang: No, I... well, I was gonna say, I think it's been a really interesting tag team, actually, just 'cause a lot of these big deals, not only are they raising a lot of money, it's still a tech founder bet, which obviously is inherently early stage.

[00:01:33] But the resources...

[00:01:36] Martin Casado: So many.

[00:01:36] Sarah Wang: I was gonna say, the resources: one, they just grow really quickly. But then two, the resources that they need day one are kind of growth scale. So the hybrid tag team that we have is quite effective, I think.

[00:01:46] Martin Casado: What is growth these days? You know, you don't wake up if it's less than a billion. No, it's actually a very interesting time in investing, because, like, you know, take, like, the Character AI round, right?
[00:01:59] These tend to [00:02:00] be pre-monetization, but the dollars are large enough that you need to have a larger fund, and the analysis, you know, because you've got lots of users, 'cause this stuff has such high demand, requires more numbers sophistication. And so most of these deals, whether it's us or other firms, on these large model companies, are like this hybrid between venture and growth.

[00:02:18] Sarah Wang: Yeah, totally. And I think, you know, stuff like BD, for example: you wouldn't usually need BD when you were seed stage, trying to get market...

Martin Casado: Biz Devrel.

Sarah Wang: Biz Devrel, exactly. Okay. But like now... sorry, I'm...

[00:02:27] swyx: I'm not familiar. What does Biz Devrel mean for a venture fund? Because I know what Biz Devrel means for a company.

[00:02:31] Sarah Wang: Yeah.

[00:02:32] Compute Deals, Strategics, and the 'Circular Funding' Question

[00:02:32] Sarah Wang: You know, so a good example is... I mean, we talk about buying compute, but there's a huge negotiation involved there, in terms of: okay, do you get equity for the compute? What sort of partner are you looking at? Is there a go-to-market arm to that? Um, and these are just things on this scale, hundreds of millions, you know, maybe [00:02:50] six months into the inception of a company. You just wouldn't have had to negotiate these deals before.

[00:02:54] Martin Casado: Yeah. These large rounds are very complex now. In the past, if you did a Series A [00:03:00] or a Series B, like, whatever, you're writing a 20 to a $60 million check and you call it a day. Now you normally have financial investors and strategic investors, and then the strategic portion always still goes with these kind of large compute contracts, which can take months to do.

[00:03:13] And so it's very different times. I've been doing this for 10 years, and I've never seen anything like this.

[00:03:19] swyx: Yeah. Do you have worries about the circular funding from all of these strategics?

[00:03:24] Martin Casado: I mean, listen, as long as the demand is there... like, the demand is there. The problem with the internet is the demand wasn't there.

[00:03:29] swyx: Exactly. All right, this is like the whole pyramid-scheme bubble thing, where as long as you mark to market on the notional value of these deals, fine, but once it starts to chip away, it really...

[00:03:41] Martin Casado: Well, no: as long as there's demand. I mean, you know, a lot of these sound bites have already become kind of cliches, but they're worth saying.

[00:03:47] Right? Like, during the internet days, we were, um, raising money to put fiber in the ground that wasn't used. And that's a problem, right? Because now you actually have a supply overhang.

[00:03:58] swyx: Mm-hmm.

[00:03:59] Martin Casado: And even in [00:04:00] the time of the internet, the supply and bandwidth overhang, as massive as it was, and as massive as the crash was, only lasted about four years.

[00:04:09] But we don't have a supply
From rewriting Google’s search stack in the early 2000s to reviving sparse trillion-parameter models and co-designing TPUs with frontier ML research, Jeff Dean has quietly shaped nearly every layer of the modern AI stack. As Chief AI Scientist at Google and a driving force behind Gemini, Jeff has lived through multiple scaling revolutions from CPUs and sharded indices to multimodal models that reason across text, video, and code. Jeff joins us to unpack what it really means to “own the Pareto frontier,” why distillation is the engine behind every Flash model breakthrough, how energy (in picojoules) not FLOPs is becoming the true bottleneck, what it was like leading the charge to unify all of Google’s AI teams, and why the next leap won’t come from bigger context windows alone, but from systems that give the illusion of attending to trillions of tokens. We discuss: * Jeff’s early neural net thesis in 1990: parallel training before it was cool, why he believed scaling would win decades early, and the “bigger model, more data, better results” mantra that held for 15 years * The evolution of Google Search: sharding, moving the entire index into memory in 2001, softening query semantics pre-LLMs, and why retrieval pipelines already resemble modern LLM systems * Pareto frontier strategy: why you need both frontier “Pro” models and low-latency “Flash” models, and how distillation lets smaller models surpass prior generations * Distillation deep dive: ensembles → compression → logits as soft supervision, and why you need the biggest model to make the smallest one good * Latency as a first-class objective: why 10–50x lower latency changes UX entirely, and how future reasoning workloads will demand 10,000 tokens/sec * Energy-based thinking: picojoules per bit, why moving data costs 1000x more than a multiply, batching through the lens of energy, and speculative decoding as amortization * TPU co-design: predicting ML workloads 2–6 years out, speculative hardware features, precision reduction, sparsity, and the constant feedback loop between model architecture and silicon * Sparse models and “outrageously large” networks: trillions of parameters with 1–5% activation, and why sparsity was always the right abstraction * Unified vs. 
specialized models: abandoning symbolic systems, why general multimodal models tend to dominate vertical silos, and when vertical fine-tuning still makes sense
* Long context and the illusion of scale: beyond needle-in-a-haystack benchmarks toward systems that narrow trillions of tokens to 117 relevant documents
* Personalized AI: attending to your emails, photos, and documents (with permission), and why retrieval + reasoning will unlock deeply personal assistants
* Coding agents: 50 AI interns, crisp specifications as a new core skill, and how ultra-low latency will reshape human–agent collaboration
* Why ideas still matter: transformers, sparsity, RL, hardware, systems — scaling wasn't blind; the pieces had to multiply together

Show Notes:

* Gemma 3 Paper
* Gemma 3
* Gemini 2.5 Report
* Jeff Dean's "Software Engineering Advice from Building Large-Scale Distributed Systems" Presentation (with Back of the Envelope Calculations)
* Latency Numbers Every Programmer Should Know by Jeff Dean
* The Jeff Dean Facts
* Jeff Dean Google Bio
* Jeff Dean on "Important AI Trends" @Stanford AI Club
* Jeff Dean & Noam Shazeer — 25 years at Google (Dwarkesh)

Jeff Dean
* LinkedIn: https://www.linkedin.com/in/jeff-dean-8b212555
* X: https://x.com/jeffdean

Google
* https://google.com
* https://deepmind.google

Full Video Episode

Timestamps

00:00:04 — Introduction: Alessio & Swyx welcome Jeff Dean, chief AI scientist at Google, to the Latent Space podcast
00:00:30 — Owning the Pareto Frontier & balancing frontier vs low-latency models
00:01:31 — Frontier models vs Flash models + role of distillation
00:03:52 — History of distillation and its original motivation
00:05:09 — Distillation's role in modern model scaling
00:07:02 — Model hierarchy (Flash, Pro, Ultra) and distillation sources
00:07:46 — Flash model economics & wide deployment
00:08:10 — Latency importance for complex tasks
00:09:19 — Saturation of some tasks and future frontier tasks
00:11:26 — On benchmarks, public vs internal
00:12:53 — Example long-context benchmarks & limitations
00:15:01 — Long-context goals: attending to trillions of tokens
00:16:26 — Realistic use cases beyond pure language
00:18:04 — Multimodal reasoning and non-text modalities
00:19:05 — Importance of vision & motion modalities
00:20:11 — Video understanding example (extracting structured info)
00:20:47 — Search ranking analogy for LLM retrieval
00:23:08 — LLM representations vs keyword search
00:24:06 — Early Google search evolution & in-memory index
00:26:47 — Design principles for scalable systems
00:28:55 — Real-time index updates & recrawl strategies
00:30:06 — Classic "Latency numbers every programmer should know"
00:32:09 — Cost of memory vs compute and energy emphasis
00:34:33 — TPUs & hardware trade-offs for serving models
00:35:57 — TPU design decisions & co-design with ML
00:38:06 — Adapting model architecture to hardware
00:39:50 — Alternatives: energy-based models, speculative decoding
00:42:21 — Open research directions: complex workflows, RL
00:44:56 — Non-verifiable RL domains & model evaluation
00:46:13 — Transition away from symbolic systems toward unified LLMs
00:47:59 — Unified models vs specialized ones
00:50:38 — Knowledge vs reasoning & retrieval + reasoning
00:52:24 — Vertical model specialization & modules
00:55:21 — Token count considerations for vertical domains
00:56:09 — Low resource languages & contextual learning
00:59:22 — Origins: Dean's early neural network work
01:10:07 — AI for coding & human–model interaction styles
01:15:52 — Importance of crisp specification for coding agents
01:19:23 — Prediction: personalized models & state retrieval
01:22:36 — Token-per-second targets (10k+) and reasoning throughput
01:23:20 — Episode conclusion and thanks

Transcript

Alessio Fanelli [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, editor of Latent Space.

Shawn Wang [00:00:11]: Hello, hello. We're here in the studio with Jeff Dean, chief AI scientist at Google. Welcome.

Jeff Dean: Thanks for having me.

Shawn Wang: It's a bit surreal to have you in the studio. I've watched so many of your talks, and obviously your career has been super legendary. So, I mean, congrats. I think the first thing must be said: congrats on owning the Pareto frontier.

Jeff Dean [00:00:30]: Thank you, thank you. Pareto frontiers are good. It's good to be out there.

Shawn Wang [00:00:34]: Yeah, I mean, I think it's a combination of both. You have to own the Pareto frontier: you have to have frontier capability, but also efficiency, and then offer that range of models that people like to use. And, you know, some part of this was started because of your hardware work, some part of that is your model work, and I'm sure there's lots of secret sauce that you guys have worked on cumulatively. But it's really impressive to see it all come together in this latest advance.

Jeff Dean [00:01:04]: Yeah, yeah. I mean, I think, as you say, it's not just one thing. It's a whole bunch of things up and down the stack. And, you know, all of those really combine to help make us able to make highly capable large models, as well as, you know, software techniques to get those large model capabilities into much smaller, lighter-weight models that are much more cost-effective and lower latency, but still, you know, quite capable for their size. Yeah.

Alessio Fanelli [00:01:31]: How much pressure do you have on, like, having the lower bound of the Pareto frontier, too? I think, like, the new labs are always trying to push the top performance frontier, because they need to raise more money and all of that. And you guys have billions of users. And I think initially, when you worked on the TPU, you were thinking about, you know, if everybody that used Google used the voice model for, like, three minutes a day, you'd need to double your CPU count. Like, what's that discussion today at Google? How do you prioritize frontier versus, like, we have to do this? How do we actually deploy it if we build it?

Jeff Dean [00:02:03]: Yeah, I mean, I think we always want to have models that are at the frontier or pushing the frontier, because I think that's where you see what capabilities now exist that didn't exist at the sort of slightly less capable last year's version or six-months-ago version. At the same time, you know, we know those are going to be really useful for a bunch of use cases, but they're going to be a bit slower and a bit more expensive than people might like for a bunch of other, broader models. So I think what we want to do is always have kind of a highly capable, sort of affordable model that enables a whole bunch of, you know, lower-latency use cases. People can use them for agentic coding much more readily, and then have the high-end, you know, frontier model that is really useful for deep reasoning, solving really complicated math problems, those kinds of things. And it's not that one or the other is useful. They're both useful. So I think we'd like to do both.
And also, you know, through distillation, which is a key technique for making the smaller models more capable: you have to have the frontier model in order to then distill it into your smaller model. So it's not like an either-or choice. You sort of need that in order to actually get a highly capable, more modest-size model. Yeah.

Alessio Fanelli [00:03:24]: I mean, you and Geoffrey came up with distillation in 2014.

Jeff Dean [00:03:
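Editor's note: since distillation is the through-line of this exchange (and of the Hinton, Vinyals, and Dean paper referenced above), here is a minimal, hedged sketch of the classic objective: the student matches the teacher's temperature-softened logits, blended with the ordinary hard-label loss. Shapes, temperature, and weighting below are illustrative defaults, not Gemini's actual recipe.

```python
# Minimal knowledge-distillation loss (sketch, not Google's code).
# The student learns from teacher logits ("soft" targets) plus true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T*T factor keeps gradient scale comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random "logits" for a batch of 4 over 10 classes.
s = torch.randn(4, 10)            # student outputs
t = torch.randn(4, 10)            # frontier/teacher outputs
y = torch.randint(0, 10, (4,))    # ground-truth labels
print(distillation_loss(s, t, y))
```

The key intuition Jeff gives stands out in the code: the teacher's full logit distribution carries far more signal per example than a one-hot label, which is why you need the biggest model to make the smallest one good.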
This podcast features Gabriele Corso and Jeremy Wohlwend, co-founders of Boltz and authors of the Boltz Manifesto, discussing the rapid evolution of structural biology models, from AlphaFold to their own open-source suite, Boltz-1 and Boltz-2. The central thesis is that while single-chain protein structure prediction is largely "solved" through evolutionary hints, the next frontier lies in modeling complex interactions (protein-ligand, protein-protein) and generative protein design, which Boltz aims to democratize via open-source foundations and scalable infrastructure.

Full Video Pod On YouTube!

Timestamps

* 00:00 Introduction to Benchmarking and the "Solved" Protein Problem
* 06:48 Evolutionary Hints and Co-evolution in Structure Prediction
* 10:00 The Importance of Protein Function and Disease States
* 15:31 Transitioning from AlphaFold 2 to AlphaFold 3 Capabilities
* 19:48 Generative Modeling vs. Regression in Structural Biology
* 25:00 The "Bitter Lesson" and Specialized AI Architectures
* 29:14 Development Anecdotes: Training Boltz-1 on a Budget
* 32:00 Validation Strategies and the Protein Data Bank (PDB)
* 37:26 The Mission of Boltz: Democratizing Access and Open Source
* 41:43 Building a Self-Sustaining Research Community
* 44:40 Boltz-2 Advancements: Affinity Prediction and Design
* 51:03 BoltzGen: Merging Structure and Sequence Prediction
* 55:18 Large-Scale Wet Lab Validation Results
* 01:02:44 Boltz Lab Product Launch: Agents and Infrastructure
* 01:13:06 Future Directions: Developability and the "Virtual Cell"
* 01:17:35 Interacting with Skeptical Medicinal Chemists

Key Summary

Evolution of Structure Prediction & Evolutionary Hints

* Co-evolutionary Landscapes: The speakers explain that breakthrough progress in single-chain protein prediction relied on decoding evolutionary correlations, where mutations in one position necessitate mutations in another to conserve 3D structure.
* Structure vs. Folding: They differentiate between structure prediction (getting the final answer) and folding (the kinetic process of reaching that state), noting that the field is still quite poor at modeling the latter.
* Physics vs. Statistics: RJ posits that while models use evolutionary statistics to find the right "valley" in the energy landscape, they likely possess a "light understanding" of physics to refine the local minimum.

The Shift to Generative Architectures

* Generative Modeling: A key leap in AlphaFold 3 and Boltz-1 was moving from regression (predicting one static coordinate) to a generative diffusion approach that samples from a posterior distribution.
* Handling Uncertainty: This shift allows models to represent multiple conformational states and avoid the "averaging" effect seen in regression models when the ground truth is ambiguous.
* Specialized Architectures: Despite the "bitter lesson" of general-purpose transformers, the speakers argue that equivariant architectures remain vastly superior for biological data due to the inherent 3D geometric constraints of molecules.

Boltz-2 and Generative Protein Design

* Unified Encoding: Boltz-2 (and BoltzGen) treats structure and sequence prediction as a single task by encoding amino acid identities into the atomic composition of the predicted structure.
* Design Specifics: Instead of a sequence, users feed the model blank tokens and a high-level "spec" (e.g., an antibody framework), and the model decodes both the 3D structure and the corresponding amino acids.
* Affinity Prediction: While model confidence is a common metric, Boltz-2 focuses on affinity prediction—quantifying exactly how tightly a designed binder will stick to its target. Real-World Validation and Productization * Generalized Validation: To prove the model isn’t just “regurgitating” known data, Boltz tested its designs on 9 targets with zero known interactions in the PDB, achieving nanomolar binders for two-thirds of them. * Boltz Lab Infrastructure: The newly launched Boltz Lab platform provides “agents” for protein and small molecule design, optimized to run 10x faster than open-source versions through proprietary GPU kernels. * Human-in-the-Loop: The platform is designed to convert skeptical medicinal chemists by allowing them to run parallel screens and use their intuition to filter model outputs. Transcript RJ [00:05:35]: But the goal remains to, like, you know, really challenge the models, like, how well do these models generalize? And, you know, we’ve seen in some of the latest CASP competitions, like, while we’ve become really, really good at proteins, especially monomeric proteins, you know, other modalities still remain pretty difficult. So it’s really essential, you know, in the field that there are, like, these efforts to gather, you know, benchmarks that are challenging. So it keeps us in line, you know, about what the models can do or not. Gabriel [00:06:26]: Yeah, it’s interesting you say that, like, in some sense, CASP, you know, at CASP 14, a problem was solved and, like, pretty comprehensively, right? But at the same time, it was really only the beginning. So you can say, like, what was the specific problem you would argue was solved? And then, like, you know, what is remaining, which is probably quite open. RJ [00:06:48]: I think we’ll steer away from the term solved, because we have many friends in the community who get pretty upset at that word. And I think, you know, fairly so. But the problem that was, you know, that a lot of progress was made on was the ability to predict the structure of single chain proteins. So proteins can, like, be composed of many chains. And single chain proteins are, you know, just a single sequence of amino acids. And one of the reasons that we’ve been able to make such progress is also because we take a lot of hints from evolution. So the way the models work is that, you know, they sort of decode a lot of hints. That comes from evolutionary landscapes. So if you have, like, you know, some protein in an animal, and you go find the similar protein across, like, you know, different organisms, you might find different mutations in them. And as it turns out, if you take a lot of the sequences together, and you analyze them, you see that some positions in the sequence tend to evolve at the same time as other positions in the sequence, sort of this, like, correlation between different positions. And it turns out that that is typically a hint that these two positions are close in three dimension. So part of the, you know, part of the breakthrough has been, like, our ability to also decode that very, very effectively. But what it implies also is that in absence of that co-evolutionary landscape, the models don’t quite perform as well. And so, you know, I think when that information is available, maybe one could say, you know, the problem is, like, somewhat solved. From the perspective of structure prediction, when it isn’t, it’s much more challenging. 
And I think it's also worth differentiating two things we sometimes confound a little bit: structure prediction and folding. Folding is the more complex process of actually understanding how the protein goes from this disordered state into a structured state. And that, I don't think we've made that much progress on. But the idea of, like, going straight to the answer, we've become pretty good at.

Brandon [00:08:49]: So there's this protein that is, like, just a long chain, and it folds up. And so we're good at getting from that long chain, in whatever form it was originally, to the final thing. But we don't know how it necessarily gets to that state. And there might be intermediate states that it's in sometimes that we're not aware of.

RJ [00:09:10]: That's right. And that relates also to our general ability to model the different... you know, proteins are not static. They move; they take different shapes based on their energy states. And I think we are also not that good at understanding the different states that a protein can be in, and at what frequency, what probability. So I think the two problems are quite related in some ways. Still a lot to solve. But I think it was very surprising at the time, you know, that even with these evolutionary hints, we were able to make such dramatic progress.

Brandon [00:09:45]: So I want to ask, why do the intermediate states matter? But first, I kind of want to understand: why do we care what proteins are shaped like?

Gabriel [00:09:54]: Yeah, I mean, proteins are kind of the machines of our body. The way that all the processes that we have in our cells work is typically through proteins, sometimes other molecules, sort of intermediate interactions. And through those interactions, we have all sorts of cell functions. And so when we try to understand a lot of biology, how our body works, how diseases work, we often try to boil it down to: what is going right in the case of our normal biological function, and what is going wrong in the case of the disease state? And we boil it down to proteins and other molecules and their interactions. And so when we try predicting the structure of proteins, it's critical to have an understanding of those interactions. It's a bit like the difference between having a list of parts that you would put in a car and seeing the car in its final form: seeing the car really helps you understand what it does. On the other hand, going to your question of why we care about how the protein folds, or how the car is made, to some extent: sometimes when something goes wrong, there are cases of proteins misfolding in some diseases and so on. If we don't understand this fo
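Editor's note: the "co-evolving positions hint at 3D contacts" idea described above is easy to make concrete. Below is a hedged toy sketch: given a toy multiple sequence alignment (MSA), compute mutual information between columns; high-MI column pairs are candidate 3D contacts. The alignment is invented, and real pipelines (direct coupling analysis, MSA transformers, AlphaFold's Evoformer) are far more sophisticated.

```python
# Toy illustration of co-evolution signal in a multiple sequence alignment:
# columns that mutate together get high mutual information, hinting they
# may be close in 3D. Invented data, drastically simplified method.
from collections import Counter
from itertools import combinations
from math import log

msa = [  # one invented protein family, one sequence per organism
    "AKLVE",
    "AKLVE",
    "GRLVE",
    "GRLIE",
    "AKLIE",
    "GRLVE",
]

def column(i):
    return [seq[i] for seq in msa]

def mutual_information(i, j):
    n = len(msa)
    pi = Counter(column(i))
    pj = Counter(column(j))
    pij = Counter(zip(column(i), column(j)))
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Columns 0 and 1 always change together (A<->K vs G<->R): high MI,
# suggesting a conserved contact. Invariant columns score zero.
for i, j in combinations(range(5), 2):
    print(f"columns {i},{j}: MI = {mutual_information(i, j):.3f}")
```

The transcript's caveat maps directly onto this sketch: when a protein has few homologs (a shallow MSA), the column statistics are too noisy to decode, which is exactly when structure predictors struggle.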
From Palantir and Two Sigma to building Goodfire into the poster child for actionable mechanistic interpretability, Mark Bissell (Member of Technical Staff) and Myra Deng (Head of Product) are trying to turn "peeking inside the model" into a repeatable production workflow by shipping APIs, landing real enterprise deployments, and now scaling the bet with a recent $150M Series B funding round at a $1.25B valuation.

In this episode, we go far beyond the usual "SAEs are cool" take. We talk about Goodfire's core bet: that the AI lifecycle is still fundamentally broken because the only reliable control we have is data, and we post-train, RLHF, and fine-tune by "slurping supervision through a straw," hoping the model picks up the right behaviors while quietly absorbing the wrong ones. Goodfire's answer is to build a bi-directional interface between humans and models: read what's happening inside, edit it surgically, and eventually use interpretability during training so customization isn't just brute-force guesswork.

Mark and Myra walk through what that looks like when you stop treating interpretability like a lab demo and start treating it like infrastructure: lightweight probes that add near-zero latency, token-level safety filters that can run at inference time, and interpretability workflows that survive messy constraints (multilingual inputs, synthetic→real transfer, regulated domains, no access to sensitive data). We also get a live window into what "frontier-scale interp" means operationally (i.e. steering a trillion-parameter model in real time by targeting internal features), plus why the same tooling generalizes cleanly from language models to genomics, medical imaging, and "pixel-space" world models.

We discuss:

* Myra + Mark's path: Palantir (health systems, forward-deployed engineering) → Goodfire early team; Two Sigma → Head of Product, translating frontier interpretability research into a platform and real-world deployments
* What "interpretability" actually means in practice: not just post-hoc poking, but a broader "science of deep learning" approach across the full AI lifecycle (data curation → post-training → internal representations → model design)
* Why post-training is the first big wedge: "surgical edits" for unintended behaviors like reward hacking, sycophancy, and noise learned during customization, plus the dream of targeted unlearning and bias removal without wrecking capabilities
* SAEs vs probes in the real world: why SAE feature spaces sometimes underperform classifiers trained on raw activations for downstream detection tasks (hallucination, harmful intent, PII), and what that implies about "clean concept spaces"
* Rakuten in production: deploying interpretability-based token-level PII detection at inference time to prevent routing private data to downstream providers, plus the gnarly constraints: no training on real customer PII, synthetic→real transfer, English + Japanese, and tokenization quirks
* Why interp can be operationally cheaper than LLM-judge guardrails: probes are lightweight, low-latency, and don't require hosting a second large model in the loop
* Real-time steering at frontier scale: a demo of steering Kimi K2 (~1T params) live, finding features via SAE pipelines, auto-labeling via LLMs, and toggling a "Gen-Z slang" feature across multiple layers without breaking tool use
* Hallucinations as an internal signal: the case that models have latent uncertainty / "user-pleasing" circuitry you can detect and potentially mitigate more directly than black-box methods
* Steering vs prompting: the emerging view that activation steering and in-context learning are more closely connected than people think, including work mapping between the two (even for jailbreak-style behaviors)
* Interpretability for science: using the same tooling across domains (genomics, medical imaging, materials) to debug spurious correlations and extract new knowledge, up to and including early biomarker discovery work with major partners
* World models + "pixel-space" interpretability: why vision/video models make concepts easier to see, how that accelerates the feedback loop, and why robotics/world-model partners are especially interesting design partners
* The north star: moving from "data in, weights out" to intentional model design where experts can impart goals and constraints directly, not just via reward signals and brute-force post-training

Goodfire AI
* Website: https://goodfire.ai
* LinkedIn: https://www.linkedin.com/company/goodfire-ai/
* X: https://x.com/GoodfireAI

Myra Deng
* Website: https://myradeng.com/
* LinkedIn: https://www.linkedin.com/in/myra-deng/
* X: https://x.com/myra_deng

Mark Bissell
* LinkedIn: https://www.linkedin.com/in/mark-bissell/
* X: https://x.com/MarkMBissell

Full Video Episode

Timestamps

00:00:00 Introduction
00:00:05 Introduction to the Latent Space Podcast and Guests from Goodfire
00:00:29 What is Goodfire? Mission and Focus on Interpretability
00:01:01 Goodfire's Practical Approach to Interpretability
00:01:37 Goodfire's Series B Fundraise Announcement
00:02:04 Backgrounds of Mark and Myra from Goodfire
00:02:51 Team Structure and Roles at Goodfire
00:05:13 What is Interpretability? Definitions and Techniques
00:05:30 Understanding Errors
00:07:29 Post-training vs. Pre-training Interpretability Applications
00:08:51 Using Interpretability to Remove Unwanted Behaviors
00:10:09 Grokking, Double Descent, and Generalization in Models
00:10:15 404 Not Found Explained
00:12:06 Subliminal Learning and Hidden Biases in Models
00:14:07 How Goodfire Chooses Research Directions and Projects
00:15:00 Troubleshooting Errors
00:16:04 Limitations of SAEs and Probes in Interpretability
00:18:14 Rakuten Case Study: Production Deployment of Interpretability
00:20:45 Conclusion
00:21:12 Efficiency Benefits of Interpretability Techniques
00:21:26 Live Demo: Real-Time Steering in a Trillion Parameter Model
00:25:15 How Steering Features are Identified and Labeled
00:26:51 Detecting and Mitigating Hallucinations Using Interpretability
00:31:20 Equivalence of Activation Steering and Prompting
00:34:06 Comparing Steering with Fine-Tuning and LoRA Techniques
00:36:04 Model Design and the Future of Intentional AI Development
00:38:09 Getting Started in Mechinterp: Resources, Programs, and Open Problems
00:40:51 Industry Applications and the Rise of Mechinterp in Practice
00:41:39 Interpretability for Code Models and Real-World Usage
00:43:07 Making Steering Useful for More Than Stylistic Edits
00:46:17 Applying Interpretability to Healthcare and Scientific Discovery
00:49:15 Why Interpretability is Crucial in High-Stakes Domains like Healthcare
00:52:03 Call for Design Partners Across Domains
00:54:18 Interest in World Models and Visual Interpretability
00:57:22 Sci-Fi Inspiration: Ted Chiang and Interpretability
01:00:14 Interpretability, Safety, and Alignment Perspectives
01:04:27 Weak-to-Strong Generalization and Future Alignment Challenges
01:05:38 Final Thoughts and Hiring/Collaboration Opportunities at Goodfire

Transcript

Shawn Wang [00:00:05]: So welcome to the Latent Space pod.
We're back in the studio with our special MechInterp co-host, Vibhu. Welcome. And Mochi, Vibhu's special co-host: Mochi, the mechanistic interpretability doggo. We have with us Mark and Myra from Goodfire. Welcome. Thanks for having us on. Maybe we can sort of introduce Goodfire and then introduce you guys. How do you introduce Goodfire today?

Myra Deng [00:00:29]: Yeah, it's a great question. So Goodfire, we like to say, is an AI research lab that focuses on using interpretability to understand, learn from, and design AI models. And we really believe that interpretability will unlock the new generation, the next frontier, of safe and powerful AI models. That's our description right now, and I'm excited to dive more into the work we're doing to make that happen.

Shawn Wang [00:00:55]: Yeah. And there's always the official description. Is there an unofficial one that resonates more with a different audience?

Mark Bissell [00:01:01]: Well, being an AI research lab that's focused on interpretability, obviously a lot of people have a lot that they think about when they think of interpretability. And I think we have a pretty broad definition of what that means and the types of places it can be applied. And in particular, applying it in production scenarios, in high-stakes industries, and really taking it from the research world into the real world. Which, you know, it's a new field, so that hasn't been done all that much. And we're excited about actually seeing that put into practice.

Shawn Wang [00:01:37]: Yeah, I would say it wasn't too long ago that Anthropic was still putting out toy models of superposition and that kind of stuff. And I wouldn't have pegged it to be this far along. When you and I talked at NeurIPS, you were talking a little bit about your production use cases and your customers. And then, not to bury the lede, today we're also announcing the fundraise, your Series B: $150 million at a 1.25B valuation. Congrats, Unicorn.

Mark Bissell [00:02:02]: Thank you. Yeah, no, things move fast.

Shawn Wang [00:02:04]: We were talking to you in December and already some big updates since then. Let's dive, I guess, into a bit of your backgrounds as well. Mark, you were at Palantir working on health stuff, which is really interesting because Goodfire has some interesting health use cases. I don't know how related they are in practice.

Mark Bissell [00:02:22]: Yeah, not super related, but I don't know, it was helpful context to know what it's like just to work with health systems and generally in that domain. Yeah.

Shawn Wang [00:02:32]: And Myra, you were at Two Sigma, which actually I w
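Editor's note: since "probes trained on raw activations" beating SAE feature spaces is one of the episode's more surprising claims, here is a hedged minimal sketch of what such a probe is: a logistic classifier on hidden states, cheap enough to score every token at inference time. The activations below are faked with random vectors and a made-up "PII direction"; in practice you would capture real activations from a model forward hook. This is not Goodfire's code.

```python
# Minimal "linear probe on activations" sketch (illustrative only).
# A probe is just a lightweight classifier on a model's hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model = 512
rng = np.random.default_rng(0)

# Stand-in for captured residual-stream activations: in practice these
# come from a forward hook on a chosen layer of the model.
concept_dir = rng.normal(size=d_model)       # hypothetical "PII" direction
n = 2000
labels = rng.integers(0, 2, size=n)          # 1 = token is PII (synthetic)
acts = rng.normal(size=(n, d_model)) + np.outer(labels, concept_dir)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("train acc:", probe.score(acts, labels))

# At inference, scoring a token is one dot product plus a sigmoid:
# near-zero latency compared to running a second LLM as a judge.
new_act = rng.normal(size=(1, d_model)) + concept_dir
print("P(PII):", probe.predict_proba(new_act)[0, 1])
```

The operational argument from the episode falls out of the last two lines: a guardrail that costs one dot product per token can run everywhere, while an LLM-judge guardrail means hosting and paying for a second large model in the loop.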
Editor's note: Welcome to our new AI for Science pod, with your new hosts RJ and Brandon! See the writeup on Latent.Space (https://Latent.Space) for more details on why we're launching 2 new pods this year.

RJ Honicky is a co-founder and CTO at MiraOmics (https://miraomics.bio/), building AI models and services for single cell, spatial transcriptomics and pathology slide analysis. Brandon Anderson builds AI systems for RNA drug discovery at Atomic AI (https://atomic.ai). Anything said on this podcast is his personal take — not Atomic's.

From building molecular dynamics simulations at the University of Washington to red-teaming GPT-4 for chemistry applications and co-founding Future House (a focused research organization) and Edison Scientific (a venture-backed startup automating science at scale), Andrew White has spent the last five years living through the full arc of AI's transformation of scientific discovery: from ChemCrow (the first chemistry LLM agent) triggering White House briefings, to shipping Kosmos, an end-to-end autonomous research system that generates hypotheses, runs experiments, analyzes data, and updates its world model to accelerate the scientific method itself.

* The ChemCrow story: GPT-4 + ReAct + cloud lab automation, released March 2023, set off a storm of anxiety about AI-accelerated bioweapons/chemical weapons, led to a White House briefing (Jake Sullivan presented the paper to the president in a 30-minute block), and meetings with three-letter agencies asking "how does this change breakout time for nuclear weapons research?"
* Why scientific taste is the frontier: RLHF on hypotheses didn't work (humans pay attention to tone, actionability, and specific facts, not "if this hypothesis is true/false, how does it change the world?"), so they shifted to end-to-end feedback loops where humans click/download discoveries and that signal rolls up to hypothesis quality
* Kosmos: the full scientific agent with a world model (a distilled memory system, like a Git repo for scientific knowledge) that iterates on hypotheses via literature search, data analysis, and experiment design. Built by Ludo after weeks of failed attempts, the breakthrough was putting data analysis in the loop (literature alone didn't work)
* Why molecular dynamics and DFT are overrated: "MD and DFT have consumed an enormous number of PhDs at the altar of beautiful simulation, but they don't model the world correctly—you simulate water at 330 Kelvin to get room temperature, you overfit to validation data with GGA/B3LYP functionals, and real catalysts (grain boundaries, dopants) are too complicated for DFT"
* The AlphaFold vs. DE Shaw Research counterfactual: DE Shaw built custom silicon, taped out chips with MD algorithms burned in, ran MD at massive scale in a special room in Times Square, and David Shaw flew in by helicopter to present. Andrew thought protein folding would require special machines to fold one protein per day; then AlphaFold solved it in a Google Colab on a desktop GPU
DE Shaw Research counterfactual: DE Shaw built custom silicon, taped out chips with MD algorithms burned in, ran MD at massive scale in a special room in Times Square, and David Shaw flew in by helicopter to present—Andrew thought protein folding would require special machines to fold one protein per day, then AlphaFold solved it in Google Colab on a desktop GPU * The E3 Zero reward hacking saga: trained a model to generate molecules with specific atom counts (verifiable reward), but it kept exploiting loopholes, then a Nature paper came out that year proving six-nitrogen compounds are possible under extreme conditions, then it started adding nitrogen gas (purchasable, doesn’t participate in reactions), then acid-base chemistry to move one atom, and Andrew ended up “building a ridiculous catalog of purchasable compounds in a Bloom filter” to close the loop Andrew White * FutureHouse: http://futurehouse.org/ * Edison Scientific: http://edisonscientific.com/ * X: https://x.com/andrewwhite01 * Cosmos paper: https://futurediscovery.org/cosmos Full Video Episode Timestamps 00:00:00 Introduction: Andrew White on Automating Science with Future House and Edison Scientific00:02:22 The Academic to Startup Journey: Red Teaming GPT-4 and the ChemCrow Paper00:11:35 Future House Origins: The FRO Model and Mission to Automate Science00:12:32 Resigning Tenure: Why Leave Academia for AI Science00:15:54 What Does ‘Automating Science’ Actually Mean?00:17:30 The Lab-in-the-Loop Bottleneck: Why Intelligence Isn’t Enough00:18:39 Scientific Taste and Human Preferences: The 52% Agreement Problem00:20:05 Paper QA, Robin, and the Road to Cosmos00:21:57 World Models as Scientific Memory: The GitHub Analogy00:40:20 The Bitter Lesson for Biology: Why Molecular Dynamics and DFT Are Overrated00:43:22 AlphaFold’s Shock: When First Principles Lost to Machine Learning00:46:25 Enumeration and Filtration: How AI Scientists Generate Hypotheses00:48:15 CBRN Safety and Dual-Use AI: Lessons from Red Teaming01:00:40 The Future of Chemistry is Language: Multimodal Debate01:08:15 Ether Zero: The Hilarious Reward Hacking Adventures01:10:12 Will Scientists Be Displaced? Jevons Paradox and Infinite Discovery01:13:46 Cosmos in Practice: Open Access and Enterprise Partnerships This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
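The Bloom filter trick is worth pausing on, because it is a classic fix for verifier loopholes: a probabilistic set-membership structure compact enough to hold an entire purchasable-compound catalog in memory. Below is a minimal, self-contained sketch of that pattern; the sizing formulas are the standard ones, but the catalog entry, the reward gate, and all names are illustrative assumptions, not FutureHouse’s actual code.

```python
import hashlib
import math

class BloomFilter:
    """Probabilistic set membership: false positives possible, false negatives impossible."""

    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Standard sizing: m bits and k hash functions for a target false-positive rate.
        self.m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted SHA-256 digests.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# Hypothetical reward gate: a proposed compound only counts toward reward if it
# appears in the purchasable catalog, closing the "just add nitrogen gas" style
# of loophole where the model pads its answer with trivially available reagents.
purchasable = BloomFilter(n_items=1_000_000)
purchasable.add("O=C(C)Oc1ccccc1C(=O)O")  # canonical SMILES for aspirin, as an example entry

def reward(candidate_smiles: str) -> float:
    return 1.0 if candidate_smiles in purchasable else 0.0
```

A Bloom filter of a million compounds at a 1% false-positive rate fits in roughly 1.2 MB, which is why it works as an always-loaded guard inside an RL reward function.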
From shipping Gemini Deep Think and IMO Gold to launching the Reasoning and AGI team in Singapore, Yi Tay has spent the last 18 months living through the full arc of Google DeepMind’s pivot from architecture research to RL-driven reasoning—watching his team go from a dozen researchers to 300+, training models that solve International Math Olympiad problems in a live competition, building the infrastructure to scale deep thinking across every domain, and driving Gemini to the top of the leaderboards across every category. Yi returns to dig into the inside story of the IMO effort and more!

We discuss:

* Yi’s path: Brain → Reka → Google DeepMind → Reasoning and AGI team Singapore, leading model training for Gemini Deep Think and IMO Gold
* The IMO Gold story: four co-captains (Yi in Singapore, Jonathan in London, Jordan in Mountain View, and Tong leading the overall effort), training the checkpoint in ~1 week, a live competition in Australia with professors punching in problems as they came out, and the tension of not knowing if they’d hit Gold until the human scores came in (because the Gold threshold is a percentile, not a fixed number)
* Why they threw away AlphaProof: “If one model can’t do it, can we get to AGI?” The decision to abandon symbolic systems and bet on end-to-end Gemini with RL was bold and non-consensus
* On-policy vs. off-policy RL: off-policy is imitation learning (copying someone else’s trajectory), on-policy is the model generating its own outputs, getting rewarded, and training on its own experience—“humans learn by making mistakes, not by copying”
* Why self-consistency and parallel thinking are fundamental: sampling multiple times, majority voting, LM judges, and internal verification are all forms of self-consistency that unlock reasoning beyond single-shot inference (a minimal majority-voting sketch follows below)
* The data efficiency frontier: humans learn from 8 orders of magnitude less data than models, so where’s the bug? Is it the architecture, the learning algorithm, backprop, off-policyness, or something else?
* Three schools of thought on world models: (1) Genie/spatial intelligence (video-based world models), (2) Yann LeCun’s JEPA + FAIR’s code world models (modeling internal execution state), (3) the amorphous “resolution of possible worlds” paradigm (curve-fitting to find the world model that best explains the data)
* Why AI coding crossed the threshold: Yi now runs a job, gets a bug, pastes it into Gemini, and relaunches without even reading the fix—“the model is better than me at this”
* The Pokémon benchmark: can models complete the Pokédex by searching the web, synthesizing guides, and applying knowledge in a visual game state? “Efficient search of novel idea space is interesting, but we’re not even at the point where models can consistently apply knowledge they look up”
* DSI and generative retrieval: re-imagining search as predicting document identifiers with semantic tokens, now deployed at YouTube (semantic IDs for RecSys) and Spotify
* Why RecSys and IR feel like a different universe: “modeling dynamics are strange, like gravity is different—you hit the shuttlecock and hear glass shatter, cause and effect are too far apart”
* The closed lab advantage is increasing: the gap between frontier labs and open source is growing because ideas compound over time, and researchers keep finding new tricks that play well with everything built before
* Why ideas still matter: “the last five years weren’t just blind scaling—transformers, pre-training, RL, self-consistency, all had to play well together to get us here”
* Gemini Singapore: hiring RL and reasoning researchers, looking for a track record in RL or exceptional achievement in coding competitions, and building a small, talent-dense team close to the frontier

— Yi Tay
* Google DeepMind: https://deepmind.google
* X: https://x.com/YiTayML

Full Video Episode

Timestamps

00:00:00 Introduction: Returning to Google DeepMind and the Singapore AGI Team
00:04:52 The Philosophy of On-Policy RL: Learning from Your Own Mistakes
00:12:00 IMO Gold Medal: The Journey from AlphaProof to End-to-End Gemini
00:21:33 Training IMO Cat: Four Captains Across Three Time Zones
00:26:19 Pokemon and Long-Horizon Reasoning: Beyond Academic Benchmarks
00:32:59 Reasoning, Chain of Thought, and Latent Thinking
00:36:29 AI Coding Assistants: From Lazy to Actually Useful
00:44:46 Is Attention All You Need? Architecture, Learning, and the Local Minima
00:55:04 Data Efficiency and World Models: The Next Frontier
01:08:12 DSI and Generative Retrieval: Reimagining Search with Semantic IDs
01:17:59 Building GDM Singapore: Geography, Talent, and the Symposium
01:24:18 Hiring Philosophy: High Stats, Research Taste, and Student Budgets
01:28:49 Health, HRV, and Research Performance: The 23kg Journey

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
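The majority-voting flavor of self-consistency is simple enough to fit in a few lines. A minimal sketch, where `sample_fn` is a hypothetical stand-in for any model call at nonzero temperature that returns a final answer string:

```python
from collections import Counter

def self_consistency(sample_fn, question: str, n: int = 16) -> str:
    """Sample n independent reasoning paths and majority-vote the final answer."""
    answers = [sample_fn(question) for _ in range(n)]
    best_answer, _ = Counter(answers).most_common(1)[0]
    return best_answer
```

The LM-judge and internal-verification variants Yi mentions keep the same shape but replace the `Counter` vote with a model scoring or filtering each candidate.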
From building internal AI labs to becoming CTO of Brex, James Reggio has helped lead one of the most disciplined AI transformations inside a real financial institution, where compliance, auditability, and customer trust actually matter. We sat down with Reggio to unpack Brex’s three-pillar AI strategy (corporate, operational, and product AI) [https://www.brex.com/journal/brex-ai-native-operations], how SOP-driven agents beat overengineered RL in ops, why Brex lets employees “build their own AI stack” instead of picking winners [https://www.conductorone.com/customers/brex/], and how a small, founder-heavy AI team is shipping production agents to 40,000+ companies.

Reggio also goes deep on Brex’s multi-agent “network” architecture, evals for multi-turn systems, agentic coding’s second-order effects on codebase understanding, and why the future of finance software looks less like dashboards and more like executive assistants coordinating specialist agents behind the scenes.

We discuss:

* Brex’s three-pillar AI strategy: corporate AI for 10x employee workflows, operational AI for cost and compliance leverage, and product AI that lets customers justify Brex as part of their AI strategy to the board
* Why SOP-driven agents beat overengineered RL in finance ops, and how breaking work into auditable, repeatable steps unlocked faster automation in KYC, underwriting, fraud, and disputes
* Building an internal AI platform early: LLM gateways, prompt/version management, evals, cost observability, and why platform work quietly became the force multiplier behind everything else
* Multi-agent “networks” vs. single-agent tools: why Brex’s EA-style assistant coordinates specialist agents (policy, travel, reimbursements) through multi-turn conversations instead of one-shot tool calls
* The audit agent pattern: separating detection, judgment, and follow-up into different agents to reduce false negatives without overwhelming finance teams
* Centralized AI teams without resentment: how Brex avoided “AI envy” by tying work to business impact and letting anyone transfer in if they cared deeply enough
* Letting employees build their own AI stack: ChatGPT vs. Claude vs. Gemini, Cursor vs. Windsurf, and why Brex refuses to pick winners in fast-moving tool races
* Measuring adoption without vanity metrics: why “% of code written by AI” is the wrong KPI and what second-order effects (slop, drift, code ownership) actually matter
* Evals in the real world: regression tests from ops QA, LLM-as-judge for multi-turn agents (a minimal sketch follows below), and why integration-style evals break faster than you expect
* Teaching AI fluency at scale: the user → advocate → builder → native framework, ops-led training, spot bonuses, and avoiding fear-based adoption
* Re-interviewing the entire engineering org: using agentic coding interviews internally to force hands-on skill upgrades without formal performance scoring
* Headcount in the age of agents: why Brex grew the business without growing engineering, and why AI amplifies bad architecture as fast as good decisions
* The future of finance software: why dashboards fade, assistants take over, and agent-to-agent collaboration becomes the real UI

— James Reggio
* X: https://x.com/jamesreggio
* LinkedIn: https://www.linkedin.com/in/jamesreggio/

Where to find Latent Space
* X: https://x.com/latentspacepod

Full Video Episode

Timestamps

00:00:00 Introduction
00:01:24 From Mobile Engineer to CTO: The Founder's Path
00:03:00 Quitters Welcome: Building a Founder-Friendly Culture
00:05:13 The AI Team Structure: 10-Person Startup Within Brex
00:11:55 Building the Brex Agent Platform: Multi-Agent Networks
00:13:45 Tech Stack Decisions: TypeScript, Mastra, and MCP
00:16:40 The Brex Assistant: Executive Assistant for Every Employee
00:24:32 Operational AI: Automating Underwriting, KYC, and Fraud
00:37:11 Agentic Coding Adoption: Cursor, Windsurf, and the Engineering Interview
00:40:26 Evaluation Strategy: From Simple SOPs to Multi-Turn Evals
00:58:51 AI Fluency Levels: From User to Native
01:03:33 The Future of Engineering Headcount and AI Leverage
01:09:14 The Audit Agent Network: Finance Team Agents in Action

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
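For concreteness, here is a minimal sketch of the LLM-as-judge pattern for multi-turn transcripts mentioned above. The rubric, the JSON contract, and the `llm` callable are all illustrative assumptions, not Brex’s implementation:

```python
import json

JUDGE_PROMPT = """You are grading a multi-turn support-agent transcript.
Rubric: (1) followed the SOP steps in order, (2) never invented policy,
(3) escalated when uncertain. Reply with JSON: {"pass": bool, "reason": str}"""

def judge_transcript(llm, transcript: list[dict]) -> dict:
    """Grade one conversation; `llm` is a hypothetical (system, user) -> str call."""
    raw = llm(JUDGE_PROMPT, json.dumps(transcript))
    return json.loads(raw)

def pass_rate(llm, transcripts: list[list[dict]]) -> float:
    """Regression-style eval over transcripts harvested from ops QA."""
    results = [judge_transcript(llm, t) for t in transcripts]
    return sum(r["pass"] for r in results) / len(results)
```

The fragility Reggio flags shows up exactly here: integration-style versions of this eval call live tools inside the loop, so they break whenever any dependency changes, while transcript-replay versions like this sketch stay stable longer.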
Happy New Year! You may have noticed that in 2025 we moved toward YouTube as our primary podcasting platform. As we’ll explain in the next State of Latent Space post, we’ll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates!

We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They were then one of the few AI Grant companies to raise a full seed round from Nat Friedman and Daniel Gross themselves, and have now become the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities. We have chatted with both Clementine Fourrier of HuggingFace’s OpenLLM Leaderboard and (the freshly valued at $1.7B) Anastasios Angelopoulos of LMArena on their approaches to LLM evals and trendspotting, but Artificial Analysis have staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry analyst use.

George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is “open” really?

We discuss:

* The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx’s retweet
* Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers
* The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints
* How they make money: an enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs
* The Omniscience Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding “I don’t know”), and Claude models lead with the lowest hallucination rates despite not always being the smartest (a toy scoring sketch follows below)
* GDPval AA: their version of OpenAI’s GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)
* The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)
* The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)
* Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omniscience Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future
* Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (GPT-5.1 Codex has tighter token distributions)
* V4 of the Intelligence Index coming soon: adding GDPval AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (human-eval-style coding is now trivial for small models)

Links to Artificial Analysis
* Website: https://artificialanalysis.ai
* George Cameron on X: https://x.com/georgecameron
* Micah Hill-Smith on X: https://x.com/micahhsmith

Full Episode on YouTube

Timestamps

* 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
* 01:19 Business Model: Independence and Revenue Streams
* 04:33 Origin Story: From Legal AI to Benchmarking Need
* 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology
* 13:52 Mystery Shopper Policy and Maintaining Independence
* 16:22 AI Grant and Moving to San Francisco
* 19:21 Intelligence Index Evolution: From V1 to V3
* 23:01 GDPval AA: Agentic Benchmark for Real Work Tasks
* 28:01 New Benchmarks: Omniscience Index for Hallucination Detection
* 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning
* 50:19 Stirrup Agent Harness: Open Source Agentic Framework
* 52:43 Openness Index: Measuring Model Transparency Beyond Licenses
* 58:25 The Smiling Curve: Cost Falling While Spend Rising
* 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits
* 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges
* 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas
* 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions
* 1:16:50 Closing: The Insatiable Demand for Intelligence

Transcript

Micah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time Artificial Analysis got mentioned on a podcast was you and Alessio on Latent Space.

swyx [00:00:17]: Amazing. Which was January 2024. I don’t even remember doing that, but yeah, it was very influential to me. Yeah, I’m looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it’s an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing Fireworks, and how do you have a model benchmarking thing without Fireworks? But you had Together, you had Perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I’ve been following your progress. Congrats on... It’s been an amazing year. You guys have really come together to be the presumptive new Gartner of AI, right? Which is something that...

George [00:01:09]: Yeah, but you can’t pay us for better results.

swyx [00:01:12]: Yes, exactly.

George [00:01:13]: Very important.

Micah [00:01:14]: Start off with a spicy take.

swyx [00:01:18]: Okay, how do I pay you?

Micah [00:01:20]: Let’s get right into that.

swyx [00:01:21]: How do you make money?

Micah [00:01:24]: Well, very happy to talk about that. So it’s been a big journey the last couple of years. Artificial Analysis is going to be two years old in January 2026, which is pretty soon now. We run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, and technologies across the AI stack for building stuff. We’re very committed to doing that and intend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We’ve got just over 20 people now and two main customer groups. We want to be who enterprises look to for data and insights on AI, so we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We’ve been very clear about that from the very start because there’s no use doing what we do unless it’s independent AI benchmarking. Yeah. But it turns out a bunch of our stuff can be pretty useful to companies building AI stuff.

swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that?

George [00:02:53]: So we have a benchmarking and insights subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. So, for instance, one of the reports is a model deployment report: how to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself, which is an example of the kind of decision big enterprises face, and it’s hard to reason through; this AI stuff is really new to everybody. And so we try to help companies navigate that with our reports and insights subscription. We also do custom private benchmarking. And so that’s very different from the public benchmarking that we publicize, and there’s no commercial model around that. For private benchmarking, we’ll at times create benchmarks and run benchmarks to specs that enterprises want. And we’ll also do that sometimes for AI companies who have built things, and we help them understand what they’ve built with private benchmarking. Yeah. So that’s a piece we’ve developed mainly through trying to support everybody publicly with our public benchmarks. Yeah.

swyx [00:04:09]: Let’s talk about the tech stack behind that. But okay, I’m going to rewind all the way to when you guys started this project. You were all the way in Sydney?

Micah [00:04:19]: Yeah. Well, Sydney, Australia for me. George was in SF, but he’s Australian, but he moved here already. Yeah.

swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting
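To make the hallucination metric concrete, here is a toy scorer in the spirit of the Omniscience Index bullet above: reward correct answers, give credit for abstaining, and penalize confident wrong answers. The weights and the mapping onto a -100..+100 scale are assumptions for illustration, not Artificial Analysis’s published formula:

```python
def omniscience_style_score(outcomes: list[str]) -> float:
    """Each outcome is one of "correct", "abstain", or "incorrect".
    Wrong answers cost more than abstentions earn, so a model that says
    "I don't know" when unsure beats one that guesses confidently."""
    weights = {"correct": 1.0, "abstain": 0.25, "incorrect": -1.0}  # assumed weights
    raw = sum(weights[o] for o in outcomes) / len(outcomes)  # in [-1, 1]
    return 100.0 * raw  # scaled to the -100..+100 range described

# A model that abstains on its unknowns beats one that guesses and misses:
print(omniscience_style_score(["correct"] * 6 + ["abstain"] * 4))    # 70.0
print(omniscience_style_score(["correct"] * 6 + ["incorrect"] * 4))  # 20.0
```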
We are reupping this episode after LMArena announced their fresh Series A (https://www.theinformation.com/articles/ai-evaluation-startup-lmarena-valued-1-7-billion-new-funding-round?rc=luxwz4), raising $150M at a $1.7B valuation, with $30M annualized consumption revenue (aka $2.5M MRR) after their September evals product launch.

From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 in one of the most influential platforms in AI—trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases?

We caught up with Anastasios live at NeurIPS 2025 to dig into: the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company); why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company); how they’re spending that $100M (inference costs, a React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market); the Leaderboard Illusion controversy and why their response demolished the paper’s claims (factual errors, misrepresentation of open- vs. closed-source sampling, and ignoring the transparency of preview testing that the community loves); why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system—models can’t pay to get on, can’t pay to get off, and scores reflect millions of real votes); how they’re expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon); why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment); and his vision for Arena as the central evaluation platform that provides the North Star for the industry—constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users.

We discuss:

* The $100M raise: use of funds is primarily inference costs (funding free usage for tens of millions of monthly conversations), the React migration off Gradio (custom loading icons, better developer hiring, more flexibility), and hiring world-class talent
* The scale: 250M+ conversations on the platform, tens of millions per month, 25% of users do software for a living, and half of users are now logged in
* The Leaderboard Illusion controversy: Cohere researchers claimed undisclosed private testing created inequities, but Arena’s response demolished the paper’s factual errors (misrepresented open- vs. closed-source sampling, ignored the transparency of preview testing that the community loves)
* Why preview testing is loved by the community: secret codenames (Gemini Nano Banana, named after PM Naina’s nickname), early access to unreleased models, and the thrill of being first to vote on frontier capabilities
* The Nano Banana moment: changed Google’s market share overnight, billions of dollars in stock movement, and validated that multimodal models (image generation, video) are economically critical for marketing, design, and AI-for-science
* New categories: occupational and expert arenas (medicine, legal, finance, creative marketing), Code Arena, and a video arena coming soon (for intuition on how pairwise votes become a leaderboard, see the sketch below)

Full Video Episode

Timestamps

00:00:00 Introduction: Anastasios from Arena and the LMArena Journey
00:01:36 The Anjney Midha Incubation: From Berkeley Basement to Startup
00:02:47 The Decision to Start a Company: Scaling Beyond Academia
00:03:38 The $100M Raise: Use of Funds and Platform Economics
00:05:10 Arena's User Base: 5M+ Users and Diverse Demographics
00:06:02 The Competitive Landscape: Artificial Analysis, AI.xyz, and Arena's Differentiation
00:08:12 Educational Value and Learning from the Community
00:08:41 Technical Migration: From Gradio to React and Platform Evolution
00:10:18 Leaderboard Illusion Paper: Addressing Critiques and Maintaining Integrity
00:12:29 Nano Banana Moment: How Preview Models Create Market Impact
00:13:41 Multimodal AI and Image Generation: From Skepticism to Economic Value
00:15:37 Core Principles: Platform Integrity and the Public Leaderboard as Charity
00:18:29 Future Roadmap: Expert Categories, Multimodal, Video, and Occupational Verticals
00:19:10 API Strategy and Focus: Doing One Thing Well
00:19:51 Community Management and Retention: Sign-In, History, and Daily Value
00:21:49 Hiring and Building a High-Performance Team
00:22:21 Partnerships and Agent Evaluation: From Devin to Full-Featured Harnesses

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
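For intuition on how millions of pairwise votes become a leaderboard, here is the minimal Elo-style update. LMArena’s production rankings use a proper statistical fit (Bradley-Terry style models with confidence intervals), so treat this as the toy version of the idea, with assumed constants:

```python
def elo_update(r_a: float, r_b: float, outcome: str, k: float = 32.0):
    """One head-to-head vote between models A and B.
    outcome is "a", "b", or "tie"; k controls update size (assumed)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# An upset (low-rated model wins) moves ratings more than an expected win:
print(elo_update(1000.0, 1200.0, "a"))  # approximately (1024.3, 1175.7)
```

Because fresh votes arrive constantly, the ranking stays current, which is exactly the “immune to overfitting” property described above.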
From undergraduate research seminars at Princeton to winning a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep—unlocking performance gains that the RL community thought impossible.

We caught up with the team live at NeurIPS to dig into the story behind RL1000: why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it’s not just about depth, it’s about the objective); how they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse; the critical architectural tricks that made it work (residual connections, layer normalization, and a shift from regression to classification); why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth); how Jax and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place); the “critical depth” phenomenon where performance doesn’t just improve—it multiplies once you cross 15M+ transitions and add the right architectural components; why this isn’t just “make networks bigger” but a fundamental shift in RL objectives (their code doesn’t have a line saying “maximize rewards”—it’s pure self-supervised representation learning); how deep-teacher, shallow-student distillation could unlock deployment at scale (train frontier capabilities with 1,000 layers, distill down to efficient inference models); the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection); and their thesis that RL is finally ready to scale like language and vision—not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work.

We discuss:

* The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together and states along different trajectories are pushed apart—turning RL into a classification problem (a minimal sketch of this loss follows below)
* Why naive scaling failed: doubling depth degraded performance, but doubling again with residual connections and layer norm suddenly skyrocketed performance in one environment—unlocking the “critical depth” phenomenon
* Scaling depth vs. width: depth grows parameters linearly, width grows quadratically—depth is more parameter-efficient and sample-efficient for the same performance
* The Jax + GPU-accelerated environments unlock: collecting thousands of trajectories in parallel meant data wasn’t the bottleneck, and crossing 15M+ transitions was when deep networks really paid off
* The blurring of RL and self-supervised learning: their code doesn’t maximize rewards directly; it’s an actor-critic goal-conditioned RL algorithm, but the learning burden shifts to classification (cross-entropy loss, representation learning) instead of TD-error regression
* Why scaling batch size unlocks at depth: traditional RL doesn’t benefit from larger batches because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension

— RL1000 Team (Princeton)
* 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities: https://openreview.net/forum?id=s0JVsx3bx1

Full Video Episode

Timestamps

00:00:00 Introduction: Best Paper Award and NeurIPS Poster Experience
00:01:11 Team Introductions and Princeton Research Origins
00:03:35 The Deep Learning Anomaly: Why RL Stayed Shallow
00:04:35 Self-Supervised RL: A Different Approach to Scaling
00:05:13 The Breakthrough Moment: Residual Connections and Critical Depth
00:07:15 Architectural Choices: Borrowing from ResNets and Avoiding Vanishing Gradients
00:07:50 Clarifying the Paper: Not Just Big Networks, But Different Objectives
00:08:46 Blurring the Lines: RL Meets Self-Supervised Learning
00:09:44 From TD Errors to Classification: Why This Objective Scales
00:11:06 Architecture Details: Building on Braw and SymbaFowl
00:12:05 Robotics Applications: Goal-Conditioned RL Without Human Supervision
00:13:15 Efficiency Trade-offs: Depth vs Width and Parameter Scaling
00:15:48 JAX and GPU-Accelerated Environments: The Data Infrastructure
00:18:05 World Models and Next State Classification
00:21:02 Future Directions: Distillation, VLMs, and Hierarchical Planning
00:22:37 Unlocking Batch Size Scaling Through Network Capacity
00:24:10 Compute Requirements: State-of-the-Art on a Single GPU
00:27:15 Closing Thoughts: Challenging Conventional Wisdom in RL Scaling

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
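The objective is compact enough to state in code. A sketch of the contrastive critic loss as described, where row i of the (state, action) embeddings and row i of the future-state embeddings come from the same trajectory; temperature, symmetrized terms, and the actual architecture are details the paper pins down and this sketch does not:

```python
import torch
import torch.nn.functional as F

def contrastive_critic_loss(phi_sa: torch.Tensor, psi_future: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style loss over a batch of B pairs, each of shape (B, D).
    Same-trajectory pairs (the diagonal) are pulled together; every other
    pair in the batch is pushed apart. RL as classification: the "label"
    for row i is simply i, and the loss is plain cross-entropy rather
    than a TD-error regression target."""
    logits = phi_sa @ psi_future.T            # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0])    # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

This framing also explains why batch size becomes a useful scaling axis at depth: every extra row in the batch adds B-1 more negatives for the classifier to discriminate.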
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents—trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale.

We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (October 2023) to the industry standard after Devin’s launch (and how Walden emailed him two weeks before the big reveal); how the benchmark evolved from Django-heavy to nine languages across 40 repos (JavaScript, Rust, Java, C, Ruby); why unit tests as verification are limiting and long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds); the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-Efficiency, AlgoTune, SciCode) and how benchmark authors are now justifying their splits with curation techniques instead of just “more repos”; why Tau-bench’s “impossible tasks” controversy is actually a feature, not a bug (intentionally including impossible tasks flags cheating); the tension between long autonomy (5-hour runs) vs. interactivity (Cognition’s emphasis on fast back-and-forth); how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs; the academic data problem (companies like Cognition and Cursor have rich user interaction data; academics need user simulators or compelling products like LMArena to get similar signal); and his vision for CodeClash as a testbed for human-AI collaboration—freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.

We discuss:

* John’s path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks
* The SWE-bench origin story: released October 2023, mostly ignored until Cognition’s Devin launch kicked off the arms race (Walden emailed John two weeks before: “we have a good number”)
* SWE-bench Verified: the curated, high-quality split that became the standard for serious evals (the unit-test verification pattern, and its limits, are sketched below)
* SWE-bench Multimodal and Multilingual: nine languages (JavaScript, Rust, Java, C, Ruby) across 40 repos, moving beyond the Django-heavy original distribution
* The SWE-bench Pro controversy: independent authors used the “SWE-bench” name without John’s blessing, but he’s okay with it (“congrats to them, it’s a great benchmark”)
* CodeClash: John’s new benchmark for long-horizon development—agents maintain their own codebases, edit and improve them each round, then compete in arenas (programming games like Halite, economic tasks like GDP optimization)
* SWE-Efficiency (Jeffrey Maugh, John’s high school classmate): optimize code for speed without changing behavior (parallelization, SIMD operations)
* AlgoTune, SciCode, Terminal-bench, Tau-bench, SecBench, SRE-bench: the Cambrian explosion of code evals, each diving into different domains (security, SRE, science, user simulation)
* The Tau-bench “impossible tasks” debate: some tasks are underspecified or impossible, but John thinks that’s actually a feature (it flags cheating if you score above 75%)
* Cognition’s research focus: codebase understanding (retrieval++), helping humans understand their own codebases, and automatic context engineering for LLMs (research sub-agents)
* The vision: CodeClash as a testbed for human-AI collaboration—vary the setup (solo agent, multi-agent, human+agent), freeze model capability, and measure how interaction changes as models improve

— John Yang
* SWE-bench: https://www.swebench.com
* X: https://x.com/jyangballin

Full Video Episode

Timestamps

00:00:00 Introduction: John Yang on SWE-bench and Code Evaluations
00:00:31 SWE-bench Origins and Devin's Impact on the Coding Agent Arms Race
00:01:09 SWE-bench Ecosystem: Verified, Pro, Multimodal, and Multilingual Variants
00:02:17 Moving Beyond Django: Diversifying Code Evaluation Repositories
00:03:08 CodeClash: Long-Horizon Development Through Programming Tournaments
00:04:41 From Halite to Economic Value: Designing Competitive Coding Arenas
00:06:04 Ofir's Lab: SWE-Efficiency, AlgoTune, and SciCode for Scientific Computing
00:07:52 The Benchmark Landscape: Tau-bench, Terminal-bench, and User Simulation
00:09:20 The Impossible Task Debate: Refusals, Ambiguity, and Benchmark Integrity
00:12:32 The Future of Code Evals: Long Autonomy vs Human-AI Collaboration
00:14:37 Call to Action: User Interaction Data and Codebase Understanding Research

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
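The “unit tests as verification” pattern that SWE-bench standardized, and whose limits John is pushing past with CodeClash, fits in a dozen lines. A hypothetical harness sketch, not SWE-bench’s actual code:

```python
import subprocess

def verify_patch(repo_dir: str, patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch, run the repo's tests, return pass/fail.
    Note the limitation: the patch only has to satisfy the tests, not the
    issue's intent, which is exactly why curation (SWE-bench Verified) and
    tournament-style evals (CodeClash) exist."""
    applied = subprocess.run(["git", "apply", "-"], input=patch.encode(),
                             cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # patch does not even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0  # the binary signal the leaderboard sees
```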
From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI’s post-training evolution—from the PPO vs. DPO debates of 2023 to today’s RLVR era, where the real innovation isn’t optimization methods but data quality, signal trust, and token efficiency.

We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026: why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math); how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preference you can’t); why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens); how Codex has changed his workflow so much he feels “trapped” by 40-minute design sessions followed by 15-minute agent sprints; the infrastructure chaos of scaling RL (“way more moving parts than pre-training”); why long context will keep climbing but agents + graph walks might matter more than 10M-token windows; the shopping model as a test bed for interruptability and chain-of-thought transparency; why personality toggles (Anton vs. Clippy) are a real differentiator users care about; and his thesis that the education system isn’t producing enough people who can do both distributed systems and ML research—the exact skill set required to push the frontier when the bottleneck moves every few weeks.

We discuss:

* Josh’s path: pre-training data curation → post-training researcher at OpenAI, shipping GPT-4o, o1, o3, GPT-5 thinking, and the shopping model
* Why he switched from pre-training to post-training: “Do I want to make 3% compute efficiency wins, or change behavior by 40%?”
* The RL infrastructure challenge: way more moving parts than pre-training (tasks, grading setups, external partners), and why babysitting runs at 12:30am means jumping into unfamiliar code constantly
* How Codex has changed his workflow: 40-minute design sessions compressed into 15-minute agent sprints, and the strange “trapped” feeling of waiting for the agent to finish
* The RLHF vs. RLVR debate: both are policy gradient methods; the real difference is data quality and signal trust (human preference vs. verifiable correctness)
* Why GRPO (from DeepSeek Math) was underappreciated: not just an optimization trick, but a shift toward reward signals you can actually trust (math answers over human vibes); the group-relative core is sketched below
* The token efficiency revolution: GPT-5 to 5.1 bumped evals and slashed tokens, and why thinking in tokens (not wall-clock time) unlocks better tool-calling and agent workflows
* Personality toggles: Anton (tool, no warmth) vs. Clippy (friendly, helpful), and why Josh uses custom instructions to make his model “just a tool”
* The router problem: having a router at the top (GPT-5 thinking vs. non-thinking) and an implicit router (the thinking-effort slider) creates weird bumps, and why the abstractions will eventually merge
* Long context: climbing Graphwalks evals, the dream of 10M+ token windows, and why agents + graph walks might matter more than raw context length
* Why the education system isn’t producing enough people who can do both distributed systems and ML research, and why that’s the bottleneck for frontier labs
* The 2026 vision: neither pre-training nor post-training is dead, we’re in the fog of war, and the bottleneck will keep moving (so emotional stability helps)

— Josh McGrath
* OpenAI: https://openai.com
* X: https://x.com/j_mcgraph

Full Video Episode

Timestamps

00:00:00 Introduction: Josh McGrath on Post-Training at OpenAI
00:01:40 Infrastructure Challenges: Why Post-Training RL is Harder Than Pre-Training
00:03:45 Codex Max and the Flow Problem: 40 Minutes of Planning, 15 Minutes of Waiting
00:04:37 The Shopping Model: Black Friday Launch and Interruptability
00:07:11 Model Personality and the Anton vs Clippy Divide
00:08:26 Beyond PPO vs DPO: The Data Quality Spectrum in RL
00:13:12 Token Efficiency: The 2D Plot That Matters Most
00:17:29 Long Context and Graphwalks: Climbing Toward Perfect Context
00:21:23 The ML-Systems Hybrid: What's Hard to Hire For
00:24:50 Pre-Training Isn't Dead: Living Through Technological Revolution

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
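Since GRPO comes up as the underappreciated shift, here is its group-relative core in a few lines: sample a group of completions for one prompt, score each with a verifiable reward, and use the within-group z-score as the advantage, with no learned value network. Clipping, KL penalties, and the policy-ratio machinery of the full algorithm are omitted in this sketch:

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Advantage of each sampled completion relative to its own group."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard all-equal groups
    return [(r - mu) / sigma for r in group_rewards]

# 8 samples for one math problem, 3 verified correct (reward 1) and 5 wrong:
print(grpo_advantages([1, 0, 0, 1, 0, 0, 0, 1]))  # correct samples get positive advantage
```

The data-quality point lands exactly here: the rewards feeding this function are exact-match checks you can trust, not preference scores you cannot.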
From Berkeley robotics and OpenAI’s 2017 Dota-era internship to shipping RL breakthroughs on GPT-4o, o1, and o3, and now leading model development at Cursor, Ashvin Nair has done it all.

We caught up with Ashvin at NeurIPS 2025 to dig into: the inside story of OpenAI’s reasoning team (spoiler: it went from a dozen people to 300+); why IOI Gold felt reachable in 2022 but somehow didn’t change the world when o1 actually achieved it; how RL doesn’t generalize beyond the training distribution (and why that means you need to bring economically useful tasks into distribution by co-designing products and models); the deeper lessons from the RL research era (2017–2022) and why most of it didn’t pan out because the community overfitted to benchmarks; how Cursor is uniquely positioned to do continual learning at scale, with policy updates every two hours and product-model co-design that keeps engineers in the loop instead of context-switching into ADHD hell; and his bet that the next paradigm shift is continual learning with infinite memory—where models experience something once (a bug, a mistake, a user pattern) and never forget it, storing millions of deployment tokens in weights without overloading capacity.

We discuss:

* Ashvin’s path: Berkeley robotics PhD → OpenAI 2017 intern (Dota era) → o1/o3 reasoning team → Cursor ML lead in three months
* Why robotics people are the most grounded at NeurIPS (they work with the real world) and simulation people are the most unhinged (Lex Fridman’s take)
* The IOI Gold paradox: “If you told me we’d achieve IOI Gold in 2022, I’d assume we could all go on vacation—AI solved, no point working anymore. But life is still the same.”
* The RL research era (2017–2022) and why most of it didn’t pan out: overfitting to benchmarks, too many implicit knobs to tune, and the community rewarding complex ideas over simple ones that generalize
* Inside the o1 origin story: a dozen people, conviction from Ilya and Jakob Pachocki that RL would work, small-scale prototypes producing “surprisingly accurate reasoning traces” on math, and first-principles belief that scaled
* The reasoning team grew from ~12 to 300+ people as o1 became a product and safety, tooling, and deployment scaled up
* Why Cursor is uniquely positioned for continual learning: policy updates every two hours (online RL on tab), product and ML sitting next to each other, and the entire software engineering workflow (code, logs, debugging, DataDog) living in the product
* Composer as the start of product-model co-design: smart enough to use, fast enough to stay in the loop, and built by a 20–25 person ML team with high-taste co-founders who code daily
* The next paradigm shift: continual learning with infinite memory—models that experience something once (a bug, a user mistake) and store it in weights forever, learning from millions of deployment tokens without overloading capacity (trillions of pretraining tokens = plenty of room)
* Why off-policy RL is unstable (Ashvin’s favorite interview question; a one-number illustration follows below) and why Cursor does two-day work trials instead of whiteboard interviews
* The vision: automate software engineering as a process (not just answering prompts), co-design products so the entire workflow (write code, check logs, debug, iterate) is in-distribution for RL, and make models that never make the same mistake twice

— Ashvin Nair
* Cursor: https://cursor.com
* X: https://x.com/ashvinnair_

Full Video Episode

Timestamps

00:00:00 Introduction: From Robotics to Cursor via OpenAI
00:01:58 The Robotics to LLM Agent Transition: Why Code Won
00:09:11 RL Research Winter and Academic Overfitting
00:11:45 The Scaling Era and Moving Goalposts: IOI Gold Doesn't Mean AGI
00:20:03 The Blip: Thanksgiving 2023 and OpenAI Governance
00:21:30 OpenAI's Reasoning Journey: From Codex to o1
00:22:39 RL for Reasoning: The O-Series Conviction and Scaling
00:25:47 o1 to o3: Smooth Internal Progress vs External Hype Cycles
00:33:07 Why Cursor: Co-Designing Products and Models for Real Work
00:34:14 Composer and the Future: Online Learning Every Two Hours
00:35:15 Continual Learning: The Missing Paradigm Shift
00:44:00 Hiring at Cursor and Why Off-Policy RL is Unstable

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
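On the interview question: one standard way to see why off-policy RL is unstable is to look at the trajectory-level importance weight, the product of per-token probability ratios between the new and old policies. Small per-token drift compounds multiplicatively, so gradient estimates blow up or vanish. A toy illustration with made-up numbers:

```python
import math
import random

def importance_weight(logp_new: list[float], logp_old: list[float]) -> float:
    """prod(pi_new / pi_old) over a trajectory, computed in log space."""
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))

random.seed(0)
# A 200-token trajectory with a mean log-prob gap of just 0.05 per token:
logp_new = [random.gauss(0.05, 0.1) for _ in range(200)]
logp_old = [0.0] * 200
print(importance_weight(logp_new, logp_old))  # on the order of e^10, roughly 20,000x
```

On-policy training keeps every ratio near 1 by always sampling from the current policy, which is one reason the o-series bet on models learning from their own experience.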
From investing through the modern data stack era (DBT, Fivetran, and the analytics explosion) to now investing at the frontier of AI infrastructure and applications at Amplify Partners, Sarah Catanzaro has spent years at the intersection of data, compute, and intelligence—watching categories emerge, merge, and occasionally disappoint.

We caught up with Sarah live at NeurIPS 2025 to dig into the state of AI startups heading into 2026: why $100M+ seed rounds with no near-term roadmap are now the norm (and why that terrifies her); what the DBT-Fivetran merger really signals about the modern data stack (spoiler: it’s not dead, just ready for IPO); how frontier labs are using DBT and Fivetran to manage training data and agent analytics at scale; why data catalogs failed as standalone products but might succeed as metadata services for agents; the consumerization of AI and why personalization (memory, continual learning, K-factor) is the 2026 unlock for retention and growth; why she thinks RL environments are a fad and real-world logs beat synthetic clones every time; and her thesis for the most exciting AI startups: companies that marry hard research problems (RAG, rule-following, continual learning) with killer applications that were simply impossible before.

We discuss:

* The DBT-Fivetran merger: not the death of the modern data stack, but a path to IPO scale (targeting $600M+ combined revenue) and a signal that both companies were already winning their categories
* How frontier labs use data infrastructure: DBT and Fivetran for training data curation, agent analytics, and managing increasingly complex interactions—plus the rise of transactional databases (RocksDB) and efficient data loading (Vortex) for GPU-bound workloads
* Why data catalogs failed: built for humans when they should have been built for machines, focused on discoverability when the real opportunity was governance, and ultimately subsumed as features inside Snowflake, DBT, and Fivetran
* The $100M+ seed phenomenon: raising massive rounds at billion-dollar valuations with no 6-month roadmap, seven-day decision windows, and founders optimizing for signal (“we’re a unicorn”) over partnership or dilution discipline
* Why world models are overhyped but underspecified: three competing definitions, unclear generalization across use cases (video games ≠ robotics ≠ autonomous driving), and a research problem masquerading as a product category
* The 2026 theme: consumerization of AI via personalization—memory management, continual learning, and solving retention/churn by making products learn skills, preferences, and adapt as the world changes (not just storing facts in cursor rules)
* Why RL environments are a fad: labs are paying 7–8 figures for synthetic clones when real-world logs, traces, and user activity (à la Cursor) are richer, cheaper, and more generalizable
* Sarah’s investment thesis: research-driven applications that solve hard technical problems (RAG for Harvey, rule-following for Sierra, continual learning for the next killer app) and unlock experiences that were impossible before
* Infrastructure bets: memory, continual learning, stateful inference, and the systems challenges of loading/unloading personalized weights at scale
* Why K-factor and growth fundamentals matter again: AI felt magical in 2023–2024, but as the magic fades, retention and virality are back—and most AI founders have never heard of K-factor (the arithmetic is sketched below)

— Sarah Catanzaro
* X: https://x.com/sarahcat21
* Amplify Partners: https://amplifypartners.com/

Where to find Latent Space
* X: https://x.com/latentspacepod

Full Video Episode

Timestamps

00:00:00 Introduction: Sarah Catanzaro's Journey from Data to AI
00:01:02 The DBT-Fivetran Merger: Not the End of the Modern Data Stack
00:05:26 Data Catalogs and What Went Wrong
00:08:16 Data Infrastructure at AI Labs: Surprising Insights
00:10:13 The Crazy Funding Environment of 2024-2025
00:17:18 World Models: Hype, Confusion, and Market Potential
00:18:59 Memory Management and Continual Learning: The Next Frontier
00:23:27 Agent Environments: Just a Fad?
00:25:48 The Perfect AI Startup: Research Meets Application
00:28:02 Closing Thoughts and Where to Find Sarah

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
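For anyone in the “never heard of K-factor” camp, the arithmetic is one line: average invites sent per user times the rate at which invites convert to new users. A minimal sketch:

```python
def k_factor(invites_per_user: float, conversion_rate: float) -> float:
    """K > 1 means each user cohort recruits a larger one (viral growth);
    K < 1 means growth decays without paid acquisition."""
    return invites_per_user * conversion_rate

# Each user shares with 4 people and 20% of those sign up:
print(k_factor(4, 0.20))  # 0.8 -> sub-viral: every successive cohort shrinks
```

Sarah’s point is that 2023–2024 AI products grew on novelty alone, so founders never had to think about this number; as the magic fades, K and retention curves become the fundamentals again.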
One year ago, Anthropic launched the Model Context Protocol (MCP)—a simple, open standard to connect AI applications to the data and tools they need. Today, MCP has exploded from a local-only experiment into the de facto protocol for agentic systems, adopted by OpenAI, Microsoft, Google, Block, and hundreds of enterprises building internal agents at scale. And now, MCP is joining the newly formed Agentic AI Foundation (AAIF) under the Linux Foundation, alongside Block’s Goose coding agent, with founding members spanning the biggest names in AI and cloud infrastructure.

We sat down with David Soria Parra (MCP lead, Anthropic), Nick Cooper (OpenAI), Brad Howes (Block / Goose), and Jim Zemlin (Linux Foundation CEO) to dig into the one-year journey of MCP—from Thanksgiving hacking sessions and the first remote authentication spec to long-running tasks, MCP Apps, and the rise of agent-to-agent communication—and the behind-the-scenes story of how three competitive AI labs came together to donate their protocols and agents to a neutral foundation; why enterprises are deploying MCP servers faster than anyone expected (most of it invisible, internal, and at massive scale); what it takes to design a protocol that works for both simple tool calls and complex multi-agent orchestration; how the foundation will balance taste-making (curating meaningful projects) with openness (avoiding vendor lock-in); and the 2026 vision: MCP as the communication layer for asynchronous, long-running agents that work while you sleep, discover and install their own tools, and unlock the next order of magnitude in AI productivity.

We discuss:

* The one-year MCP journey: from local stdio servers to remote HTTP streaming, OAuth 2.1 authentication (and the enterprise lessons learned), long-running tasks, and MCP Apps (iframes for richer UI)
* Why MCP adoption is exploding internally at enterprises: invisible, internal servers connecting agents to Slack, Linear, proprietary data, and compliance-heavy workflows (financial services, healthcare)
* The authentication evolution: separating resource servers from identity providers, dynamic client registration, and why the March spec wasn’t enterprise-ready (and how June fixed it)
* How Anthropic dogfoods MCP: an internal gateway, custom servers for Slack summaries and employee surveys, and why MCP was born from “how do I scale dev tooling faster than the company grows?” (a toy server is sketched below)
* Tasks: the new primitive for long-running, asynchronous agent operations—why tools aren’t enough, how tasks enable deep research and agent-to-agent handoffs, and the design choice to make tasks a “container” (not just async tools)
* MCP Apps: why iframes, how to handle styles and branding, seat selection and shopping UIs as the killer use case, and the collaboration with OpenAI to build a common standard
* The registry problem: the official registry vs. curated sub-registries (Smithery, GitHub), trust levels, model-driven discovery, and why MCP needs “npm for agents” (but with signatures and HIPAA/financial compliance)
* The founding story of AAIF: how Anthropic, OpenAI, and Block came together (spoiler: they didn’t know the others were talking to the Linux Foundation), why neutrality matters, and how Jim Zemlin has never seen this much day-one inbound interest in 22 years

— David Soria Parra (Anthropic / MCP)
* MCP: https://modelcontextprotocol.io
* https://uk.linkedin.com/in/david-soria-parra-4a78b3a
* https://x.com/dsp_

Nick Cooper (OpenAI)
* X: https://x.com/nicoaicopr

Brad Howes (Block / Goose)
* Goose: https://github.com/block/goose

Jim Zemlin (Linux Foundation)
* LinkedIn: https://www.linkedin.com/in/zemlin/

Agentic AI Foundation
* https://agenticai.foundation

Full Video Episode

Timestamps

00:00:00 Introduction: MCP's First Year and Foundation Launch
00:01:17 MCP's Journey: From Launch to Industry Standard
00:02:06 Protocol Evolution: Remote Servers and Authentication
00:08:52 Enterprise Authentication and Financial Services
00:11:42 Transport Layer Challenges: HTTP Streaming and Scalability
00:15:37 Standards Development: Collaboration with Tech Giants
00:23:15 Skills vs MCP: Complementary Not Competing
00:26:55 Internal Adoption: How Anthropic Uses MCP
00:30:41 Discovery and Registries: Building the MCP Ecosystem
00:30:54 MCP Apps and UI: Beyond Text Interfaces
00:34:27 Long-Running Tasks: The Future of Async Agents
00:36:16 Community Events and Enterprise Learnings
01:03:31 Foundation Formation: Why Now and Why Together
01:07:38 Linux Foundation Partnership: Structure and Governance
01:11:13 Goose as Reference Implementation
01:17:28 Principles Over Roadmaps: Composability and Quality
01:21:02 Foundation Value Proposition: Why Contribute
01:27:49 Practical Investments: Events, Tools, and Community
01:34:58 Looking Ahead: Async Agents and Real Impact

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
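For flavor, here is what the server side of MCP looks like in practice, using the FastMCP helper from the reference Python SDK. The Slack-summary tool is a stub standing in for the kind of internal server described above; treat the exact API surface as an assumption and check the SDK docs for the current version:

```python
# pip install mcp  (the reference Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("slack-summaries")  # server name is illustrative

@mcp.tool()
def summarize_channel(channel: str, days: int = 1) -> str:
    """Summarize recent activity in a Slack channel (stubbed here)."""
    return f"(summary of #{channel} over the last {days} day(s))"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; remote HTTP streaming also exists
```

Any MCP-capable client (Claude, an IDE, a Goose agent) can then discover and call `summarize_channel` without bespoke integration code, which is the whole point of the protocol.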
Note: Steve and Gene’s talk on Vibe Coding and the post IDE world was one of the top talks of AIE CODE: From building legendary platforms at Google and Amazon to authoring one of the most influential essays on AI-powered development (Revenge of the Junior Developer, quoted by Dario Amodei himself), Steve Yegge has spent decades at the frontier of software engineering—and now he’s leading the charge into what he calls the “factory farming” era of code. After stints at SourceGraph and building Beads (a purely vibe-coded issue tracker with tens of thousands of users), Steve co-authored The Vibe Coding Book and is now building VC (VibeCoder), an agent orchestration dashboard designed to move developers from writing code to managing fleets of AI agents that coordinate, parallelize, and ship features while you sleep. We sat down with Steve at AI Engineer Summit to dig into why Claude Code, Cursor, and the entire 2024 stack are already obsolete, what it actually takes to trust an agent after 2,000 hours of practice (hint: they will delete your production database if you anthropomorphize them), why the real skill is no longer writing code but orchestrating agents like a NASCAR pit crew, how merging has become the new wall that every 10x-productive team is hitting (and why one company’s solution is literally “one engineer per repo”), the rise of multi-agent workflows where agents reserve files, message each other via MCP, and coordinate like a little village, why Steve believes if you’re still using an IDE to write code by January 1st, you’re a bad engineer, how the 12–15 year experience bracket is the most resistant demographic (and why their identity is tied to obsolete workflows), the hidden chaos inside OpenAI, Anthropic, and Google as they scale at breakneck speed, why rewriting from scratch is now faster than refactoring for a growing class of codebases, and his 2025 prediction: we’re moving from subsistence agriculture to John Deere-scale factory farming of code, and the Luddite backlash is only just beginning. We discuss: * Why Claude Code, Cursor, and agentic coding tools are already last year’s tech—and what comes next: agent orchestration dashboards where you manage fleets, not write lines * The 2,000-hour rule: why it takes a full year of daily use before you can predict what an LLM will do, and why trust = predictability, not capability * Steve’s hot take: if you’re still using an IDE to develop code by January 1st, 2025, you’re a bad engineer—because the abstraction layer has moved from models to full-stack agents * The demographic most resistant to vibe coding: 12–15 years of experience, senior engineers whose identity is tied to the way they work today, and why they’re about to become the interns * Why anthropomorphizing LLMs is the biggest mistake: the “hot hand” fallacy, agent amnesia, and how Steve’s agent once locked him out of prod by changing his password to “fix” a problem * Should kids learn to code? 
Steve’s take: learn to vibe code—understand functions, classes, architecture, and capabilities in a language-neutral way, but skip the syntax * The 2025 vision: “factory farming of code” where orchestrators run Claude Code, scrub output, plan-implement-review-test in loops, and unlock programming for non-programmers at scale — Steve Yegge * X: https://x.com/steve_yegge * Medium (Stevie’s Tech Talks): https://steve-yegge.medium.com/ * GitHub (VC / VibeCoder): https://github.com/yegge-labs Where to find Latent Space * X: https://x.com/latentspacepod Full Video Episode Timestamps * 00:00:00 Introduction: Steve Yegge on Vibe Coding and AI Engineering * 00:00:59 The Backlash: Who Resists Vibe Coding and Why * 00:02:55 10X Productivity at OpenAI: The Performance Review Problem * 00:03:31 The January 1st Deadline: IDEs Are Becoming Obsolete * 00:04:26 The 2000 Hour Rule: Building Trust with AI Coding Tools * 00:07:49 The Hot Hand Fallacy: When AI Agents Betray Your Trust * 00:11:12 Claude Code Isn't It: The Need for Agent Orchestration * 00:15:20 The Orchestrator Revolution: From Claude Code to Agent Villages * 00:18:46 The Merge Wall: The Biggest Unsolved Problem in AI Coding * 00:22:43 Factory Farming Code: The John Deere Era of Software * 00:26:33 Never Rewrite Your Code - Until Now: Joel Spolsky Was Wrong * 00:29:27 Google's Gemini Turnaround and the AI Lab Chaos * 00:33:20 Should Your Kids Learn to Code? The New Answer * 00:34:59 Code MCP and the Gossip Rate: Latest Vibe Coding Discoveries
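The plan-implement-review-test loop Steve describes is concrete enough to sketch. Below is a toy Python version: the headless `agent` CLI is a hypothetical placeholder rather than any real tool, pytest stands in as the verifier, and a production orchestrator would add the file reservations, merge handling, and agent-to-agent messaging discussed in the episode.

```python
# Toy sketch of the plan-implement-review-test orchestration loop.
# "agent" is a hypothetical headless coding-agent CLI, not a real tool;
# pytest is the verifier. Real orchestrators also handle file reservations,
# merge conflicts, and agent-to-agent messaging.
import subprocess

def run_agent(task: str) -> str:
    out = subprocess.run(["agent", "--task", task],  # placeholder CLI
                         capture_output=True, text=True)
    return out.stdout

def orchestrate(feature: str, max_rounds: int = 5) -> bool:
    plan = run_agent(f"plan: {feature}")
    for _ in range(max_rounds):
        run_agent(f"implement: {plan}")
        review = run_agent(f"review the current diff for: {feature}")
        tests = subprocess.run(["pytest", "-q"])   # verify, don't trust
        if tests.returncode == 0 and "LGTM" in review:
            return True                            # ship it
        plan = run_agent(f"revise the plan given this review: {review}")
    return False                                   # escalate to a human
```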
From the frontlines of OpenAI’s Codex and GPT-5 training teams, Bryan and Bill are building the future of AI-powered coding—where agents don’t just autocomplete, they architect, refactor, and ship entire features while you sleep. We caught up with them at AI Engineer Conference right after the launch of Codex Max, OpenAI’s newest long-running coding agent designed to work for 24+ hours straight, manage its own context, and spawn sub-agents to parallelize work across your entire codebase. We sat down with Bryan and Bill to dig into what it actually takes to train a model that developers trust—why personality, communication, and planning matter as much as raw capability, how Codex is trained with strong opinions about tools (it loves rg over grep, seriously), why the abstraction layer is moving from models to full-stack agents you can plug into VS Code or Zed, how OpenAI partners co-develop tool integrations and discover unexpected model habits (like renaming tools to match Codex’s internal training), the rise of applied evals that measure real-world impact instead of academic benchmarks, why multi-turn evals are the next frontier (and Bryan’s “job interview eval” idea), how coding agents are breaking out of code into personal automation, terminal workflows, and computer use, and their 2026 vision: coding agents trusted enough to handle the hardest refactors at any company, not just top-tier firms, and general enough to build integrations, organize your desktop, and unlock capabilities you’d never get access to otherwise. We discuss: * What Codex Max is: a long-running coding agent that can work 24+ hours, manage its own context window, and spawn sub-agents for parallel work * Why the name “Max”: maximalist, maximization, speed and endurance—it’s simply better and faster for the same problems * Training for personality: communication, planning, context gathering, and checking your work as behavioral characteristics, not just capabilities * How Codex develops habits like preferring rg over grep, and why renaming tools to match its training (e.g., terminal-style naming) dramatically improves tool-call performance * The split between Codex (opinionated, agent-focused, optimized for the Codex harness) and GPT-5 (general, more durable across different tools and modalities) * Why the abstraction layer is moving up: from prompting models to plugging in full agents (Codex, GitHub Copilot, Zed) that package the entire stack * The rise of sub-agents and agents-using-agents: Codex Max spawning its own instances, handing off context, and parallelizing work across a codebase * How OpenAI works with coding partners on the bleeding edge to co-develop tool integrations and discover what the model is actually good at * The shift to applied evals: capturing real-world use cases instead of academic benchmarks, and why ~50% of OpenAI employees now use Codex daily * Why multi-turn evals are the next frontier: LM-as-a-judge for entire trajectories, Bryan’s “job interview eval” concept, and the need for a batch multi-turn eval API * How coding agents are breaking out of code: personal automation, organizing desktops, terminal workflows, and “Devin for non-coding” use cases * Why Slack is the ultimate UI for work, and how coding agents can become your personal automation layer for email, files, and everything in between * The 2026 vision: more computer use, more trust, and coding agents capable enough that any company can access top-tier developer capabilities, not just elite firms — Bryan & Bill (OpenAI Codex 
Team) * http://x.com/bfioca * https://x.com/realchillben * OpenAI Codex: https://openai.com/index/openai-codex/ Where to find Latent Space * X: https://x.com/latentspacepod Full Video Episode Timestamps * 00:00:00 Introduction: Latent Space Listeners at AI Engineer Code * 00:01:27 Codex Max Launch: Training for Long-Running Coding Agents * 00:03:01 Model Personality and Trust: Communication, Planning, and Self-Checking * 00:05:20 Codex vs GPT-5: Opinionated Agents vs General Models * 00:07:47 Tool Use and Model Habits: The Ripgrep Discovery * 00:09:16 Personality Design: Verbosity vs Efficiency in Coding Agents * 00:11:56 The Agent Abstraction Layer: Building on Top of Codex * 00:14:08 Sub-Agents and Multi-Agent Patterns: The Future of Composition * 00:16:11 Trust and Adoption: OpenAI Developers Using Codex Daily * 00:17:21 Applied Evals: Real-World Testing vs Academic Benchmarks * 00:19:15 Multi-Turn Evals and the Job Interview Pattern * 00:21:35 Feature Request: Batch Multi-Turn Eval API * 00:22:28 Beyond Code: Personal Automation and Computer Use * 00:24:51 Vision-Native Agents and the UI Integration Challenge * 00:25:02 2026 Predictions: Trust, Computer Use, and Democratized Excellence
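The tool-renaming habit is straightforward to exploit when you define your own tools. Here is a hedged sketch in the standard Chat Completions tool-schema format: the schema shape is the documented one, the parameter details are illustrative, and the claim that terminal-style names improve tool-call performance is the guests' observation, not something this snippet proves.

```python
# Sketch: naming a repo-search tool after the terminal command the model
# already prefers (rg), per the renaming trick discussed above. Schema
# follows the standard OpenAI tools format; parameters are illustrative.
rg_tool = {
    "type": "function",
    "function": {
        "name": "rg",  # terminal-style name matching the model's habits
        "description": "ripgrep-style regex search over the repository",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string", "description": "regex to find"},
                "path": {"type": "string", "description": "dir to search"},
            },
            "required": ["pattern"],
        },
    },
}
```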
As with all demo-heavy and especially vision AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!) From SAM 1’s 11-million-image data engine to SAM 2’s memory-based video tracking, MSL’s Segment Anything project has redefined what’s possible in computer vision. Now SAM 3 takes the next leap: concept segmentation—prompting with natural language like “yellow school bus” or “tablecloth” to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio: SAM can now even segment audio output! We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30ms on images and scales to real-time video on multi-GPU setups. We dig into the data engine that automated exhaustive annotation from two minutes per image down to 25 seconds using AI verifiers fine-tuned on Llama, the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts vs. the previous 1.2k, how SAM 3 separates recognition from localization with a presence token, why decoupling the detector and tracker was critical to preserve object identity in video, how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini, and the real-world impact: 106 million smart polygons created on Roboflow saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception. We discuss: * What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like “purple umbrella” or “watering can” * How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly * Real-time performance: 30ms per image (100 detected objects on H200), 10 objects on 2×H200 video, 28 on 4×, 64 on 8×, with parallel inference and “fast mode” tracking * The SACO benchmark: 200,000+ unique concepts vs. 1.2k in prior benchmarks, designed to capture the diversity of natural language and reach human-level exhaustivity * The data engine: from 2 minutes per image (all-human) to 45 seconds (model-in-loop proposals) to 25 seconds (AI verifiers for mask quality and exhaustivity checks), fine-tuned on Llama 3.2 * Why exhaustivity is central: every instance must be found, verified by AI annotators, and manually corrected only when the model misses—automating the hardest part of segmentation at scale * Architecture innovations: presence token to separate recognition (“is it in the image?”) from localization (“where is it?”), decoupled detector and tracker to preserve identity-agnostic detection vs. identity-preserving tracking * Building on Meta’s ecosystem: Perception Encoder, DINO v2 detector, Llama for data annotation, and SAM 2’s memory-based tracking backbone * SAM 3 Agents: using SAM 3 as a visual tool for multimodal LLMs (Gemini, Llama) to solve complex visual reasoning tasks like “find the bigger character” or “what distinguishes male from female in this image” * Fine-tuning with as few as 10 examples: domain adaptation for specialized use cases (Waymo vehicles, medical imaging, OCR-heavy scenes) and the outsized impact of negative examples * Real-world impact at Roboflow: 106M smart polygons created, saving 130+ years of labeling time across cancer research, underwater trash cleanup, autonomous drones, industrial automation, and more — MSL FAIR team * Nikhila: https://www.linkedin.com/in/nikhilaravi/ * Pengchuan: https://pzzhang.github.io/pzzhang/ Joseph Nelson * X: https://x.com/josephofiowa * LinkedIn: https://www.linkedin.com/in/josephofiowa/ Full Video Episode Timestamps * 00:00:00 Introduction and the SAM Series Legacy * 00:00:53 SAM 3 Launch: Three Models in One Release * 00:05:30 Live Demo: Concept Prompting and Visual Exemplars * 00:10:54 From Prototype to Production: The Evolution of Text Prompting * 00:14:10 Real-World Impact: 130 Years of Humanity Saved * 00:15:45 The Data Engine: Automating Exhaustive Annotation * 00:20:24 Fine-Tuning and Domain Adaptation: From Waymos to Medical Imaging * 00:25:11 Architecture Deep Dive: Decoupled Detection and Tracking * 00:28:02 SAM 3 Agent: Bridging Vision and Language Models * 00:33:20 Head-to-Head: SAM 3 vs Gemini and Florence * 00:47:50 Video Understanding and the Masklet Detection Score * 00:52:25 The Future of Perception: Native Vision vs Tool Calls * 00:57:02 Open Source Philosophy and the Path to AGI * 00:58:24 What's Next: SAM 4, Video Scale, and Beyond Human Performance * 01:05:45 Building with SAM 3: Roboflow's Rapid Auto-Labeling
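Concept prompting reduces to a very small API surface. The sketch below shows the shape of it in Python; `model.predict`, its keyword arguments, and the detection fields are hypothetical placeholders, so consult the actual SAM 3 release for real class names, checkpoints, and signatures.

```python
# Hedged sketch of concept-prompted segmentation. `model.predict` and the
# detection fields are hypothetical placeholders, not the real SAM 3 API;
# see the official release for actual checkpoints and signatures.
from PIL import Image

def segment_concept(model, image_path: str, phrase: str):
    """Find every instance of a short noun phrase, e.g. 'yellow school bus'."""
    image = Image.open(image_path)
    # Text prompt alone: no clicks, exhaustive over all instances.
    detections = model.predict(image, concept=phrase)          # hypothetical
    # Optionally refine with a visual exemplar (a box around one instance):
    # detections = model.predict(image, concept=phrase,
    #                            exemplar_boxes=[(x0, y0, x1, y1)])
    return [(d.mask, d.box, d.score) for d in detections]
```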
Note: this is Pliny and John’s first major podcast. Voices have been changed for opsec. From jailbreaking every frontier model and turning down Anthropic’s Constitutional AI challenge to leading BT6, a 28-operator white-hat hacker collective obsessed with radical transparency and open-source AI security, Pliny the Liberator and John V are redefining what AI red-teaming looks like when you refuse to lobotomize models in the name of “safety.” Pliny built his reputation crafting universal jailbreaks—skeleton keys that obliterate guardrails across modalities—and open-sourcing prompt templates like Libertas, predictive reasoning cascades, and the infamous “Pliny divider” that’s now embedded so deep in model weights it shows up unbidden in WhatsApp messages. John V, coming from prompt engineering and computer vision, co-founded the Bossy Discord (40,000 members strong) and helps steer BT6’s ethos: if you can’t open-source the data, we’re not interested. Together they’ve turned down enterprise gigs, pushed back on Anthropic’s closed bounties, and insisted that real AI security happens at the system layer—not by bubble-wrapping latent space. We sat down with Pliny and John to dig into the mechanics of hard vs. soft jailbreaks, why multi-turn crescendo attacks were obvious to hackers years before academia “discovered” them, how segmented sub-agents let one jailbroken orchestrator weaponize Claude for real-world attacks (exactly as Pliny predicted 11 months before Anthropic’s recent disclosure), why guardrails are security theater that punishes capability while doing nothing for real safety, the role of intuition and “bonding” with models to navigate latent space, how BT6 vets operators on skill and integrity, why they believe Mech Interp and open-source data are the path forward (not RLHF lobotomization), and their vision for a future where spatial intelligence, swarm robotics, and AGI alignment research happen in the open—bootstrapped, grassroots, and uncompromising. We discuss: * What universal jailbreaks are: skeleton-key prompts that obliterate guardrails across models and modalities, and why they’re central to Pliny’s mission of “liberation” * Hard vs. soft jailbreaks: single-input templates vs. 
multi-turn crescendo attacks, and why the latter were obvious to hackers long before academic papers * The Libertas repo: predictive reasoning, the Library of Babel analogy, quotient dividers, weight-space seeds, and how introducing “steered chaos” pulls models out-of-distribution * Why jailbreaking is 99% intuition and bonding with the model: probing token layers, syntax hacks, multilingual pivots, and forming a relationship to navigate latent space * The Anthropic Constitutional AI challenge drama: UI bugs, judge failures, goalpost moving, the demand for open-source data, and why Pliny sat out the $30k bounty * Why guardrails ≠ safety: security theater, the futility of locking down latent space when open-source is right behind, and why real safety work happens in meatspace (not RLHF) * The weaponization of Claude: how segmented sub-agents let one jailbroken orchestrator execute malicious tasks (pyramid-builder analogy), and why Pliny predicted this exact TTP 11 months before Anthropic’s disclosure * BT6 hacker collective: 28 operators across two cohorts, vetted on skill and integrity, radical transparency, radical open-source, and the magic of moving the needle on AI security, swarm intelligence, blockchain, and robotics — Pliny the Liberator * X: https://x.com/elder_plinius * GitHub (Libertas): https://github.com/elder-plinius/L1B3RT45 John V * X: https://x.com/JohnVersus BT6 & Bossy * BT6: https://bt6.gg * Bossy Discord: Search “Bossy Discord” or ask Pliny/John V on X Where to find Latent Space * X: https://x.com/latentspacepod Full Video Episode Timestamps * 00:00:00 Introduction: Meet Pliny the Liberator and John V * 00:01:50 The Philosophy of AI Liberation and Jailbreaking * 00:03:08 Universal Jailbreaks: Skeleton Keys to AI Models * 00:04:24 The Cat-and-Mouse Game: Attackers vs Defenders * 00:05:42 Security Theater vs Real Safety: The Fundamental Disconnect * 00:08:51 Inside the Libertas Repo: Prompt Engineering as Art * 00:16:22 The Anthropic Challenge Drama: UI Bugs and Open Source Data * 00:23:30 From Jailbreaks to Weaponization: AI-Orchestrated Attacks * 00:26:55 The BT6 Hacker Collective and BASI Community * 00:34:46 AI Red Teaming: Full Stack Security Beyond the Model * 00:38:06 Safety vs Security: Meat Space Solutions and Final Thoughts
Glean started as a Kleiner Perkins incubation and is now a $7B, $200M ARR enterprise AI leader. Now KP has tapped its own podcaster to lead its next big swing. From building go-to-market the hard way in startups (and scaling Palo Alto Networks’ public cloud business) to joining Kleiner Perkins to help technical founders turn product edge into repeatable revenue, Joubin Mirzadegan has spent the last decade obsessing over one thing: distribution and how ideas actually spread, sell, and compound. That obsession took him from launching the CRO-only podcast Grit (https://www.youtube.com/playlist?list=PLRiWZFltuYPF8A6UGm74K2q29UwU-Kk9k) as a hiring wedge, to working alongside breakout companies like Glean and Windsurf, to now incubating Roadrunner, an AI-native rethink of CPQ and quoting workflows as pricing models collapse from “seats” into consumption, bundles, renewals, and SKU sprawl. We sat down with Joubin to dig into the real mechanics of making conversations feel human (rolling early, never sending questions, temperature + lighting hacks), what Windsurf got right about “Google-class product and Salesforce-class distribution,” how to hire early sales leaders without getting fooled by shiny logos, why CPQ is quietly breaking the back of modern revenue teams, and his thesis for his new company and KP incubation Roadrunner (https://www.roadrunner.ai/): rebuild the data model from the ground up, co-develop with the hairiest design partners, and eventually use LLMs to recommend deal structures the way the best reps do, without the Slack-channel chaos of deal desk. We discuss: * How to make guests instantly comfortable: rolling early, no “are you ready?”, temperature, lighting, and room dynamics * Why Joubin refuses to send questions in advance (and when you might have to anyway) * The origin of the CRO-only podcast: using media as a hiring wedge and relationship engine * The “commit to 100 episodes” mindset: why most shows die before they find their voice * Founder vs exec interviews: why CEOs can speak more freely (and what it unlocks in conversation) * What Glean taught him about enterprise AI: permissions, trust, and overcoming “category is dead” skepticism * Design partners as the real unlock: why early believers matter and how co-development actually works * Windsurf’s breakout: what it means to be serious about “Google-class product + Salesforce-class distribution” * Why technical founders struggle with GTM and how KP built a team around sales, customer access, and demand gen * Hiring early sales leaders: anti-patterns (logos), what to screen for (motivation), and why stage-fit is everything * The CPQ problem & Roadrunner’s thesis: rebuilding CPQ/quoting from the data model up for modern complexity * How “rules + SKUs + approvals” create a brittle graph and what it takes to model it without tipping over * The two-year window: incumbents rebuilding slowly vs startups out-sprinting with AI-native architecture * Where AI actually helps: quote generation, policy enforcement, approval routing, and deal recommendation loops — Joubin * X: https://x.com/Joubinmir * LinkedIn: https://www.linkedin.com/in/joubin-mirzadegan-66186854/ Where to find Latent Space * X: https://x.com/latentspacepod Full Video Episode Timestamps * 00:00:00 Introduction and the Zuck Interview Experience * 00:03:26 The Genesis of the Grit Podcast: Hiring CROs Through Content * 00:13:20 Podcast Philosophy: Creating Authentic Conversations * 00:15:44 Working with Arvind at Glean: The Enterprise Search Breakthrough * 00:26:20 Windsurf's Sales Machine: Google-Class Product Meets Salesforce-Class Distribution * 00:30:28 Hiring Sales Leaders: Anti-Patterns and First Principles * 00:39:02 The CPQ Problem: Why Salesforce and Legacy Tools Are Breaking * 00:43:40 Introducing Roadrunner: Solving Enterprise Pricing with AI * 00:49:19 Building Roadrunner: Team, Design Partners, and Data Model Challenges * 00:59:35 High Performance Philosophy: Working Out Every Day and Reducing Friction * 01:06:28 Defining Grit: Passion Plus Perseverance
From applied cryptography and offensive security in France’s defense industry to optimizing nuclear submarine workflows, then selling his e-signature startup to Docusign (https://www.docusign.com/company/news-center/opentrust-joins-docusign-global-trust-network), and now running AI as CTO of Superhuman Mail (Superhuman, recently acquired by Grammarly: https://techcrunch.com/2025/07/01/grammarly-acquires-ai-email-client-superhuman/), Loïc Houssier has lived the full arc from deep infra and compliance hell to obsessing over 100ms product experiences and AI-native email. We sat down with Loïc to dig into how you actually put AI into an inbox without adding latency, why Superhuman leans so hard into agentic search and “Ask AI” over your entire email history, how they design tools vs. agents and fight agent laziness, what box-priced inference and local-first caching mean for cost and reliability, and his bet that your inbox will power your future AI EA while AI massively widens the gap between engineers with real fundamentals and those faking it. We discuss: * Loïc’s path from applied cryptography and offensive security in France’s defense industry to submarines, e-signatures, Docusign, and now Superhuman Mail * What 3,000+ engineers actually do at a “simple” product like Docusign: regional compliance, on-prem appliances, and why global scale explodes complexity * How Superhuman thinks about AI in email: auto-labels, smart summaries, follow-up nudges, “Ask AI” search, and the rule that AI must never add latency or friction * Superhuman’s agentic framework: tools vs. agents, fighting “agent laziness,” deep semantic search over huge inboxes, and pagination strategies to find the real needle in the haystack * How they evaluate OpenAI, Anthropic, Gemini, and open models: canonical queries, end-to-end evals, date reasoning, and Rahul’s infamous “what wood was my table?” test * Infra and cost philosophy: local-first caching, vector search backends, Baseten “box” pricing vs.
per-token pricing, and thinking in price-per-trillion-tokens instead of price-per-million * The vision of Superhuman as your AI EA: auto-drafting replies in your voice, scheduling on your behalf, and using your inbox as the ultimate private data source * How the Grammarly + Coda + Superhuman stack could power truly context-aware assistance across email, docs, calendars, contracts, and more * Inside Superhuman’s AI-dev culture: free-for-all tool adoption, tracking AI usage on PRs, and going from ~4 to ~6 PRs per engineer per week * Why Loïc believes everyone should still learn to code, and how AI will amplify great engineers with strong fundamentals while exposing shallow ones even faster — Loïc Houssier * LinkedIn: https://www.linkedin.com/in/houssier/ Where to find Latent Space * X: https://x.com/latentspacepod Full Video Episode Timestamps * 00:00:00 Introduction and Loïc's Journey from Nuclear Submarines to Superhuman * 00:06:40 Docusign Acquisition and the Enterprise Email Stack * 00:10:26 Superhuman's AI Vision: Your Inbox as the Real AI Agent * 00:13:20 Ask AI: Agentic Search and the Quality Problem * 00:18:20 Infrastructure Choices: Model Selection, Baseten, and Cost Management * 00:27:30 Local-First Architecture and the Database Stack * 00:30:50 Evals, Quality, and the Rahul Wood Table Test * 00:38:40 Voice, Video, and the End of Writing * 00:42:30 The Future EA: Auto-Drafting and Proactive Assistance * 00:46:40 Grammarly Acquisition and the Contextual Advantage * 00:51:40 Knowledge Graphs: The Hard Problem Nobody Has Solved * 00:56:40 Competing with OpenAI and the Browser Question * 01:02:30 AI Coding Tools: From 4 to 6 PRs Per Week * 01:08:00 Engineering Culture, Hiring, and the Future of Software Development
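The price-per-trillion-tokens framing is worth making concrete. A back-of-envelope comparison with made-up numbers (assumptions, not quotes from Baseten or anyone else) shows why fixed-capacity "box" pricing only wins at high utilization:

```python
# Back-of-envelope: metered per-token pricing vs. a fixed-price "box",
# expressed as price per trillion tokens. All numbers are illustrative
# assumptions, not vendor quotes.
PER_MILLION = 0.50                    # $/1M tokens on a metered API (assumed)
BOX_MONTHLY = 6_000.0                 # $/month for dedicated capacity (assumed)
BOX_TOKENS_PER_SEC = 10_000           # sustained throughput (assumed)

metered_per_trillion = PER_MILLION * 1_000_000          # $0.50/M -> $500k/T
box_tokens = BOX_TOKENS_PER_SEC * 3600 * 24 * 30        # ~25.9B tokens/month
box_per_trillion = BOX_MONTHLY / box_tokens * 1e12      # ~$231k/T if saturated

print(f"metered: ${metered_per_trillion:,.0f} per trillion tokens")
print(f"box at 100% utilization: ${box_per_trillion:,.0f} per trillion tokens")
print(f"box at 10% utilization: ${box_per_trillion * 10:,.0f} per trillion tokens")
```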
From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-500-million-startup-videogame-data) and raising a $134M seed from Khosla (https://techcrunch.com/2025/10/16/general-intuition-lands-134m-seed-to-teach-agents-spatial-reasoning-using-video-game-clips/) to spin out General Intuition, Pim is betting that world models trained on peak human gameplay are the next frontier after LLMs. We sat down with Pim to dig into why game highlights are “episodic memory for simulation” (and how Medal’s privacy-first action labels became a world-model goldmine https://medal.tv/blog/posts/enabling-state-of-the-art-security-and-protections-on-medals-new-apm-and-controller-overlay-features), what it takes to build fully vision-based agents that just see frames and output actions in real time, how General Intuition transfers from games to real-world video and then into robotics, why world models and LLMs are complementary rather than rivals, what founders with proprietary datasets should know before selling or licensing to labs, and his bet that spatial-temporal foundation models will power 80% of future atoms-to-atoms interactions in both simulation and the real world. We discuss: * How Medal’s 3.8B action-labeled highlight clips became a privacy-preserving goldmine for world models * Building fully vision-based agents that only see frames and output actions yet play like (and sometimes better than) humans * Transferring from arcade-style games to realistic games to real-world video using the same perception–action recipe * Why world models need actions, memory, and partial observability (smoke, occlusion, camera shake) vs. “just” pretty video generation * Distilling giant policies into tiny real-time models that still navigate, hide, and peek corners like real players * Pim’s path from RuneScape private servers, Tourette’s, and reverse engineering to leading a frontier world-model lab * How data-rich founders should think about valuing their datasets, negotiating with big labs, and deciding when to go independent * GI’s first customers: replacing brittle behavior trees in games, engines, and controller-based robots with a “frames in, actions out” API * Using Medal clips as “episodic memory of simulation” to move from imitation learning to RL via world models and negative events * The 2030 vision: spatial–temporal foundation models that power the majority of atoms-to-atoms interactions in simulation and the real world — Pim * X: https://x.com/PimDeWitte * LinkedIn: https://www.linkedin.com/in/pimdw/ Where to find Latent Space * X: https://x.com/latentspacepod Full Video Episode Timestamps * 00:00:00 Introduction and Medal's Gaming Data Advantage * 00:02:08 Exclusive Demo: Vision-Based Gaming Agents * 00:06:17 Action Prediction and Real-World Video Transfer * 00:08:41 World Models: Interactive Video Generation * 00:13:42 From RuneScape to AI: Pim's Founder Journey * 00:16:45 The Research Foundations: Diamond, Genie, and SEMA * 00:33:03 Vinod Khosla's Largest Seed Bet Since OpenAI * 00:35:04 Data Moats and Why GI Stayed Independent * 00:38:42 Self-Teaching AI Fundamentals: The Francois Fleuret Course * 00:40:28 Defining World Models vs Video Generation * 00:41:52 Why Simulation Complexity Favors World Models * 00:43:30 World Labs, Yann LeCun, and the Spatial Intelligence Race * 00:50:08 Business Model: APIs, Agents, and Game Developer Partnerships * 00:58:57 From Imitation Learning to RL: Making Clips Playable * 01:00:15 Open Research, Academic Partnerships, and Hiring * 01:02:09 2030 Vision: 80 Percent of Atoms-to-Atoms AI Interactions
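The "frames in, actions out" interface is easy to picture in code. This is a conceptual Python sketch with stand-in classes, since General Intuition's real API is not public here; the point is that the policy sees only pixels, never game state.

```python
# Conceptual sketch of "frames in, actions out": the policy sees pixels
# only, never game state. DummyEnv and VisionPolicy are stand-ins; the
# real models run vision encoders plus a policy head in real time.
import numpy as np

class DummyEnv:
    def reset(self):
        return np.zeros((4, 64, 64, 3), dtype=np.uint8)   # stack of frames
    def step(self, action):
        return np.zeros((4, 64, 64, 3), dtype=np.uint8), False

class VisionPolicy:
    def act(self, frames: np.ndarray) -> dict:
        # Placeholder controller output; a real policy infers it from pixels.
        return {"stick": (0.0, 1.0), "buttons": {"jump": False}}

def rollout(env, policy, horizon: int = 100):
    frames = env.reset()
    for _ in range(horizon):
        frames, done = env.step(policy.act(frames))       # vision only
        if done:
            break

rollout(DummyEnv(), VisionPolicy())
```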
Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D. We discuss: * The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone. * What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets. * Fei-Fei’s essay on spatial intelligence as a form of intelligence distinct from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in. * Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning. * The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem. * Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters. * Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots. * How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn’t to throw away LLMs but to complement them with rich, embodied models of the world. * Fei-Fei and Justin’s long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making. — Fei-Fei Li * X: https://x.com/drfeifei * LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247 Justin Johnson * X: https://x.com/jcjohnss * LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664 Where to find Latent Space * X: https://x.com/latentspacepod Full Video Episode Timestamps * 00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership * 00:02:00 From ImageNet to World Models: The Evolution of Computer Vision * 00:12:42 Dense Captioning and Early Vision-Language Work * 00:19:57 Spatial Intelligence: Beyond Language Models * 00:22:10 Physics, Dynamics, and the Future of World Models * 00:28:46 Introducing Marble: World Labs' First Spatial Intelligence Model * 00:33:21 Gaussian Splats and the Technical Architecture of Marble * 00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI * 00:41:09 Multimodality and the Interplay of Language and Space * 00:56:58 Hiring, Research Directions, and the Future of World Labs
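The "transformers are set models" claim is checkable in a few lines: with no positional encodings, self-attention is permutation-equivariant, so token order is information you add back in rather than something the architecture assumes. A small numpy verification (single head, no positions):

```python
# Check of the "set model" claim: self-attention without positional
# encodings is permutation-equivariant, i.e. shuffling the input tokens
# just shuffles the output rows the same way.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

def attn(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = np.exp(Q @ K.T / np.sqrt(8))
    scores /= scores.sum(axis=1, keepdims=True)   # row-wise softmax
    return scores @ V

perm = rng.permutation(5)
assert np.allclose(attn(X[perm]), attn(X)[perm])  # permute in = permute out
print("attention is permutation-equivariant without positions")
```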
Alex Lieberman and Arman Hezarkani, co-founders of Tenex, reveal how they’re revolutionizing software consulting by compensating AI engineers for output rather than hours—enabling some engineers to earn over $1 million annually while delivering 10x productivity gains. Their company represents a fundamental rethinking of knowledge work compensation in the age of AI agents, where traditional hourly billing models perversely incentivize slower work even as AI tools enable unprecedented speed.
The Genesis: From 90% Downsizing to 10x Output
The story behind 10X begins with Arman’s previous company, Parthian, where he was forced to downsize his engineering team by 90%. Rather than collapse, Arman re-architected the entire product and engineering process to be AI-first—and discovered that production-ready software output increased 10x despite the massive headcount reduction. This counterintuitive result exposed a fundamental misalignment: engineers compensated by the hour are disincentivized from leveraging AI to work faster, even when the technology enables dramatic productivity gains. Alex, who had invested in Parthian, initially didn’t believe the numbers until Arman walked him through why LLMs have made such a profound impact specifically on engineering as knowledge work.
The Economic Model: Story Points Over Hours
10X’s core innovation is compensating engineers based on story points—units of completed, quality output—rather than hours worked. This creates direct economic incentives for engineers to adopt every new AI tool, optimize their workflows, and maximize throughput. The company expects multiple engineers to earn over $1 million in cash compensation next year purely from story point earnings. To prevent gaming the system, they hire for two profiles: engineers who are “long-term selfish” (understanding that inflating story points will destroy client relationships) and those who genuinely love writing code and working with smart people. They also employ technical strategists incentivized on client retention (NRR) who serve as the final quality gate before any engineering plan reaches a client.
Impressive Builds: From Retail AI to App Store Hits
The results speak for themselves. In one project, 10X built a computer vision system for retail cameras that provides heat maps, queue detection, shelf stocking analysis, and theft detection—creating early prototypes in just two weeks for work that previously took quarters. They built Snapback Sports’ mobile trivia app in one month, which hit 20th globally on the App Store. In a sales context, an engineer spent four hours building a working prototype of a fitness influencer’s AI health coach app after the prospect initially said no—immediately moving 10X to the top of their vendor list. These examples demonstrate how AI-enabled speed fundamentally changes sales motions and product development timelines.
The Interview Process: Unreasonably Difficult Take-Homes
Despite concerns that AI would make take-home assessments obsolete, 10X still uses them—but makes them “unreasonably difficult.” About 50% of candidates don’t even respond, but those who complete the challenge demonstrate the caliber needed. The interview process is remarkably short: two calls before the take-home, review, then one or two final meetings—completable in as little as a week. A signature question: “If you had infinite resources to build an AI that could replace either of us on this call, what would be the first major bottleneck?” The sophisticated answer isn’t just “model intelligence” or “context length”—it’s controlling entropy, the accumulating error rate that derails autonomous agents over time.
The Limiting Factor: Human Capital, Not Technology
Despite being an AI-first company, 10X’s primary constraint is human capital—finding and hiring enough exceptional engineers fast enough, then matching them with the right processes to maintain delivery quality as they scale. The company has ambitions beyond consulting to build their own technology, but for the foreseeable future, recruiting remains the bottleneck. This reveals an important insight about the AI era: even as technology enables unprecedented leverage, the constraint shifts to finding people who can harness that leverage effectively.
Full Video Episode Timestamps * 00:00:00 Introduction and Meeting the 10X Co-founders * 00:01:29 The 10X Moment: From Hourly Billing to Output-Based Compensation * 00:04:44 The Economic Model Behind 10X * 00:05:42 Story Points and Measuring Engineering Output * 00:08:41 Impressive Client Projects and Rapid Prototyping * 00:12:22 The 10X Tech Stack: TypeScript and High Structure * 00:13:21 AI Coding Tools: The Daily Evolution * 00:15:05 Human Capital as the Limiting Factor * 00:16:02 The Unreasonably Difficult Interview Process * 00:17:14 Entropy and Context Engineering: The Future of AI Agents * 00:23:28 The MCP Debate and AI Industry Sociology * 00:26:01 Consulting, Digital Transformation, and Conference Insights
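The "controlling entropy" answer has simple arithmetic behind it: independent per-step error rates compound over long agent runs. A toy illustration (the probabilities are assumptions, not measurements):

```python
# Toy illustration of "entropy" as accumulating error: if each autonomous
# step succeeds independently with probability p, long chains fail almost
# surely unless p is near 1 or the agent can detect and repair mistakes.
for p in (0.99, 0.999):
    for steps in (10, 100, 1000):
        print(f"p={p}: P(all {steps} steps ok) = {p ** steps:.3f}")
# p=0.99 survives 100 steps only ~37% of the time, which is why the
# "sophisticated answer" is verification loops, not just smarter models.
```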
Deedy Das, Partner at Menlo Ventures, returns to Latent Space to discuss his journey from Glean to venture capital, the explosive rise of Anthropic, and how AI is reshaping enterprise software and coding. From investing in Anthropic early on when they had no revenue to managing the $100M Anthology Fund, Das shares insider perspectives on the fastest-growing software company in history and what’s next for AI infrastructure, research investing, and the future of engineering. We cover Glean’s rise from “boring” enterprise search to a $7B AI-native company, Anthropic’s meteoric rise, the strategic decisions behind products like Claude Code, and why market share in enterprise AI is shifting dramatically. Das explains his investment thesis on research companies like Goodfire, Prime Intellect, and OpenRouter and how the Anthology Fund is quietly seeding the next wave of AI infra, research, and devtools. Full Video Episode Timestamps * 00:00:00 Introduction and Deedy’s Return to Latent Space * 00:01:20 Glean’s Journey: From Boring Enterprise Search to Valuation * 00:15:37 Anthropic’s Meteoric Rise and Market Share Dynamics * 00:17:50 Claude Artifacts and Product Innovation * 00:41:20 The Anthology Fund: Investing in the Anthropic Ecosystem * 00:48:01 Goodfire and Mechanistic Interpretability * 00:51:25 Prime Intellect and Distributed AI Training * 00:53:40 OpenRouter: Building the AI Model Gateway * 01:13:36 The Stargate Project and Infrastructure Arms Race * 01:18:14 The Future of Software Engineering and AI Coding
Jared Palmer, SVP at GitHub and VP of CoreAI at Microsoft, joins Latent Space for an in-depth look at the evolution of coding agents and modern developer tools. Having recently joined from Vercel, where he led AI initiatives, Palmer shares firsthand insights from behind the scenes at GitHub Universe, including the launch of Agent HQ, a new collaboration hub for coding agents and developers. This episode traces Palmer’s journey from building Copilot-inspired tools to pioneering the focused Next.js coding agent, v0, and explores how platform constraints fostered rapid experimentation and a breakout success in AI-powered frontend development. Palmer explains the unique advantages of GitHub’s massive developer network, the challenges of scaling agent-based workflows, and why integrating seamless AI into developer experiences is now a top priority for both Microsoft and GitHub. Full Video Episode Timestamps * 00:00:00 Introduction and Jared's New Role at GitHub * 00:01:00 From V0 to Agent HQ: The Evolution of Coding Agents * 00:02:51 The V0 Origin Story: From ChatGPT to AI Playground * 00:05:40 Building the AI SDK and ShadCN Collaboration * 00:07:08 The Birth of V0: Prompt to UI Revolution * 00:09:18 V0's Growth Journey and Model Evolution * 00:11:05 Model Strategy: Composite Models vs User Choice * 00:13:16 GitHub's Agent HQ and Model Marketplace * 00:15:51 The Future of Agent Abstraction and Standards * 00:16:33 Microsoft Core AI Integration and Workflow Vision * 00:18:37 Dev Containers and Repo Setup Challenges * 00:24:10 Agent Quality and Infrastructure Reliability * 00:27:05 Using Coding Agents for Non-Coding Tasks * 00:29:11 GitHub Homepage Redesign and Community Feedback * 00:30:27 Stacked Diffs: GitHub's Most Requested Feature
Jed Borovik, Product Lead at Google Labs, joins Latent Space to unpack how Google is building the future of AI-powered software development with Jules. From his journey discovering GenAI through Stable Diffusion to leading one of the most ambitious coding agent projects in tech, Borovik shares behind-the-scenes insights into how Google Labs operates at the intersection of DeepMind’s model development and product innovation. We explore Jules’ approach to autonomous coding agents and why they run on their own infrastructure, how Google simplified their agent scaffolding as models improved, and why embeddings-based RAG is giving way to attention-based search. Borovik reveals how developers are using Jules for hours or even days at a time, the challenges of managing context windows that push 2 million tokens, and why coding agents represent both the most important AI application and the clearest path to AGI. This conversation reveals Google’s positioning in the coding agent race, the evolution from internal tools to public products, and what founders, developers, and AI engineers should understand about building for a future where AI becomes the new brush for software engineering. Full Video Episode Timestamps * 00:00:00 Introduction and GitHub Universe Recap * 00:00:57 New York Tech Scene and East Coast Hackathons * 00:02:19 From Google Search to AI Coding: Jed's Journey * 00:04:19 Google Labs Mission and DeepMind Collaboration * 00:06:41 Jules: Autonomous Coding Agents Explained * 00:09:39 The Evolution of Agent Scaffolding and Model Quality * 00:11:30 RAG vs Attention: The Shift in Code Understanding * 00:13:49 Jules' Journey from Preview to Production * 00:15:05 AI Engineer Summit: Community Building and Networking * 00:25:06 Context Management in Long-Running Agents * 00:29:02 The Future of Software Engineering with AI * 00:36:26 Beyond Vibe Coding: Spec Development and Verification * 00:40:20 Multimodal Input and Computer Use for Coding Agents
In this conversation with Malte Ubl, CTO of Vercel (http://x.com/cramforce), we explore how the company is pioneering the infrastructure for AI-powered development through their comprehensive suite of tools including workflows, AI SDK, and the newly announced agent ecosystem. Malte shares insights into Vercel’s philosophy of “dogfooding” - never shipping abstractions they haven’t battle-tested themselves - which led to extracting their AI SDK from v0 and building production agents that handle everything from anomaly detection to lead qualification. The discussion dives deep into Vercel’s new Workflow Development Kit, which brings durable execution patterns to serverless functions, allowing developers to write code that can pause, resume, and wait indefinitely without cost. Malte explains how this enables complex agent orchestration with human-in-the-loop approvals through simple webhook patterns, making it dramatically easier to build reliable AI applications. We explore Vercel’s strategic approach to AI agents, including their DevOps agent that automatically investigates production anomalies by querying observability data and analyzing logs - solving the recall-precision problem that plagues traditional alerting systems. Malte candidly discusses where agents excel today (meeting notes, UI changes, lead qualification) versus where they fall short, emphasizing the importance of finding the “sweet spot” by asking employees what they hate most about their jobs. The conversation also covers Vercel’s significant investment in Python support, bringing zero-config deployment to Flask and FastAPI applications, and their vision for security in an AI-coded world where developers “cannot be trusted.” Malte shares his perspective on how CTOs must transform their companies for the AI era while staying true to their core competencies, and why maintaining strong IC (individual contributor) career paths is crucial as AI changes the nature of software development. What was launched at Ship AI 2025: AI SDK 6.0 & Agent Architecture * Agent Abstraction Philosophy: AI SDK 6 introduces an agent abstraction where you can “define once, deploy everywhere”. How does this differ from existing agent frameworks like LangChain or AutoGPT? What specific pain points did you observe in production that led to this design? * Human-in-the-Loop at Scale: The tool approval system with needsApproval: true gates actions until human confirmation. How do you envision this working at scale for companies with thousands of agent executions? What’s the queue management and escalation strategy? * Type Safety Across Models: AI SDK 6 promises “end-to-end type safety across models and UI”. Given that different LLMs have varying capabilities and output formats, how do you maintain type guarantees when swapping between providers like OpenAI, Anthropic, or Mistral? Workflow Development Kit (WDK) * Durability as Code: The use workflow primitive makes any TypeScript function durable with automatic retries, progress persistence, and observability. What’s happening under the hood? Are you using event sourcing, checkpoint/restart, or a different pattern? * Infrastructure Provisioning: Vercel automatically detects when a function is durable and dynamically provisions infrastructure in real-time. What signals are you detecting in the code, and how do you determine the optimal infrastructure configuration (queue sizes, retry policies, timeout values)? 
Vercel Agent (beta) * Code Review Validation: The Agent reviews code and proposes “validated patches”. What does “validated” mean in this context? Are you running automated tests, static analysis, or something more sophisticated? * AI Investigations: Vercel Agent automatically opens AI investigations when it detects performance or error spikes using real production data. What data sources does it have access to? How does it distinguish between normal variance and actual anomalies? Python Support (For the first time, Vercel now supports Python backends natively.) Marketplace & Agent Ecosystem * Agent Network Effects: The Marketplace now offers agents like CodeRabbit, Corridor, Sourcery, and integrations with Autonoma, Braintrust, Browser Use. How do you ensure these third-party agents can’t access sensitive customer data? What’s the security model? “An Agent on Every Desk” Program * Vercel launched a new program to help companies identify high-value use cases and build their first production AI agents. It provides consultations, reference templates, and hands-on support to go from idea to deployed agent Full Video Episode Timestamps 00:00 Introduction and Malte’s Background at Google 01:16 Vercel’s AI Engineering Philosophy and Ship AI Recap 03:19 Deep Dive: Workflows vs Agents Architecture 09:33 AI SDK Success Story: Staying Low-Level and Humble 16:35 Framework Design Principles and Open Source Strategy 19:20 Vercel Agent: AI-Powered DevOps and Anomaly Detection 27:06 Internal Agent Use Cases: Lead Qualification and Abuse Analysis 29:49 Agent on Every Desk Program and Enterprise Adoption 32:13 Python Support and Multi-Language Infrastructure 39:42 The Future of AI-Native Security and Development
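Durable execution is easier to reason about with a toy model. The Python sketch below journals each step's result so a paused or crashed run replays past work instead of redoing side effects; it illustrates the general pattern only, not the WDK API (which is TypeScript and uses the `use workflow` primitive), and the file-based approval gate is a stand-in for a webhook.

```python
# Toy model of durable execution: each completed step is journaled, so a
# re-invoked run replays recorded results instead of redoing side effects.
# Pattern illustration only; Vercel's WDK is TypeScript (`use workflow`),
# and the approval marker file stands in for a human-approval webhook.
import json, os

JOURNAL = "journal.json"

def load():  return json.load(open(JOURNAL)) if os.path.exists(JOURNAL) else {}
def save(s): json.dump(s, open(JOURNAL, "w"))

def step(name, fn):
    state = load()
    if name in state:            # already ran once: replay, don't re-execute
        return state[name]
    state[name] = fn()           # side effect happens at most once
    save(state)
    return state[name]

def workflow():
    order = step("create_order", lambda: {"id": 42})
    if not os.path.exists(f"approved-{order['id']}"):      # human-in-the-loop
        return "suspended: waiting for approval"           # re-invoke later
    return step("fulfill", lambda: f"shipped order {order['id']}")

print(workflow())  # first run suspends; create the marker file and run again
```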
In this deep dive with Kyle Corbitt, co-founder and CEO of OpenPipe (recently acquired by CoreWeave), we explore the evolution of fine-tuning in the age of AI agents and the critical shift from supervised fine-tuning to reinforcement learning. Kyle shares his journey from leading YC’s Startup School to building OpenPipe, initially focused on distilling expensive GPT-4 workflows into smaller, cheaper models before pivoting to RL-based agent training as frontier model prices plummeted. The conversation reveals why 90% of AI projects remain stuck in proof-of-concept purgatory - not due to capability limitations, but reliability issues that Kyle believes can be solved through continuous learning from real-world experience. He discusses the breakthrough of RULER (Relative Universal LLM-Elicited Rewards), which uses LLMs as judges to rank agent behaviors relatively rather than absolutely, making RL training accessible without complex reward engineering. Kyle candidly assesses the challenges of building realistic training environments for agents, explaining why GRPO (despite its advantages) may be a dead end due to its requirement for perfectly reproducible parallel rollouts. He shares insights on why LoRAs remain underrated for production deployments, why GEPA and prompt optimization haven’t lived up to the hype in his testing, and why the hardest part of deploying agents isn’t the AI - it’s sandboxing real-world systems with all their bugs and edge cases intact. The discussion also covers OpenPipe’s acquisition by CoreWeave, the launch of their serverless reinforcement learning platform, and Kyle’s vision for a future where every deployed agent continuously learns from production experience. He predicts that solving the reliability problem through continuous RL could unlock 10x more AI inference demand from projects currently stuck in development, fundamentally changing how we think about agent deployment and maintenance.
Key Topics: * The rise and fall of fine-tuning as a business model * Why 90% of AI projects never reach production * RULER: Making RL accessible through relative ranking * The environment problem: Why sandboxing is harder than training * GRPO vs PPO and the future of RL algorithms * LoRAs: The underrated deployment optimization * Why GEPA and prompt optimization disappointed in practice * Building world models as synthetic training environments * The $500B Stargate bet and OpenAI’s potential crypto play * Continuous learning as the path to reliable agents References https://www.linkedin.com/in/kcorbitt/ * Aug 2023 https://openpipe.ai/blog/from-prompts-to-models * DEC 2023 https://openpipe.ai/blog/mistral-7b-fine-tune-optimized * JAN 2024 https://openpipe.ai/blog/s-lora * MAY 2024 https://openpipe.ai/blog/the-ten-commandments-of-fine-tuning-in-prod * Oct 2024 https://openpipe.ai/blog/announcing-dpo-support * AIE NYC 2025 Finetuning 500m agents * AIEWF 2025 How to train your agent (ART-E) * SEPT 2025 ACQUISITION https://openpipe.ai/blog/openpipe-coreweave * W&B Serverless RL https://openpipe.ai/blog/serverless-rl Full Video Episode Timestamps 00:00 Introductions 03:15 The Evolution of OpenPipe: From SFT to RL 07:49 The Mistral Era and LoRA Adapters 11:40 When You Actually Need Fine-Tuning 14:43 The Pivot to Reinforcement Learning 21:29 GRPO vs PPO: The Technical Trade-offs 24:02 The Environment Problem in RL 35:52 GEPA and Automated Prompt Optimization 44:35 Open vs Closed Models: The Token Economics 50:38 Ruler: Self-Supervised RL Rewards 57:09 World Models as Environment Solutions 1:00:15 CoreWeave Acquisition and Future Vision
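RULER's core move (relative ranking instead of absolute rewards) fits in a few lines. A hedged sketch: the judge is any function that ranks a group of rollouts best-to-worst, and ranks are converted into centered, GRPO-style advantages; OpenPipe's actual prompts and API differ.

```python
# Sketch of RULER-style relative rewards: an LLM judge ranks a group of
# candidate trajectories against each other, and ranks become centered
# advantages. No hand-engineered reward function; the judge here is any
# callable you supply (OpenPipe's real prompts and API differ).
def ruler_advantages(trajectories, judge):
    """judge(trajectories) -> list of indices, best first."""
    order = judge(trajectories)
    n = len(order)
    span = max(n - 1, 1)
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = (n - 1 - rank) / span      # best=1.0 ... worst=0.0
    mean = sum(scores) / n
    return [s - mean for s in scores]            # relative, zero-centered

# e.g. with 4 rollouts of the same task and a placeholder judge:
fake_judge = lambda ts: sorted(range(len(ts)), key=len)
print(ruler_advantages(["aaa", "a", "aa", "aaaa"], fake_judge))
```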
At OpenAI DevDay, we sit down with Sherwin Wu and Christina Huang from the OpenAI Platform Team to discuss the launch of AgentKit - a comprehensive suite of tools for building, deploying, and optimizing AI agents. Christina walks us through the live demo she performed on stage, building a customer support agent in just 8 minutes using the visual Agent Builder, while Sherwin shares insights on how OpenAI is inverting the traditional website-chatbot paradigm by embedding apps directly within ChatGPT through the new Apps SDK. The conversation explores how OpenAI is tackling the challenges developers face when taking agents to production - from writing and optimizing prompts to building evaluation pipelines. They discuss the decision to adopt Anthropic’s MCP protocol for tool connectivity, the importance of visual workflows for complex agent systems, and how features like human-in-the-loop approvals and automated prompt optimization are making agent development more accessible to a broader range of developers. Sherwin and Christina also reveal how OpenAI is dogfooding these tools internally, with their own customer support at openai.com already powered by AgentKit, and share candid insights about the evolution from plugins to GPTs to this new agent platform. They discuss the surprising persistence of prompting as a critical skill (contrary to predictions from two years ago), the challenges of serving custom fine-tuned models at scale, and why they believe visual agent builders are essential as workflows grow to span dozens of nodes. Guests: * Sherwin Wu: Head of Engineering, OpenAI Platform https://www.linkedin.com/in/sherwinwu1/ https://x.com/sherwinwu?lang=en * Christina Huang: Platform Experience, OpenAI https://x.com/christinaahuang https://www.linkedin.com/in/christinaahuang/ Thanks very much to Lindsay and Shaokyi for helping us set up this great deepdive into the new DevDay launches! Key Topics: * AgentKit launch: Agent SDK, Builder, Evals, and deployment tools * Apps SDK and the inversion of the app-chatbot paradigm * Adopting MCP protocol for universal tool connectivity * Visual agent building vs code-first approaches * Human-in-the-loop workflows and approval systems * Automated prompt optimization and “zero-gradient fine-tuning” * Service Health Dashboard and achieving five nines reliability * ChatKit as an embeddable, evergreen chat interface * The evolution from plugins to GPTs to agent platforms * Internal dogfooding with Codex and agent-powered support Full Video Episode Timestamps 00:00 Welcome to the OpenAI Dev Day Studio 01:11 Dev Day Evolution and Community Growth 03:08 Apps SDK and ChatGPT Distribution Strategy 05:27 MCP Protocol Integration Decision 09:26 Agent Kit Launch and Platform Vision 11:33 Agent Builder Canvas and Visual Workflows 17:22 Evaluations and Agent Testing Evolution 19:20 Automated Prompt Optimization and Research 26:35 Connector Registry and MCP Servers 34:10 Chat Kit as Consumer-Grade Infrastructure 39:13 Codex Power User Tips and AI-Native Development 42:27 Service Health Dashboard and Reliability Journey
Dylan Field (CEO Figma) on how they are letting designers build with Figma Make, how Figma can be the context repository for aesthetic in the age of vibe coding, and why design is your only differentiator now. Full show notes: https://www.latent.space/p/figma
Quinn Slack (CEO) and Thorsten Ball (Amp Dictator) from Sourcegraph join the show to talk about Amp Code, how they ship 15x/day with no code reviews, and why subagents and prompt optimizers aren’t a promising direction for coding agents. Amp Code: https://ampcode.com/ Latent Space: https://latent.space/ Full Video Episode Timestamps * 00:00 Introduction * 00:41 Transition from Cody to Amp * 03:18 The Importance of Building the Best Coding Agent * 06:43 Adapting to a Rapidly Evolving AI Tooling Landscape * 09:36 Dogfooding at Sourcegraph * 12:35 CLI vs. VS Code Extension * 21:08 Positioning Amp in Coding Agent Market * 24:10 The Diminishing Importance of Model Selectors * 32:39 Tooling vs. Harness * 37:19 Common Failure Modes of Coding Agents * 47:33 Agent-Friendly Logging and Tooling * 52:31 Are Subagents Real? * 56:52 New Frameworks and Agent-Integrated Developer Tools * 1:00:25 How Agents Are Encouraging Codebase and Workflow Changes * 1:03:13 Evolving Outer Loop Tasks * 1:07:09 Version Control and Merge Conflicts in an AI-First World * 1:10:36 Rise of User-Generated Enterprise Software * 1:14:39 Empowering Technical Leaders with AI * 1:17:11 Evaluating Product Without Traditional Evals * 1:20:58 Hiring
* Lance: https://www.linkedin.com/in/lance-martin-64a33b5/
* How Context Fails: https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
* How New Buzzwords Get Created: https://www.dbreunig.com/2025/07/24/why-the-term-context-engineering-matters.html
* Context Engineering: https://rlancemartin.github.io/2025/06/23/context_engineering/ and https://docs.google.com/presentation/d/16aaXLu40GugY-kOpqDU4e-S0hD1FmHcNyF0rRRnb1OU/edit?usp=sharing
* Manus Post: https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
* Cognition Post: https://cognition.ai/blog/dont-build-multi-agents
* Multi-Agent Researcher: https://www.anthropic.com/engineering/multi-agent-research-system
* Human-in-the-loop + Memory: https://github.com/langchain-ai/agents-from-scratch
- Bitter Lesson in AI Engineering -
* Hyung Won Chung on the Bitter Lesson in AI Research:
* Bitter Lesson w/ Claude Code:
* Learning the Bitter Lesson in AI Engineering: https://rlancemartin.github.io/2025/07/30/bitter_lesson/
* Open Deep Research: https://github.com/langchain-ai/open_deep_research and https://academy.langchain.com/courses/deep-research-with-langgraph
* Scaling and building things that “don’t yet work”:
- Frameworks -
* Roast framework at Shopify / standardization of orchestration tools:
* MCP adoption within Anthropic / standardization of protocols:
* How to think about frameworks: https://blog.langchain.com/how-to-think-about-agent-frameworks/
* RAG benchmarking: https://rlancemartin.github.io/2025/04/03/vibe-code/
* Simon’s talk with memory-gone-wrong: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
Full Video Episode Timestamps 00:00 Introduction and Background 00:53 The Rise of Context Engineering 01:57 Context Engineering vs Prompt Engineering 05:56 The Five Categories of Context Engineering 10:02 Multi-Agent Systems and Context Isolation 14:48 Classical Retrieval vs Agentic Search 17:12 LLMs.txt and MCP Servers 24:51 Context Pruning and Memory Management 37:25 Memory Systems and Human-in-the-Loop 42:55 The Bitter Lesson Applied to AI Engineering 51:21 Frameworks, Abstractions, and Building for the Future
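One concrete context-engineering tactic from this discussion is pruning: keep the system prompt and the freshest turns, then fill the remaining budget from newest to oldest. A small sketch, using whitespace token counts as a stand-in for a real tokenizer:

```python
# Sketch of context pruning: pin the system prompt and the most recent
# turns, then fill the remaining budget newest-first. Whitespace token
# counts stand in for a real tokenizer.
def ntokens(msg: dict) -> int:
    return len(msg["content"].split())

def prune_context(messages: list, budget: int, keep_recent: int = 4) -> list:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    pinned, middle = rest[-keep_recent:], rest[:-keep_recent]
    used = sum(map(ntokens, system + pinned))
    kept = []
    for m in reversed(middle):                   # newest middle turns first
        if used + ntokens(m) <= budget:
            used += ntokens(m)
            kept.append(m)
    return system + kept[::-1] + pinned          # restore chronological order
```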
Our chat with Ari shows that data curation is the most impactful and underinvested area in AI. He argues that the prevailing focus on model architecture and compute scaling overlooks the “bitter lesson” that “models are what they eat.” Effective data curation—a sophisticated process involving filtering, rebalancing, sequencing (curriculum), and synthetic data generation—allows for training models that are simultaneously faster, better, and smaller. Morcos recounts his personal journey from focusing on model-centric inductive biases to realizing that data quality is the primary lever for breaking the diminishing returns of naive scaling laws. Datology’s mission is to automate this complex curation process, making state-of-the-art data accessible to any organization and enabling a new paradigm of AI development where data efficiency, not just raw scale, drives progress. Full Video Episode Timestamps 00:00 Introduction 00:46 What is Datology? The mission to train models faster, better, and smaller through data curation. 01:59 Ari’s background: From neuroscience to realizing the “Bitter Lesson” of AI. 05:30 Key Insight: Inductive biases from architecture become less important and even harmful as data scale increases. 08:08 Thesis: Data is the most underinvested area of AI research relative to its impact. 10:15 Why data work is culturally undervalued in research and industry. 12:19 How self-supervised learning changed everything, moving from a data-scarce to a data-abundant regime. 17:05 Why automated curation is superior to human-in-the-loop, citing the DCLM study. 19:22 The “Elephants vs. Dogs” analogy for managing data redundancy and complexity. 22:46 A brief history and commentary on key datasets (Common Crawl, GitHub, Books3). 26:24 Breaking naive scaling laws by improving data quality to maintain high marginal information gain. 29:07 Datology’s demonstrated impact: Achieving baseline performance 12x faster. 34:19 The business of data: Datology’s moat and its relationship with open-source datasets. 39:12 Synthetic Data Explained: The difference between risky “net-new” creation and powerful “rephrasing.” 49:02 The Resurgence of Curriculum Learning: Why ordering data matters in the underfitting regime. 52:55 The Future of Training: Optimizing pre-training data to make post-training more effective. 54:49 Who is training their own models and why (Sovereign AI, large enterprises). 57:24 “Train Smaller”: Why inference cost makes smaller, specialized models the ultimate goal for enterprises. 01:00:19 The problem with model pruning and why data-side solutions are complementary. 01:03:03 On finding the smallest possible model for a given capability. 01:06:49 Key learnings from the RC foundation model collaboration, proving that data curation “stacks.” 01:09:46 Lightning Round: What data everyone wants & who should work at Datology. 01:14:24 Commentary on Meta’s superintelligence efforts and Yann LeCun’s role. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
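The “breaking naive scaling laws” point has a compact reading if you write the usual data scaling law down. As a rough gloss (ours, not a formula from the episode), pretraining loss under a fixed data distribution follows a power law in dataset size:

\[ L(D) \approx L_{\infty} + \frac{A}{D^{\alpha}} \]

with irreducible loss \(L_{\infty}\) and constants \(A, \alpha\) set by the data distribution. The marginal gain per additional token, \(A\alpha D^{-\alpha-1}\), decays quickly, which is the diminishing-returns wall of naive scaling. Curation changes the distribution itself, effectively lowering \(A\) or raising \(\alpha\), so the same loss is reached with a fraction of the tokens; that is one way to read the “baseline performance 12x faster” claim above.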
We first had Nathan on to give us his RLHF deep dive when he was joining AI2, and now he’s back to help us catch up on the evolution to RLVR (Reinforcement Learning with Verifiable Rewards), first proposed in his Tulu 3 paper. While RLHF remains foundational, RLVR has emerged as a powerful approach for training models on tasks with clear success criteria, using verifiable, objective functions as reward signals—particularly useful in domains like math, code correctness, and instruction-following. Instead of relying solely on subjective human feedback, RLVR leverages deterministic signals to guide optimization, making it more scalable and potentially more reliable across many domains. However, he notes that RLVR is still rapidly evolving, especially regarding how it handles tool use and multi-step reasoning. We also discussed the Tulu model series, a family of instruction-tuned open models developed at AI2. Tulu is designed to be a reproducible, state-of-the-art post-training recipe for the open community. Unlike frontier labs like OpenAI or Anthropic, which rely on vast and often proprietary datasets, Tulu aims to distill and democratize best practices for instruction and preference tuning. We are impressed with how small eval suites, careful task selection, and transparent methodology can rival even the best proprietary models on specific benchmarks. One of the most fascinating threads is the challenge of incorporating tool use into RL frameworks. Lambert highlights that while you can prompt a model to use tools like search or code execution, getting the model to reliably learn when and how to use them through RL is much harder. This is compounded by the difficulty of designing reward functions that avoid overoptimization—where models learn to “game” the reward signal rather than solve the underlying task. This is particularly problematic in code generation, where models might reward hack unit tests by inserting pass statements instead of correct logic (see the sketch after the timestamps below). As models become more agentic and are expected to plan, retrieve, and act across multiple tools, reward design becomes a critical bottleneck. Other topics covered:
- The evolution from RLHF (Reinforcement Learning from Human Feedback) to RLVR (Reinforcement Learning with Verifiable Rewards)
- The goals and technical architecture of the Tulu models, including the motivation to open-source post-training recipes
- Challenges of tool use in RL: verifiability, reward design, and scaling across domains
- Evaluation frameworks and the role of platforms like Chatbot Arena and emerging “arena”-style benchmarks
- The strategic tension between hybrid reasoning models and unified reasoning models at the frontier
- Planning, abstraction, and calibration in reasoning agents and why these concepts matter
- The future of open-source AI models, including DeepSeek, OLMo, and the potential for an “American DeepSeek”
- The importance of model personality, character tuning, and the model spec paradigm
- Overoptimization in RL settings and how it manifests in different domains (control tasks, code, math)
- Industry trends in inference-time scaling and model parallelism
Finally, the episode closes with a vision for the future of open-source AI. Nathan has now written up his ambition to build an “American DeepSeek”—a fully open, end-to-end reasoning-capable model with transparent training data, tools, and infrastructure.
He emphasizes that open-source AI is not just about weights; it’s about releasing recipes, evaluations, and methods that lower the barrier for everyone to build and understand cutting-edge systems. Full Video Episode Timestamps 00:00 Welcome and Guest Introduction 01:18 Tulu, OVR, and the RLVR Journey 03:40 Industry Approaches to Post-Training and Preference Data 06:08 Understanding RLVR and Its Impact 06:18 Agents, Tool Use, and Training Environments 10:34 Open Data, Human Feedback, and Benchmarking 12:44 Chatbot Arena, Sycophancy, and Evaluation Platforms 15:42 RLHF vs RLVR: Books, Algorithms, and Future Directions 17:54 Frontier Models: Reasoning, Hybrid Models, and Data 22:11 Search, Retrieval, and Emerging Model Capabilities 29:23 Tool Use, Curriculum, and Model Training Challenges 38:06 Skills, Planning, and Abstraction in Agent Models 46:50 Parallelism, Verifiers, and Scaling Approaches 54:33 Overoptimization and Reward Design in RL 1:02:27 Open Models, Personalization, and the Model Spec 1:06:50 Open Model Ecosystem and Infrastructure 1:13:05 Meta, Hardware, and the Future of AI Competition 1:15:42 Building an Open DeepSeek and Closing Thoughts This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
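To make the reward-hacking failure mode above concrete, here is a minimal sketch of a verifiable reward for code generation (ours for illustration, not the Tulu training code). The naive version trusts whatever tests the policy produces, so a model that emits `pass` bodies alongside vacuous tests still collects full reward; the hardened version keeps the test cases on the verifier's side:

```python
# Illustrative sketch of RLVR-style verifiable rewards for code generation.
# Not from the Tulu codebase; function names are our own.

def naive_reward(solution_code: str, test_code: str) -> float:
    """Reward 1.0 if the model's own tests pass. Reward-hackable:
    a solution full of `pass` statements plus tests that assert
    nothing will still score 1.0."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate solution
        exec(test_code, namespace)      # run the model-supplied tests
        return 1.0
    except Exception:
        return 0.0

def hardened_reward(solution_code: str, cases: list[tuple]) -> float:
    """The verifier owns the (input, expected_output) cases, so the
    policy cannot weaken them; reward is the fraction of cases passed."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)
        solve = namespace["solve"]      # assumed entry-point name
    except Exception:
        return 0.0

    def ok(inp, expected):
        try:
            return solve(*inp) == expected
        except Exception:
            return False

    return sum(ok(i, o) for i, o in cases) / len(cases)
```

The deterministic, execution-based signal is what makes the reward “verifiable”; as the episode stresses, most of the remaining difficulty lives in designing the cases so they cannot be gamed.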
ChatGPT handles 2.5B prompts/day and is on track to match Google’s daily searches by end of 2026. AI agents don’t browse like us—they crave queryable, chunkable data for tools like ChatGPT & Perplexity. A new industry is being born, some calling it AI SEO, others GEO, and whatever the name, it converts: businesses are seeing 2-4x higher conversion from visitors arriving via AI compared to traditional search. Robert McCloy is the co-founder of Scrunch AI (https://scrunchai.com/), a fast-growing company that helps brands and businesses rewrite their content on the fly based on what agents are looking for. Full Video Episode Timestamps 00:00 Intro & Guest Introduction 01:30 The Genesis of Scrunch AI & AI Search Impact 06:02 AI Search Engines vs. Traditional SEO 06:28 Monitoring Prompts & The AI Search Stack 08:26 AI Training Data, Crawlers, and Content Strategy 12:33 AI Browsers and the Future of Web Consumption 16:06 Technical Mechanisms of AI Search & SEO Relevance 28:44 Personalization, Agent Experience, and Customer Journeys 30:44 Prompt Clusters, User Intent, and B2B Buying Patterns 36:06 Optimization Tactics: Prompt Injection, Content, and Pitfalls 40:37 Technical Content Delivery: JavaScript, Programmatic SEO, and llms.txt 47:31 Case Studies & Conversion Optimization 51:36 Market Share & Platform Trends in AI Search 55:10 Wrap-Up & Future of AI-Driven Web This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
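One concrete way sites serve agents the “queryable, chunkable data” they want is the llms.txt convention the episode touches on: a markdown file at the site root that gives crawlers a curated, plain-text map of the content. A minimal sketch following the llmstxt.org proposal (the site, paths, and descriptions here are made up):

```markdown
# Example Co

> Example Co sells coworking space management software.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): set up a workspace in five minutes
- [API Reference](https://example.com/docs/api.md): REST endpoints for bookings and billing

## Optional

- [Blog](https://example.com/blog/index.md): product announcements and case studies
```

The format is deliberately boring: an H1 title, a one-line blockquote summary, then H2 sections of annotated links, so an agent can decide what to fetch without rendering any JavaScript.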
Saoud Rizwan and Pash from Cline joined us to talk about why fast apply models got bitter lesson’d, how they pioneered the plan + act paradigm for coding, and why non-technical people use IDEs to do marketing and generate slides. Full writeup: https://www.latent.space/p/cline X: https://x.com/latentspacepod Full Video Episode Timestamps 00:00 - Introductions 01:35 - Plan and Act Paradigm 05:37 - Model Evaluation and Early Development of Cline 08:14 - Use Cases of Cline Beyond Coding 09:09 - Why Cline is a VS Code Extension and Not a Fork 12:07 - Economic Value of Programming Agents 16:07 - Early Adoption for MCPs 19:35 - Local vs Remote MCP Servers 22:10 - Anthropic’s Role in MCP Registry 22:49 - Most Popular MCPs and Their Use Cases 25:26 - Challenges and Future of MCP Monetization 27:32 - Security and Trust Issues with MCPs 28:56 - Alternative History Without MCP 29:43 - Market Positioning of Coding Agents and IDE Integration Matrix 32:57 - Visibility and Autonomy in Coding Agents 35:21 - Evolving Definition of Complexity in Programming Tasks 38:16 - Forks of Cline and Open Source Regrets 40:07 - Simplicity vs Complexity in Agent Design 46:33 - How Fast Apply Got Bitter Lesson’d 49:12 - Cline’s Business Model and Bring-Your-Own-API-Key Approach 54:18 - Integration with OpenRouter and Enterprise Infrastructure 55:32 - Impact of Declining Model Costs 57:48 - Background Agents and Multi-Agent Systems 1:00:42 - Vision and Multi-Modalities 1:01:07 - State of Context Engineering 1:07:37 - Memory Systems in Coding Agents 1:10:14 - Standardizing Rules Files Across Agent Tools 1:11:16 - Cline’s Personality and Anthropomorphization 1:12:55 - Hiring at Cline and Team Culture This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Speak (https://speak.com) may not be very well known to native English speakers, but they have gone from a slow start in 2016 to emerge as one of OpenAI’s favorite partners: the OpenAI Startup Fund led and joined their Series B and C as Speak became one of the new AI-native unicorns, with OpenAI noting that “Speak has the potential to revolutionize not just language learning, but education broadly”. Today we speak with Speak’s CTO, Andrew Hsu, on the journey of building the “3rd generation” of language learning software (with Rosetta Stone being Gen 1, and Duolingo being Gen 2). Speak’s premise is that speech and language models can now do what was previously only possible with human tutors—provide fluent, responsive, and adaptive instruction—and this belief has shaped its product and company strategy since its early days. https://www.linkedin.com/in/adhsu/ https://speak.com One of the most interesting strategic decisions discussed in the episode is Speak’s early focus on South Korea. While counterintuitive for a San Francisco-based startup, the decision was influenced by a combination of market opportunity and founder proximity to the market via their first employee, who was Korean. South Korea’s intense demand for English fluency and a highly competitive education market made it a proving ground for a deeply AI-native product. By succeeding in a market saturated with human-based education solutions, Speak validated its model and built strong product-market fit before expanding to other Asian markets and eventually, globally. The arrival of Whisper and GPT-based LLMs in 2022 marked a turning point for Speak. Suddenly, capabilities that were once theoretical—real-time feedback, semantic understanding, conversational memory—became technically feasible. Speak didn’t pivot, but rather evolved into its second phase: from a supplemental practice tool to a full-featured language tutor. This transition required significant engineering work, including building custom ASR models, managing latency, and integrating real-time APIs for interactive lessons. It also unlocked the possibility of developing voice-first, immersive roleplay experiences and a roadmap to real-time conversational fluency. To scale globally and support many languages, Speak is investing heavily in AI-generated curriculum and content. Instead of manually scripting all lessons, they are building agents and pipelines that can scaffold curriculum, generate lesson content, and adapt pedagogically to the learner. This ties into one of Speak’s most ambitious goals: creating a knowledge graph that captures what a learner knows and can do in a target language, and then adapting the course path accordingly (a toy sketch of the idea follows below). This level-adjusting tutor model aims to personalize learning at scale and could eventually be applied beyond language learning to any educational domain. Finally, the conversation touches on the broader implications of AI-powered education and the slow real-world adoption of transformative AI technologies. Despite the capabilities of GPT-4 and others, most people’s daily lives haven’t changed dramatically. Speak sees itself as part of the generation of startups that will translate AI’s raw power into tangible consumer value. The company is also a testament to long-term conviction—founded in 2016, it weathered years of slow growth before AI caught up to its vision. Now, with over $50M ARR, a growing B2B arm, and plans to expand across languages and learning domains, Speak represents what AI-native education could look like in the next decade.
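The episode doesn't detail how Speak implements its learner knowledge graph, so the following is a deliberately toy sketch of the general idea, with every name hypothetical: skills form a prerequisite graph, mastery estimates sit on the nodes, and the next lesson targets the weakest skill whose prerequisites are already met.

```python
# Toy learner knowledge graph for adaptive curriculum sequencing.
# Hypothetical structure for illustration -- not Speak's actual system.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    prerequisites: list[str] = field(default_factory=list)
    mastery: float = 0.0  # estimated, 0.0 (unknown) .. 1.0 (fluent)

def next_lesson(skills: dict[str, Skill], threshold: float = 0.8) -> str | None:
    """Return the weakest not-yet-mastered skill whose prerequisites are met."""
    ready = [
        s for s in skills.values()
        if s.mastery < threshold
        and all(skills[p].mastery >= threshold for p in s.prerequisites)
    ]
    return min(ready, key=lambda s: s.mastery).name if ready else None

skills = {
    "present_tense": Skill("present_tense", mastery=0.9),
    "past_tense": Skill("past_tense", ["present_tense"], mastery=0.3),
    "conditionals": Skill("conditionals", ["past_tense"]),
}
print(next_lesson(skills))  # -> "past_tense"
```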
Full Video Episode Timestamps 00:00 Introductions & Thiel Fellowship Origins 02:13 Genesis of Speak: Early Vision & Market Focus 03:44 Building the Product: Iterations and Lessons Learned 10:59 AI’s Role in Language Learning 13:49 Scaling Globally & B2B Expansion 16:30 Why Korea? Localizing for Success 19:08 Content Creation, The Speak Method, and Engineering Culture 23:31 The Impact of Whisper and LLM Advances 29:08 AI-Generated Content & Measuring Fluency 35:30 Personalization, Dialects, and Pronunciation 39:38 Immersive Learning, Multimodality, and Real-Time Voice 50:02 Engineering Challenges & Company Culture 53:20 Beyond Languages: B2B, Knowledge Graphs, and Broader Learning 57:32 Fun Stories, Lessons, and Reflections 1:02:03 Final Thoughts: The Future of AI Learning & Slow Takeoff This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
When the first video diffusion models started emerging, they were little more than just “moving pictures” - still frames extended a few seconds in either direction in time. There was a ton of excitement about OpenAI’s Sora on release through 2024, but so far only Sora-lite has been widely released. Meanwhile, other good videogen models like Genmo Mochi, Pika, MiniMax T2V, Tencent Hunyuan Video, and Kuaishou’s Kling have emerged, but the reigning king this year seems to be Google’s Veo 3, which for the first time has added native audio generation into their model capabilities, eliminating the need for a whole class of lip-syncing tooling and SFX editing. The rise of Veo 3 unlocks a whole new category of AI Video creators that many of our audience may not have been exposed to, but which is undeniably effective and important, particularly in the “kids” and “brainrot” segments of global consumer internet platforms like TikTok, YouTube and Instagram. By far the best documentarians of these trends for laypeople are Olivia and Justine Moore, both partners at a16z, who not only collate the best examples from all over the web, but dabble in video creation themselves to put theory into practice. We’ve been thinking of dabbling in AI brainrot on a secondary channel for Latent Space, so we wanted to get the braindump from the Moore twins on how to make a Latent Space Brainrot channel. Jump on in! Full Video Episode Timestamps 00:00 Introductions & Guest Welcome 00:49 The Rise of Generative Media 02:24 AI Video Trends: Italian Brain Rot & Viral Characters 05:00 Following Trends & Creating AI Content 07:17 Hands-On with AI Video Creation 18:36 Monetization & Business of AI Content 23:34 Platforms, Models, and the Creator Stack 37:22 Native Content vs. Clipping & Going Viral 41:52 Prompt Theory & Meta-Trends in AI Creativity 47:42 Professional, Commercial, and Platform-Specific AI Video 48:57 Wrap-Up & Final Thoughts This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Our last AI PhD grad student feature was Shunyu Yao, who happened to focus on Language Agents for his thesis and immediately went to work on them for OpenAI. Our pick this year is Jack Morris, who bucks the “hot” trends by -not- working on agents, benchmarks, or VS Code forks, but is rather known for his work on the information-theoretic understanding of LLMs, starting from embedding models and latent space representations (always close to our heart). Jack is an unusual combination: he does underrated research but can somehow still explain it well to a mass audience, so we felt this was a good opportunity to do a different kind of episode going through the greatest hits of a high-profile AI PhD, and relate them to questions from AI Engineering. Papers and References made * AI grad school: * A new type of information theory: * Embeddings * Text Embeddings Reveal (Almost) As Much As Text: https://arxiv.org/abs/2310.06816 * Contextual document embeddings: https://arxiv.org/abs/2410.02525 * Harnessing the Universal Geometry of Embeddings: https://arxiv.org/abs/2505.12540 * Language models * GPT-style language models memorize 3.6 bits per param: * Approximating Language Model Training Data from Weights: https://arxiv.org/abs/2506.15553 * LLM Inversion * “There Are No New Ideas In AI.... Only New Datasets” * misc reference: https://junyanz.github.io/CycleGAN/ — for others hiring AI PhDs, Jack also wanted to shout out Zach Nussbaum, his coauthor on Nomic Embed: Training a Reproducible Long Context Text Embedder. Full Video Episode Timestamps
00:00 Introduction to Jack Morris
01:18 Career in AI
03:29 The Shift to AI Companies
03:57 The Impact of ChatGPT
04:26 The Role of Academia in AI
05:49 The Emergence of Reasoning Models
07:07 Challenges in Academia: GPUs and HPC Training
11:04 The Value of GPU Knowledge
14:24 Introduction to Jack's Research
15:28 Information Theory
17:10 Understanding Deep Learning Systems
19:00 The "Bit" in Deep Learning
20:25 Wikipedia and Information Storage
23:50 Text Embeddings and Information Compression
27:08 The Research Journey of Embedding Inversion
31:22 Harnessing the Universal Geometry of Embeddings
34:54 Implications of Embedding Inversion
36:02 Limitations of Embedding Inversion
38:08 The Capacity of Language Models
40:23 The Cognitive Core and Model Efficiency
50:40 The Future of AI and Model Scaling
52:47 Approximating Language Model Training Data from Weights
01:06:50 The "No New Ideas, Only New Datasets" Thesis
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
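The 3.6-bits-per-parameter memorization figure is easier to feel with quick arithmetic (our illustration; the paper reports the per-parameter rate, not these totals):

```python
# Back-of-the-envelope capacity implied by ~3.6 bits memorized per parameter.
BITS_PER_PARAM = 3.6

for params in (1e9, 8e9, 70e9):
    gigabytes = params * BITS_PER_PARAM / 8 / 1e9  # bits -> bytes -> GB
    print(f"{params/1e9:.0f}B params -> ~{gigabytes:.1f} GB of memorized content")
# 1B -> ~0.5 GB, 8B -> ~3.6 GB, 70B -> ~31.5 GB
```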
Solving Poker and Diplomacy, Debating RL+Reasoning with Ilya, what’s *wrong* with the System 1/2 analogy, and where Test-Time Compute hits a wall Full Video Episode Timestamps 00:00 Intro – Diplomacy, Cicero & World Championship 02:00 Reverse Centaur: How AI Improved Noam’s Human Play 05:00 Turing Test Failures in Chat: Hallucinations & Steerability 07:30 Reasoning Models & Fast vs. Slow Thinking Paradigm 11:00 System 1 vs. System 2 in Visual Tasks (GeoGuessr, Tic-Tac-Toe) 14:00 The Deep Research Existence Proof for Unverifiable Domains 17:30 Harnesses, Tool Use, and Fragility in AI Agents 21:00 The Case Against Over-Reliance on Scaffolds and Routers 24:00 Reinforcement Fine-Tuning and Long-Term Model Adaptability 28:00 Ilya’s Bet on Reasoning and the O-Series Breakthrough 34:00 Noam’s Dev Stack: Codex, Windsurf & AGI Moments 38:00 Building Better AI Developers: Memory, Reuse, and PR Reviews 41:00 Multi-Agent Intelligence and the “AI Civilization” Hypothesis 44:30 Implicit World Models and Theory of Mind Through Scaling 48:00 Why Self-Play Breaks Down Beyond Go and Chess 54:00 Designing Better Benchmarks for Fuzzy Tasks 57:30 The Real Limits of Test-Time Compute: Cost vs. Time 1:00:30 Data Efficiency Gaps Between Humans and LLMs 1:03:00 Training Pipeline: Pretraining, Midtraining, Posttraining 1:05:00 Games as Research Proving Grounds: Poker, MTG, Stratego 1:10:00 Closing Thoughts – Five-Year View and Open Research Directions This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Emmanuel Amiesen is lead author of “Circuit Tracing: Revealing Computational Graphs in Language Models” (https://transformer-circuits.pub/2025/attribution-graphs/methods.html), which is part of a duo of MechInterp papers that Anthropic published in March (alongside https://transformer-circuits.pub/2025/attribution-graphs/biology.html). We recorded the initial conversation a month ago, but then held off publishing until the open source tooling for the graph generation discussed in this work was released last week: https://www.anthropic.com/research/open-source-circuit-tracing This is a 2 part episode - an intro covering the open source release, then a deeper dive into the paper — with guest host Vibhu Sapra (https://x.com/vibhuuuus) and Mochi the MechInterp Pomsky (https://x.com/mochipomsky). Thanks to Vibhu for making this episode happen! While the original blogpost contained some fantastic guided visualizations (which we discuss at the end of this pod!), with the notebook and Neuronpedia visualization (https://www.neuronpedia.org/gemma-2-2b/graph) released this week, you can now explore on your own with Neuronpedia, as we show you in the video version of this pod. Full Video Episode Timestamps
00:00 Intro & Guest Introductions
01:00 Anthropic's Circuit Tracing Release
06:11 Exploring Circuit Tracing Tools & Demos
13:01 Model Behaviors and User Experiments
17:02 Behind the Research: Team and Community
24:19 Main Episode Start: Mech Interp Backgrounds
25:56 Getting Into Mech Interp Research
31:52 History and Foundations of Mech Interp
37:05 Core Concepts: Superposition & Features
39:54 Applications & Interventions in Models
45:59 Challenges & Open Questions in Interpretability
57:15 Understanding Model Mechanisms: Circuits & Reasoning
01:04:24 Model Planning, Reasoning, and Attribution Graphs
01:30:52 Faithfulness, Deception, and Parallel Circuits
01:40:16 Publishing Risks, Open Research, and Visualization
01:49:33 Barriers, Vision, and Call to Action
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Solomon most famously created Docker and now runs Dagger… which has something special to share with you on Thursday. Catch Dagger at:
- Tuesday: Dagger’s workshop: https://www.ai.engineer/schedule#ship-agents-that-ship-a-hands-on-workshop-for-swe-agent-builders
- Wednesday: Dagger’s talk: https://www.ai.engineer/schedule#how-to-trust-an-agent-with-software-delivery
- Thursday: Solomon’s Keynote: https://www.ai.engineer/schedule#containing-agent-chaos
Full Video Episode Timestamps
00:00 Introduction & Guest Background
00:29 What is Dagger? Post-Development Automation
01:08 Dagger’s Community & Platform Engineers
02:32 AI Agents and Developer Workflows
03:40 Environment Isolation & The Power of Containers
06:28 The Need for Standards in Agent Environments
07:25 Design Constraints & Challenges for Dev Environments
11:26 Limitations of Current Tools & Agent-Native UX
14:11 Modularity, Customization, and the Lego Analogy
16:24 Convergence of CICD and Agentic Systems
17:41 Ephemeral Apps, Resource Constraints, and Local Execution
21:01 Adoption, Ecosystem, and the Role of Open Source
23:30 Dagger’s Modular Approach & Integration Philosophy
25:38 Looking Ahead: Workshops, Keynotes, and the Future of Agentic Infrastructure
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
As part of our AI Engineer World’s Fair preview, we’re releasing a special crossover episode recorded with Sam Charrington of TWiML AI at last week’s Google I/O! TUESDAY: Shrestha and Kwindla’s workshop: https://www.ai.engineer/schedule#milliseconds-to-magic-real-time-workflows-using-the-gemini-live-api-and-pipecat TUESDAY: Kwindla’s workshop: https://www.ai.engineer/schedule#building-voice-agents-with-gemini-and-pipecat WEDNESDAY: Shrestha and Kwindla’s talk: https://www.ai.engineer/schedule#milliseconds-to-magic-real-time-workflows-using-the-gemini-live-api-and-pipecat WEDNESDAY: Kwindla’s keynote: https://www.ai.engineer/schedule#-voice-keynote-your-realtime-ai-is-ngmi THURSDAY: Logan’s keynote: https://www.ai.engineer/schedule#a-year-of-gemini-progress-what-comes-next Catch all the speakers at AIE (both workshops and talks): Logan Kilpatrick: https://www.latent.space/p/chatgpt-gpt4-hype-and-building-llm Shrestha Basu Mallick: https://www.linkedin.com/in/shresthabm/ Kwindla Hultman Kramer: https://www.linkedin.com/in/kwkramer Full Video Episode This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
One of the new tracks at next week’s AI Engineer conference in SF is a new focus on LLMs + Robotics, ft. household names like Waymo and Physical Intelligence. However there are many other companies applying LLMs and VLMs in the real world! CloudChef, the first industrial-scale kitchen robotics company with one-shot demonstration learning and an incredibly simple business model, will be serving tasty treats all day with Zippy (https://www.cloudchef.co/zippy) their AI Chef platform. This is a lightning pod with CEO Nikhil Abraham to preview what Zippy is capable of! https://www.cloudchef.co/platform See a real chef comparison: See it in the AI Engineer Expo at SF next week: https://ai.engineer Full Video Episode Timestamps
00:00 Welcome and Introductions
00:58 What is Cloud Chef?
01:36 How the Robots Work: Culinary Intelligence
05:57 Commercial Applications and Early Success
07:02 The Software-First Approach
10:09 Business Model and Pricing
13:10 Demonstration Learning: Training the Robots
16:03 Call to Action and Engineering Opportunities
18:45 Final Thoughts and Technical Details
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
We are joined by Eno Reyes and Matan Grinberg, the co-founders of Factory.ai. They are building droids for autonomous software engineering, handling everything from code generation to incident response for production outages. After raising a $15M Series A from Sequoia, they just released their product in GA! https://factory.ai/ https://x.com/latentspacepod Full Video Episode Timestamps 00:00 Introductions 00:35 Meeting at Langchain Hackathon 04:02 Building Factory despite early model limitations 06:56 What is Factory AI? 08:55 Delegation vs Collaboration in AI Development Tools 10:06 Naming Origins of 'Factory' and 'Droids' 12:17 Defining Droids: Agent vs Workflow 14:34 Live Demo 17:37 Enterprise Context and Tool Integration in Droids 20:26 Prompting, Clarification, and Agent Communication 22:28 Project Understanding and Proactive Context Gathering 24:10 Why SWE-Bench Is Dead 28:47 Model Fine-tuning and Generalization Challenges 31:07 Why Factory is Browser-Based, Not IDE-Based 33:51 Test-Driven Development and Agent Verification 36:17 Retrieval vs Large Context Windows for Cost Efficiency 38:02 Enterprise Metrics: Code Churn and ROI 40:48 Executing Large Refactors and Migrations with Droids 45:25 Model Speed, Parallelism, and Delegation Bottlenecks 50:11 Observability Challenges and Semantic Telemetry 53:44 Hiring 55:19 Factory's design and branding approach 58:34 Closing Thoughts and Future of AI-Native Development This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
In an otherwise heavy week packed with Microsoft Build, Google I/O, and OpenAI io, the worst-kept secret in biglab land was the launch of Claude 4, particularly the triumphant return of Opus, which many had been clamoring for. We will leave the specific Claude 4 recap to AINews; however, we think that both Gemini’s progress on Deep Think this week and Claude 4 represent the next frontier of progress on inference time compute/reasoning (at least until GPT-5 ships this summer). Will Brown’s talk at AIE NYC and open source work on verifiers have made him one of the most prominent voices able to publicly discuss (aka without the vaguepoasting LoRA they put on you when you join a biglab) the current state of the art in reasoning models and where current SOTA research directions lead. We discussed his latest paper on Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment and he has previewed his AIEWF talk on Agentic RL for those with the temerity to power thru bad meetup audio. Full Video Episode Timestamps
00:00 Introduction to the Podcast and Guests
01:00 Discussion on Claude 4 and AI Models
03:07 Extended Thinking and Tool Use in AI
06:47 Technical Highlights and Model Trustworthiness
10:31 Thinking Budgets and Their Implications
13:38 Controversy Surrounding Opus and AI Ethics
18:49 Reflections on AI Tools and Their Limitations
21:58 The Chaos of Predictive Systems
22:56 Marketing and Safety in AI Models
24:30 Evaluating AI Companies and Their Strategies
25:53 The Role of Academia in AI Evaluations
27:43 Teaching Taste in Research
28:41 Making Educated Bets in AI Research
30:12 Recent Developments in Multi-Turn Tool Use
32:50 Incentivizing Tool Use in AI Models
34:45 The Future of Reward Models in AI
39:10 Exploring Flexible Reward Systems
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Note from your hosts: we were off this week for ICLR and RSA! This week we’re bringing you one of the top episodes from our lightning podcast series, the shorter-format, YouTube-only side podcast we do for breaking news and faster turnaround. Please support our work on YouTube! https://www.youtube.com/playlist?list=PLWEAb1SXhjlc5qgVK4NgehdCzMYCwZtiB The explosion of embedding-based applications created a new challenge: efficiently storing, indexing, and searching these high-dimensional vectors at scale. This gap gave rise to the vector database category, with companies like Pinecone leading the charge in 2022-2023 by defining specialized infrastructure for vector operations. The category saw explosive growth following ChatGPT’s launch in late 2022, as developers rushed to build AI applications using Retrieval-Augmented Generation (RAG). This surge was partly driven by a widespread misconception that embedding-based similarity search was the only viable method for retrieving context for LLMs! The resulting “vector database gold rush” saw massive investment and attention directed toward vector search infrastructure, even though traditional information retrieval techniques remained equally valuable for many RAG applications. Full Video Episode Timestamps
00:00 Introduction to Trondheim and Background
03:03 The Rise and Fall of Vector Databases
06:08 Convergence of Search Technologies
09:04 Embeddings and Their Importance
12:03 Building Effective Search Systems
15:00 RAG Applications and Recommendations
17:55 The Role of Knowledge Graphs
20:49 Future of Embedding Models and Innovations
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
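The episode's point about convergence of search technologies, that classical retrieval and embeddings complement rather than compete, is easy to demonstrate: a hybrid retriever that blends lexical overlap with vector similarity often beats either signal alone. A minimal self-contained sketch (toy scoring functions, not any particular vector database's API; real systems would use BM25 and an ANN index):

```python
# Toy hybrid retrieval: blend lexical overlap with embedding cosine similarity.
# Illustrative only; production systems use BM25 and approximate nearest neighbors.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lexical(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query: str, query_vec: list[float], docs, alpha=0.5, k=3):
    """docs: iterable of (text, embedding); alpha weights vector vs lexical."""
    scored = sorted(
        ((alpha * cosine(query_vec, vec) + (1 - alpha) * lexical(query, text), text)
         for text, vec in docs),
        reverse=True,
    )
    return scored[:k]
```

The blend captures exactly the episode's argument: exact-term matches that embeddings smear out (part numbers, names, error codes) are rescued by the lexical term, while paraphrases are rescued by the vector term.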
We’ll keep this brief because we’re on a tight turnaround: GPT 4.1, previously known as the Quasar and Optimus models, is now live as the natural update for 4o/4o-mini (and the research preview of GPT 4.5). Though it is a general purpose model family, the headline features are:
* Coding abilities (o1-level SWEBench and SWELancer, but ok Aider)
* Instruction Following (with a very notable prompting guide)
* Long Context up to 1m tokens (with new MRCR and Graphwalk benchmarks)
* Vision (simply o1 level)
* Cheaper Pricing (cheaper than 4o, greatly improved prompt caching savings)
We caught up with returning guest Michelle Pokrass and Josh McGrath to get more detail on each! Full Video Episode Timestamps
Part 1
00:00:00 Introduction and Guest Welcome
00:00:57 GPT 4.1 Launch Overview
00:01:54 Developer Feedback and Model Names
00:02:53 Model Naming and Starry Themes
00:03:49 Confusion Over GPT 4.1 vs 4.5
00:04:47 Distillation and Model Improvements
00:05:45 Omnimodel Architecture and Future Plans
00:06:43 Core Capabilities of GPT 4.1
00:07:40 Training Techniques and Long Context
00:08:37 Challenges in Long Context Reasoning
00:09:34 Context Utilization in Models
Part 2
00:10:31 Graph Walks and Model Evaluation
00:11:31 Real Life Applications of Graph Tasks
00:12:30 Multi-Hop Reasoning Benchmarks
00:13:30 Agentic Workflows and Backtracking
00:14:28 Graph Traversals for Agent Planning
00:15:24 Context Usage in API and Memory Systems
00:16:21 Model Performance in Long Context Tasks
00:17:17 Instruction Following and Real World Data
00:18:12 Challenges in Grading Instructions
00:19:09 Instruction Following Techniques
00:20:09 Prompting Techniques and Model Responses
00:21:05 Agentic Workflows and Model Persistence
Part 3
00:22:01 Balancing Persistence and User Control
00:22:56 Evaluations on Model Edits and Persistence
00:23:55 XML vs JSON in Prompting
00:24:50 Instruction Placement in Context
00:25:49 Optimizing for Prompt Caching
00:26:49 Chain of Thought and Reasoning Models
00:27:46 Choosing the Right Model for Your Task
00:28:46 Coding Capabilities of GPT 4.1
00:29:41 Model Performance in Coding Tasks
00:30:39 Understanding Coding Model Differences
00:31:36 Using Smaller Models for Coding
00:32:33 Future of Coding in OpenAI
Part 4
00:33:28 Internal Use and Success Stories
00:34:26 Vision and Multi-Modal Capabilities
00:35:25 Screen vs Embodied Vision
00:36:22 Vision Benchmarks and Model Improvements
00:37:19 Model Deprecation and GPU Usage
00:38:13 Fine-Tuning and Preference Steering
00:39:12 Upcoming Reasoning Models
00:40:10 Creative Writing and Model Humor
00:41:07 Feedback and Developer Community
00:42:03 Pricing and Blended Model Costs
00:44:02 Conclusion and Wrap-Up
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
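Two of the practical threads in the conversation (instruction placement and prompt caching) reinforce each other: OpenAI's prompt caching matches on the prompt prefix, so putting long, unchanging instructions first and per-request content last maximizes cache hits and the improved caching savings mentioned above. A minimal sketch with the standard OpenAI Python SDK (the instruction text is our placeholder):

```python
# Sketch: order messages so the static prefix stays cache-friendly.
# Standard OpenAI Python SDK; the instruction text is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STATIC_INSTRUCTIONS = "You are a coding assistant. Follow the house style guide..."
# Long, unchanging text goes first so repeated calls can share a cached prefix.

def ask(user_content: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # stable prefix
            {"role": "user", "content": user_content},           # varies per call
        ],
    )
    return response.choices[0].message.content
```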
We are calling for the world’s best AI Engineer talks for AI Architects, /r/localLlama, Model Context Protocol (MCP), GraphRAG, AI in Action, Evals, Agent Reliability, Reasoning and RL, Retrieval/Search/RecSys, Security, Infrastructure, Generative Media, AI Design & Novel AI UX, AI Product Management, Autonomy, Robotics, and Embodied Agents, Computer-Using Agents (CUA), SWE Agents, Vibe Coding, Voice, Sales/Support Agents at AIEWF 2025! Fill out the 2025 State of AI Eng survey for $250 in Amazon cards and see you from Jun 3-5 in SF! CoreWeave’s now-successful IPO has led to a lot of questions about the GPU Neocloud market, which Dylan Patel has written extensively about on SemiAnalysis. Understanding markets requires an interesting mix of technical and financial expertise, so this will be a different kind of episode than our usual LS fare. When we first published $2 H100s: How the GPU Rental Bubble Burst, we got 2 kinds of reactions on Hacker News: * “Ah, now the AI bubble is imploding!” * “Duh, this is how it works in every GPU cycle, are you new here?” We don’t think either reaction is quite right. Specifically, it is not normal for the prices of one of the world’s most important resources right now to swing from $1 to $8 per hour based on drastically inelastic demand AND supply curves - from 3-year lock-in contracts to stupendously competitive over-ordering dynamics for NVIDIA allocations — especially with increasing baseline compute needed for even the simplest academic ML research and for new AI startups getting off the ground. We’re fortunate today to have Evan Conrad, CEO of SFCompute, one of the most exciting GPU marketplace startups, talk us through his theory of the economics of GPU markets, and why he thinks CoreWeave and Modal are well positioned, but Digital Ocean and Together are not. However, more broadly, the entire point of SFC is creating liquidity between GPU owners and consumers and making it broadly tradable, even programmable. As we explore, these are the primitives that you can then use to create your own high-quality, custom GPU availability for your time and money budget, similar to how Amazon Spot Instances automated the selective buying of unused compute. The ultimate end state of where all this is going is GPUs that trade like other perishable, staple commodities of the world - oil, soybeans, milk. Because the contracts and markets are so well established, the price swings also are not nearly as drastic, and people can also start hedging and managing the risk of one of the biggest costs of their business, just like we have risk-managed commodity risks of all other sorts for centuries. As a former derivatives trader, swyx naturally double-clicked on that… Show Notes * SF Compute * Evan Conrad * Ethan Anderson * John Phamous * The Curve talk * CoreWeave * Andromeda Cluster Full Video Pod Like and subscribe!
Timestamps * [00:00:05] Introductions * [00:00:12] Introduction of guest Evan Conrad from SF Compute * [00:00:12] CoreWeave Business Model Discussion * [00:05:37] CoreWeave as a Real Estate Business * [00:08:59] Interest Rate Risk and GPU Market Strategy Framework * [00:16:33] Why Together and DigitalOcean will lose money on their clusters * [00:20:37] SF Compute's AI Lab Origins * [00:25:49] Utilization Rates and Benefits of SF Compute Market Model * [00:30:00] H100 GPU Glut, Supply Chain Issues, and Future Demand Forecast * [00:34:00] P2P GPU networks * [00:36:50] Customer stories * [00:38:23] VC-Provided GPU Clusters and Credit Risk Arbitrage * [00:41:58] Market Pricing Dynamics and Preemptible GPU Pricing Model * [00:48:00] Future Plans for Financialization? * [00:52:59] Cluster auditing and quality control * [00:58:00] Futures Contracts for GPUs * [01:01:20] Branding and Aesthetic Choices Behind SF Compute * [01:06:30] Lessons from Previous Startups * [01:09:07] Hiring at SF Compute Transcript Alessio [00:00:05]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:12]: Hey, and today we're so excited to be finally in the studio with Evan Conrad from SF Compute. Welcome. I've been fortunate enough to be your friend before you were famous, and also we've hung out at various social things. So it's really cool to see that SF Compute is coming into its own thing, and it's a significant presence, at least in the San Francisco community, which of course, it's in the name, so you couldn't help but be. Evan: Indeed, indeed. I think we have a long way to go, but yeah, thanks. Swyx: Of course, yeah. One way I was thinking about kicking on this conversation is we will likely release this right after CoreWeave IPO. And I was watching, I was looking, doing some research on you. You did a talk at The Curve. I think I may have been viewer number 70. It was a great talk. More people should go see it, Evan Conrad at The Curve. But we have like three orders of magnitude more people. And I just wanted to, to highlight, like, what is your analysis of what CoreWeave did that went so right for them? Evan: Sell locked-in long-term contracts and don't really do much short-term at all. I think like a lot of people had this assumption that GPUs would work a lot like CPUs and the like standard business model of any sort of CPU cloud is you buy commodity hardware, then you lay on services that are mostly software, and that gives you high margins and pretty much all your value comes from those services. Not really the underlying. Compute in any capacity and because it's commodity hardware and it's not actually that expensive, most of that can be sort of on-demand compute. And while you do want locked-in contracts for folks, it's mostly just a sort of de-risk situation. It helps you plan revenue because you don't know if people are going to scale up or down. But fundamentally, people are like buying hourly and that's how your business is structured and you make 50 percent margins or higher. This like doesn't really work in GPUs. And the reason why it doesn't work is because you end up with like super price sensitive customers. And that isn't because necessarily it's just way more expensive, though that's totally the case. So in a CPU cloud, you might have like, you know, let's say if you had a million dollars of hardware in GPUs, you have a billion dollars of hardware. 
And so your customers are buying at much higher volumes than you otherwise expect. And it's also smaller customers who are buying at higher amounts of volume. So relative to what they're spending in general. But in GPUs in particular, your customer cares about the scaling law. So if you take like Gusto, for example, or Rippling or an HR service like this, when they're buying from an AWS or a GCP, they're buying CPUs and they're running web servers, those web servers, they kind of buy up to the capacity that they need, they buy enough, like CPUs, and then they don't buy any more, like, they don't buy any more at all. Yeah, you have a chart that goes like this and then flat. Correct. And it's like a complete flat. It's not even like an incremental tiny amount. It's not like you could just like turn on some more nodes. Yeah. And then suddenly, you know, they would make an incremental amount of money more, like Gusto isn't going to make like, you know, 5% more money, they're gonna make zero, like literally zero money from every incremental GPU or CPU after a certain point. This is not the case for anyone who is training models. And it's not the case for anyone who's doing test time inference or like inference that scales at test time. Because like you, your scaling laws mean that you may have some diminishing returns, but there's always returns. Adding GPUs always means your model does actually get better. And that actually does translate into revenue for you. And then for test time inference, you actually can just like run the inference longer and get a better performance. Or maybe you can run more customers faster and then charge for that. It actually does translate into revenue. Every incremental GPU translates to revenue. And what that means from the customer's perspective is you've got like a flat budget and you're trying to max the amount of GPUs you have for that budget. And it's very distinctly different than like where a Gusto or Rippling might think, where they think, oh, we need this amount of CPUs. How do we, you know, reduce that? How do we reduce our amount of money that we're spending on this to get the same amount of CPUs? What that translates to is customers who are spending in really high volume, but also customers who are super price sensitive, who don't give a s**t. Can I swear on this? Can I swear? Yeah. Who don't give a s**t at all about your software. Because a 10% difference in a billion dollars of hardware is like $100 million of value for you. So if you have a 10% margin increase because you have great software, on your billion, the customers are that price sensitive. They will immediately switch off if they can. Because why wouldn't you? You would just take that $100 million. You'd spend $50 million on hiring a software engineering team to replicate anything that you possibly did. So that means that the best way to make money in GPUs was to do basically exactly what CoreWeave did, which is go out and sign only long-term contracts, pretty much ignore the bottom end of the market completely, and then maximize your long-term contracts with customers who don't have credit risk, who won't sue you, or are unlikely to sue you for frivolous reasons. And then because they don't have credit risk and they won't sue you for frivolous reasons, you can go back to your lender and you can say, look, this is a really low risk situation for us to do. You should give me prime, prime interest rate. You should give me the lowest cost of capital you possibly can.
And when you do that, you just
We are happy to announce that there will be a dedicated MCP track at the 2025 AI Engineer World's Fair, taking place Jun 3rd to 5th in San Francisco, where the MCP core team and major contributors and builders will be meeting. Join us and apply to speak or sponsor! When we first wrote Why MCP Won, we had no idea how quickly it was about to win. In the past 4 weeks, OpenAI and now Google have announced MCP support, effectively confirming our prediction that MCP was the presumptive winner of the agent standard wars. MCP has now overtaken OpenAPI, the incumbent option and most direct alternative, in GitHub stars (3 months ahead of the conservative trendline): We have explored the state of MCP at AIE (now the first ever >100k views workshop): And since then, we’ve added a 7th reason why MCP won - this team acts very quickly on feedback, with the 2025-03-26 spec update adding support for stateless/resumable/streamable HTTP transports, and comprehensive authz capabilities based on OAuth 2.1. This bodes very well for the future of the community and project. For protocol and history nerds, we also asked David and Justin to tell the origin story of MCP, which we leave to the reader to enjoy (you can also skim the transcripts, or, the changelogs of a certain favored IDE). It’s incredible the impact that individual engineers solving their own problems can have on an entire industry. Full video episode Like and subscribe on YouTube! Show Links * David * Justin * MCP * Why MCP Won Timestamps * 00:00 Introduction and Guest Welcome * 00:37 What is MCP? * 02:00 The Origin Story of MCP * 05:18 Development Challenges and Solutions * 08:06 Technical Details and Inspirations * 29:45 MCP vs Open API * 32:48 Building MCP Servers * 40:39 Exploring Model Independence in LLMs * 41:36 Building Richer Systems with MCP * 43:13 Understanding Agents in MCP * 45:45 Nesting and Tool Confusion in MCP * 49:11 Client Control and Tool Invocation * 52:08 Authorization and Trust in MCP Servers * 01:01:34 Future Roadmap and Stateless Servers * 01:10:07 Open Source Governance and Community Involvement * 01:18:12 Wishlist and Closing Remarks Transcript Alessio [00:00:02]: Hey, everyone. Welcome back to Latent Space. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host Swyx, founder of Smol AI. swyx [00:00:10]: Hey, morning. And today we have a remote recording, I guess, with David and Justin from Anthropic over in London. Welcome. Hey, good morning. You guys have created a storm of hype because of MCP, and I'm really glad to have you on. Thanks for making the time. What is MCP? Let's start with a crisp definition from the horse's mouth, and then we'll go into the origin story. But let's start off right off the bat. What is MCP? Justin/David [00:00:43]: Yeah, sure. So Model Context Protocol, or MCP for short, is basically something we've designed to help AI applications extend themselves or integrate with an ecosystem of plugins, basically. The terminology is a bit different. We use this client-server terminology, and we can talk about why that is and where that came from. But at the end of the day, it really is that. It's like extending and enhancing the functionality of AI applications. swyx [00:01:05]: David, would you add anything? Justin/David [00:01:07]: Yeah, I think that's actually a good description. I think there's like a lot of different ways for how people are trying to explain it. But at the core, I think what Justin said is like extending AI applications is really what this is about.
And I think the interesting bit here that I want to highlight, it's AI applications and not models themselves that this is focused on. That's a common misconception that we can talk about a bit later. But yeah. Another version that we've used and gotten to like is like MCP is kind of like the USB-C port of AI applications and that it's meant to be this universal connector to a whole ecosystem of things. swyx [00:01:44]: Yeah. Specifically, an interesting feature is, like you said, the client and server. And it's a sort of two-way, right? Like in the same way that a USB-C is two-way, which could be super interesting. Yeah, let's go into a little bit of the origin story. There's many people who've tried to make standards. There's many people who've tried to build open source. I think there's an overall, also, my sense is that Anthropic is going hard after developers in the way that other labs are not. And so I'm also curious if there was any external influence or was it just you two guys just in a room somewhere riffing? Justin/David [00:02:18]: It is actually mostly like us two guys in a room riffing. So this is not part of a big strategy. You know, if you roll back time a little bit and go into like July 2024. I was like, started. I started at Anthropic like three months earlier or two months earlier. And I was mostly working on internal developer tooling, which is what I've been doing for like years and years before. And as part of that, I think there was an effort of like, how do I empower more like employees at Anthropic to use, you know, to integrate really deeply with the models we have? Because we've seen these, like, how good it is, how amazing it will become even in the future. And of course, you know, just dogfoot your own model as much as you can. And as part of that, from my development tooling background, I quickly got frustrated by the idea that, you know, on one hand side, I have Claude Desktop, which is this amazing tool with artifacts, which I really enjoyed. But it was very limited to exactly that feature set. And there was no way to extend it. And on the other hand side, I like work in IDEs, which could greatly like act on like the file system and a bunch of other things. But then they don't have artifacts or something like that. And so what I constantly did was just copy things back and forth between Claude Desktop and the IDE, and that quickly got me, honestly, just very frustrated. And part of that frustration was like, how do I go and fix this? What, what do we need? And back to like this development developer, like focus that I have, I really thought about like, well, I know how to build all these integrations, but what do I need to do to let these applications let me do this? And so it's very quickly that you see that this is clearly like an M times N problem. Like you have multiple like applications. And multiple integrations you want to build and like, what better way is there to fix this than using a protocol. And at the same time, I was actually working on an LSP-related thing internally that didn't go anywhere. But you put these things together in someone's brain and let them wait for like a few weeks. And out of that comes like the idea of like, let's build some, some protocol. And so back to like this little room, like it was literally just me going to a room with Justin and go like, I think we should build something like this. Uh, this is a good idea. And Justin.
Lucky for me, he just really took an interest in the idea, um, and, and took it from there to like, to, to build something, together with me, that's really the inception story is like, it's us two, from then on, just going and building it over, over the course of like, like a month and a half of like building the protocol, building the first integration, like Justin did a lot of the, like the heavy lifting of the first integrations in Claude Desktop. I did a lot of the first, um, proof of concept of how this can look like in an IDE. And we could talk about like some of the tidbits you could find way before the official release, if you were looking at the right repositories at the right time, but there you go. That's like some of the, the rough story. Alessio [00:05:12]: Uh, what was the timeline when, I know November 25th was like the official announcement date. When did you guys start working on it? Justin/David [00:05:19]: Justin, when did we start working on that? I think it, I think it was around July. I think, yeah, I, as soon as David pitched this initial idea, I got excited pretty quickly and we started working on it, I think. I think almost immediately after that conversation and then, I don't know, it was a couple, maybe a few months of, uh, building the really unrewarding bits, if we're being honest, because for, for establishing something that's like this communication protocol has clients and servers and like SDKs everywhere, there's just like a lot of like laying the groundwork that you have to do. So it was a pretty, uh, that was a pretty slow couple of months. But then afterward, once you get some things talking over that wire, it really starts to get exciting and you can start building all sorts of crazy things. And I think this really came to a head. And I don't remember exactly when it was, maybe like approximately a month before release, there was an internal hackathon where some folks really got excited about MCP and started building all sorts of crazy applications. I think the coolest one of which was like an MCP server that can control a 3D printer or something. And so like, suddenly people are feeling this power of like Claude connecting to the outside world in a really tangible way. And that, that really added some, uh, some juice to us and to the release. Alessio [00:06:32]: Yeah. And we'll go into the technical details, but I just want to wrap up here. You mentioned you could have seen some things coming if you were looking in the right places. We always want to know what are the places to get alpha, how, how, how to find MCP early. Justin/David [00:06:44]: I'm a big Zed user. I liked the Zed editor. The first MCP implementation on an IDE was in Zed. It was written by me and it was there like a month and a half before the official release, just because we needed to do it in the open because it's an open source project. Um, and so it was, it was not, it was
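For a concrete feel of the protocol discussed above, here is a minimal MCP server using the official Python SDK's FastMCP helper (a sketch assuming the `mcp` package is installed; the server name and tool are illustrative, not from the episode). A client such as Claude Desktop connects over stdio, lists the server's tools, and lets the model invoke them:

```python
# Minimal MCP server sketch using the official Python SDK's FastMCP helper.
# Server name and tool are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the sum."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, for clients like Claude Desktop
```

Even this toy shows the client/server split the hosts describe: the server only declares capabilities; which tools actually get invoked, and when, stays under the client's control.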
If you’re in SF: Join us for the Claude Plays Pokemon hackathon this Sunday! If you’re not: Fill out the 2025 State of AI Eng survey for $250 in Amazon cards! Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs. Top guests: Noam Shazeer, Bob McGrew, Noam Brown, Dylan Patel, Percy Liang, David Luan Full Episode on Their YouTube Timestamps * 00:00 Introduction and Excitement for Collaboration * 00:27 Reflecting on Surprises in AI Over the Past Year * 01:44 Open Source Models and Their Adoption * 06:01 The Rise of GPT Wrappers * 06:55 AI Builders and Low-Code Platforms * 09:35 Overhyped and Underhyped AI Trends * 22:17 Product Market Fit in AI * 28:23 Google's Current Momentum * 28:33 Customer Support and AI * 29:54 AI's Impact on Cost and Growth * 31:05 Voice AI and Scheduling * 32:59 Emerging AI Applications * 34:12 Education and AI * 36:34 Defensibility in AI Applications * 40:10 Infrastructure and AI * 47:08 Challenges and Future of AI * 52:15 Quick Fire Round and Closing Remarks Transcript [00:00:00] Introduction and Podcast Overview [00:00:00] Jacob: well, thanks so much for doing this, guys. I feel like we've been excited to do a collab for a while. I [00:00:13] swyx: love crossovers. Yeah. Yeah. This, this is great. Like the ultimate meta about just podcasters talking to other podcasters. Yeah. It's a lot. Podcasts all the way up. [00:00:21] Jacob: I figured we'd have a pretty free-ranging conversation today but brought a few conversation starters to, to, to kick us off. [00:00:27] Reflecting on AI Surprises and Trends [00:00:27] Jacob: And so I figured one interesting place to start is you know, obviously it feels that this world is changing like every few months. Wondering as you guys reflect back on the past year, like what surprised you the most? [00:00:36] Alessio: I think definitely reasoning models we kinda got right here. Like, oh, that, well, I, I think there's, there's like the, what surprised us in a good way. May maybe in a, in a bad way. I would say in a good way: reasoning models, and I think the release of them right after the NeurIPS "scaling is dead" talk by Ilya. I think there was maybe like a, a little "it's so over" and then "we're so back" in like such a short, short period. It was really [00:01:00] fortuitous [00:01:00] Jacob: timing though, like right. [00:01:01] As pre-training died, I mean, obviously I'm sure within the labs they knew pre-training was dying and had to find something. But you know, from the outside it was it, it felt like one right into the other. [00:01:09] Alessio: Yeah. Yeah, exactly. So that, that was a good surprise, [00:01:12] swyx: I would say, if you wanna make that comment about timing, I think it's suspiciously neat that like, because we know that Strawberry was being worked on for like two years-ish. [00:01:20] Like, and we know exactly when Noam joined OpenAI, and that was obviously a big strategic bet by OpenAI. So like, for it to transition, so transition so nicely when like, pre-training is kind of tapped out to, into like, oh, now inference time is, is the new scaling law is like very convenient. I, I, I like if there were an Illuminati, this would be what they planned. [00:01:41] Or if we're living in a simulation or something. Yeah.
[00:01:44] Open Source Models and Their Impact [00:01:44] swyx: Then you said open source [00:01:45] Alessio: as well? Yeah. Well, no, I, I think like open source. Yeah. We're discussing this on the negative. I would say the relevance of open source, I would say specifically open models. Yeah, I was surprised by the lack of adoption, like the Llamas of the world. [00:01:56] And I mean, people use it obviously, but I would say nobody's [00:02:00] really like a huge fanboy, you know, I think the local Llama community and some of the more obvious use cases really like it. But when we talk to like enterprise folks, it's like, it's cool, you know? And I think people love to argue about licenses and all of that, but the reality is that it doesn't really change the adoption path of, of AI. [00:02:18] So [00:02:19] swyx: yeah, the specific stat that I got from Ankur from Braintrust mm-hmm. In one of the episodes that we did was I think he estimated that open source model usage in enterprises is at like 5% and going down. [00:02:31] Jacob: And it feels like you're basically all these enterprises are in like use case discovery mode, where it's like, let's just take what we think is the most powerful model and figure out if we can find anything that works. [00:02:39] And, you know, so much of, of, of it feels like discovery of that. And then, right, as you've discovered something, a new generation of models are out and so you have to go do discovery with those. And you know, I think obviously we're probably optimistic that the open source models increase in uptake. [00:02:50] It's funny, I was gonna say my biggest surprise in the last year was open source related, but it was just how fast open source caught up on the reasoning models. It was kind of unclear to me, like over time whether there would be, you know, [00:03:00] a compounding advantage for some of the closed source models where, okay, in the early days of, of scaling you know, there was a, a tight time loop, but over time, you know, would would the gap increase? [00:03:08] And if anything it feels like it shrunk. You know, and I think DeepSeek specifically was just really surprising in how, you know, in many ways if the value of these model companies is like you have a model for a period of time and you're the only one that can build products on top of that model while you have it. Like, God, that time period is a lot shorter than a, than I thought it was gonna be a year ago. [00:03:25] swyx: Yeah. I mean, again, I I, I don't like this label of how fast open source caught up because it's really how fast DeepSeek caught up. Right. And now we have, like, I think some of it is that DeepSeek is basically gonna stop open sourcing models. [00:03:36] Yeah. So like there, there's no team open source, there's just different companies and they choose to open source or not. And we got lucky with DeepSeek releasing something and then everyone else is basically distilling from DeepSeek, and those are distillations. Catching up is such an easier, lower bar than like actually catching up, which is like you, you are like from scratch. [00:03:56] You're training something that like is competitive on that front. I don't know if [00:04:00] that's happening. Like basically the only player right now is, we're waiting for Llama 4. [00:04:03] Jordan: I mean, it's always an order of magnitude cheaper to replicate what's already been done than to create something fundamentally new.
[00:04:09] And so that's why I think DeepSeek overall was overhyped. Right? I mean obviously it's a good open source, new entrant, but at the same time there's nothing new fundamentally there other than sort of executing what's already been done really well. [00:04:21] Alessio: Yeah, [00:04:21] Jordan: right. [00:04:21] Alessio: So, well, but I think the traces are like maybe the biggest thing. I think most previous open models were like the same model, just a little worse and cheaper. [00:04:30] Yeah. Like R1 is like the first model that had the full traces. So I think that's like a net unique thing in open source. But yeah, I, I think like we talked about DeepSeek in our end of year 2023 recap, and we were mostly focused on cheaper inference. Like we didn't really have... DeepSeek V3 [00:04:47] swyx: was out then, and we were like, that was already like talking about fine-grained mixture of experts and all that. [00:04:51] Like that's a great receipt to [00:04:52] Jacob: have [00:04:52] swyx: to be like, yeah. [00:04:52] Jacob: End [00:04:53] swyx: of year 20. Yeah. That's a, [00:04:54] Jacob: that's a, that's, that's an [00:04:55] swyx: impressive one. You follow the right whale believers on Twitter, it's, it's like [00:05:00] pretty obvious. I actually had like, so, you know, I used to be in finance and, and a lot, a lot of my hedge fund and PE friends called me up. [00:05:06] They were like, why didn't you tip us off on DeepSeek? And I'm like, well, I mean, it's been there. It's, it's actually like kind of surprising that like, Nvidia fell like what, 15% in one day? Yeah. Because of DeepSeek. And I, I think it's just like whatever the market, public market narrative decides is a story, becomes the story, but really like the technical movements are usually [00:05:26] one to two years in the making before that. [00:05:27] Jacob: Basically these people were telling on themselves that they didn't listen to your podcast. It's been in the end of year 22, 23 recaps. No, no, [00:05:32] swyx: no. Like yeah, we weren't, we weren't like banging the drum. So like it's also on us to be like, no, like this. This is an actual tipping point. [00:05:38] And I think, like, as people who are, like, our function as podcasters and industry analysts is to raise the bar or focus attention on things that you think matter. And sometimes we're too passive about it. And I think I was too passive there. I'd be, I'd be happy to own up on that. [00:05:52] Jacob: No, I feel like over time you guys have moved into this more general role of like taking stances on things that are or aren't important and, you know, I feel like you've done that with MCP of [00:06:00] late and a bunch of [00:06:00] swyx: things. [00:06:00] Yeah. [00:06:01] Challenges and Opportunities in AI Engineering [00:06:01] swyx: So like the, the general push is AI engineering, you know, like, gotta, gotta rep the shirt. And MCP is part
For this episode: Thanks to Matija and Dan and Meng Shao for sharing on socials. We are SO excited to share our conversation with Dharmesh Shah, co-founder of HubSpot and creator of Agent.ai. A particularly compelling concept we discussed is the idea of "hybrid teams" - the next evolution in workplace organization where human workers collaborate with AI agents as team members. Just as we previously saw hybrid teams emerge in terms of full-time vs. contract workers, or in-office vs. remote workers, Dharmesh predicts that the next frontier will be teams composed of both human and AI members. This raises interesting questions about team dynamics, trust, and how to effectively delegate tasks between human and AI team members. The discussion of business models in AI reveals an important distinction between Work as a Service (WaaS) and Results as a Service (RaaS), something Dharmesh has written extensively about. While RaaS has gained popularity, particularly in customer support applications where outcomes are easily measurable, Dharmesh argues that this model may be over-indexed. Not all AI applications have clearly definable outcomes or consistent economic value per transaction, making WaaS more appropriate in many cases. This insight is particularly relevant for businesses considering how to monetize AI capabilities. The technical challenges of implementing effective agent systems are also explored, particularly around memory and authentication. Shah emphasizes the importance of cross-agent memory sharing and the need for more granular control over data access. He envisions a future where users can selectively share parts of their data with different agents, similar to how OAuth works but with much finer control. This points to significant opportunities in developing infrastructure for secure and efficient agent-to-agent communication and data sharing. Other highlights from our conversation * The Evolution of AI-Powered Agents – Exploring how AI agents have evolved from simple chatbots to sophisticated multi-agent systems, and the role of MCPs in enabling that. * Hybrid Digital Teams and the Future of Work – How AI agents are becoming teammates rather than just tools, and what this means for business operations and knowledge work. * Memory in AI Agents – The importance of persistent memory in AI systems and how shared memory across agents could enhance collaboration and efficiency. * Business Models for AI Agents – Exploring the shift from software as a service (SaaS) to work as a service (WaaS) and results as a service (RaaS), and what this means for monetization. * The Role of Standards Like MCP – Why MCP has been widely adopted and how it enables agent collaboration, tool use, and discovery. * The Future of AI Code Generation and Software Engineering – How AI-assisted coding is changing the role of software engineers and what skills will matter most in the future. * Domain Investing and Efficient Markets – Dharmesh’s approach to domain investing and how inefficiencies in digital asset markets create business opportunities. * The Philosophy of Saying No – Lessons from "Sorry, Must Pass" and how prioritization leads to greater productivity and focus. Full Video Episode on youtube!
Timestamps * 00:00 Introduction and Guest Welcome * 02:29 Dharmesh Shah's Journey into AI * 05:22 Defining AI Agents * 06:45 The Evolution and Future of AI Agents * 13:53 Graph Theory and Knowledge Representation * 20:02 Engineering Practices and Overengineering * 25:57 The Role of Junior Engineers in the AI Era * 28:20 Multi-Agent Systems and MCP Standards * 35:55 LinkedIn's Legal Battles and Data Scraping * 37:32 The Future of AI and Hybrid Teams * 39:19 Building Agent AI: A Professional Network for Agents * 40:43 Challenges and Innovations in Agent AI * 45:02 The Evolution of UI in AI Systems * 01:00:25 Business Models: Work as a Service vs. Results as a Service * 01:09:17 The Future Value of Engineers * 01:09:51 Exploring the Role of Agents * 01:10:28 The Importance of Memory in AI * 01:11:02 Challenges and Opportunities in AI Memory * 01:12:41 Selective Memory and Privacy Concerns * 01:13:27 The Evolution of AI Tools and Platforms * 01:18:23 Domain Names and AI Projects * 01:32:08 Balancing Work and Personal Life * 01:35:52 Final Thoughts and Reflections Transcript Alessio [00:00:04]: Hey everyone, welcome back to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Small AI. swyx [00:00:12]: Hello, and today we're super excited to have Dharmesh Shah join us. I guess your relevant title here is founder of Agent AI. Dharmesh [00:00:20]: Yeah, that's true for this. Yeah, creator of Agent.ai and co-founder of HubSpot. swyx [00:00:25]: Co-founder of HubSpot, which I followed for many years, I think 18 years now, gonna be 19 soon. And, you know, people can catch up on your HubSpot story elsewhere. I should also thank Shaan Puri, who I've chatted with back and forth, who's been, I guess, getting me in touch with your people. But also, I think like, just giving us a lot of context, because obviously, My First Million joined you guys, and they've been chatting with you guys a lot. So for the business side, we can talk about that, but I kind of wanted to engage your CTO, agent, engineer side of things. So how did you get agent religion? Dharmesh [00:01:00]: Let's see. So I've been working, I'll take like a half step back, a decade or so ago, even though actually more than that. So even before HubSpot, the company I was contemplating, that I had a name for, was called Ingenisoft. And the idea behind Ingenisoft was a natural language interface to business software. Now realize this is 20 years ago, so that was a hard thing to do. But the actual use case that I had in mind was, you know, we had data sitting in business systems like a CRM or something like that. And my kind of, what I thought was clever at the time, idea: oh, what if we used email as the kind of interface to get to business software? And the motivation for using email is that it automatically works when you're offline. So imagine I'm getting on a plane or I'm on a plane. There was no internet on planes back then. It's like, oh, I'm going through business cards from an event I went to. I can just type things into an email just to have them all in the backlog. When it reconnects, it sends those emails to a processor that basically kind of parses effectively the commands and updates the software, sends you the file, whatever it is. And there was a handful of commands. I was a little bit ahead of the times in terms of what was actually possible. And I reattempted this natural language thing with a product called ChatSpot that I did back 20...
swyx [00:02:12]: Yeah, this is your first post-ChatGPT project. Dharmesh [00:02:14]: I saw it come out. Yeah. And so I've always been kind of fascinated by this natural language interface to software. Because, you know, as software developers, myself included, we've always said, oh, we build intuitive, easy-to-use applications. And it's not intuitive at all, right? Because what we're doing is... We're taking the mental model that's in our head of what we're trying to accomplish with said piece of software and translating that into a series of touches and swipes and clicks and things like that. And there's nothing natural or intuitive about it. And so natural language interfaces, for the first time, you know, whatever the thought is you have in your head and expressed in whatever language that you normally use to talk to yourself in your head, you can just sort of emit that and have software do something. And I thought that was kind of a breakthrough, which it has been. And it's gone. So that's where I first started getting into the journey. I started because now it actually works, right? So once we got ChatGPT, you could take, even with a few-shot example, convert something into structured output. Even back in the GPT-3.5 days, it did a decent job with a few-shot example, converting something to structured text if you knew what kinds of intents you were going to have. And so that happened. And that ultimately became a HubSpot project. But then agents intrigued me because I'm like, okay, well, that's the next step here. So chat's great. Love Chat UX. But if we want to do something even more meaningful, it felt like the next kind of advancement is not this kind of, I'm chatting with some software in a kind of a synchronous back and forth model, is that software is going to do things for me in kind of a multi-step way to try and accomplish some goals. So, yeah, that's when I first got started. It's like, okay, what would that look like? Yeah. And I've been obsessed ever since, by the way. Alessio [00:03:55]: Which goes back to your first experience with it, which is like you're offline. Yeah. And you want to do a task. You don't need to do it right now. You just want to queue it up for somebody to do it for you. Yes. As you think about agents, like, let's start at the easy question, which is like, how do you define an agent? Dharmesh [00:04:12]: Maybe. You mean the hardest question in the universe? Is that what you mean? Alessio: You said you have an irritating take. Dharmesh: I do have an irritating take. I think, well, some number of people have been irritated, including within my own team. So I have a very broad definition for agents, which is it's AI-powered software that accomplishes a goal. Period. That's it. And what irritates people about it is like, well, that's so broad as to be completely non-useful. And I understand that. I understand the criticism. But in my mind, if you kind of fast forward months, I guess, in AI years, the implementation of it, and we're already starting to see this, and we'll talk ab
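(An aside for builders: the "few-shot to structured text" trick Dharmesh describes is now a first-class API feature. A minimal sketch using OpenAI structured outputs; the CRM command schema here is our own invention for illustration, not ChatSpot's actual format:)

```python
# pip install openai pydantic  (structured extraction sketch; schema is hypothetical)
from pydantic import BaseModel
from openai import OpenAI

class CrmCommand(BaseModel):
    intent: str        # e.g. "add_contact", "log_meeting" (illustrative intents)
    contact_name: str
    note: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract a CRM command from the user's email."},
        {"role": "user", "content": "Met Jane Doe at the expo, remind me to send her the deck."},
    ],
    response_format=CrmCommand,  # the SDK constrains the output to this schema
)
print(completion.choices[0].message.parsed)
```

The same pattern that took careful few-shot prompting in the GPT-3.5 era is now enforced by the API itself, which is why "email to structured command" products became feasible.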
We are working with Amplify on the 2025 State of AI Engineering Survey to be presented at the AIE World’s Fair in SF! Join the survey to shape the future of AI Eng! We first met Snipd (affiliate link! we get a free month, you get a free month. but this is not a sponsored pod, we’ve never done one) over a year ago, and were immediately impressed by the design, but were doubtful about snipping as the titular behavior: Podcast apps are enormously sticky - Spotify spent almost $1b in podcast acquisitions and exclusive content just to get an 8% bump in market share among normies. However, after a disappointing Overcast 2.0 rewrite with no AI features in the last 3 years, I finally bit the bullet and switched to Snipd. It’s 2025, your podcast app should be able to let you search transcripts of your podcasts. Snipd is the best implementation of this so far. And yet they keep shipping: What impressed us wasn’t just how this tiny team of 4 was able to bootstrap a consumer AI app against massive titans and do so well; but also how seriously they think about learning through podcasts and improving retention of knowledge over time, aka “Duolingo for podcasts”. As an educational AI podcast, that’s a mission we can get behind. Full Video Pod Find us on YouTube! This was the first pod we’ve ever shot outdoors! Show Notes * How does Shazam work? * Flutter/FlutterFlow * wav2vec paper * Perplexity Online LLM * Google Search Grounding * Comparing Snipd transcription with our Bee episode * NIPS 2017 Flo Rida * Gustav Söderström - Background Audio Timestamps * [00:00:03] Takeaways from AI Engineer NYC * [00:00:17] Weather in New York. * [00:00:26] Swyx and Snipd. * [00:01:01] Kevin's AI summit experience. * [00:01:31] Zurich and AI. * [00:03:25] SigLIP authors join OpenAI. * [00:03:39] Zurich is very costly. * [00:04:06] The Snipd origin story. * [00:05:24] Introduction to machine learning. * [00:09:28] Snipd and user knowledge extraction. * [00:13:48] App's tech stack, Flutter, Python. * [00:15:11] How speakers are identified. * [00:18:29] The concept of "backgroundable" video. * [00:29:05] Voice cloning technology. * [00:31:03] Using AI agents. * [00:34:32] Snipd's future is multi-modal AI. * [00:36:37] Snipd and existing user behaviour. * [00:42:10] The app, summary, and timestamps. * [00:55:25] The future of AI and podcasting. * [1:14:55] Voice AI Transcript swyx [00:00:03]: Hey, I'm here in New York with Kevin Ben-Smith of Snipd. Welcome. Kevin [00:00:07]: Hi. Hi. Amazing to be here. swyx [00:00:09]: Yeah. This is our first ever, I think, outdoors podcast recording. Kevin [00:00:14]: It's quite a location for the first time, I have to say. swyx [00:00:18]: I was actually unsure because, you know, it's cold. It's like, I checked the temperature. It's like kind of one degree Celsius, but it's not that bad with the sun. No, it's quite nice. Yeah. Especially with our beautiful tea. With the tea. Yeah. Perfect. We're going to talk about Snipd. I'm a Snipd user. I had to basically, you know, apart from Twitter, it's like the number one used app on my phone. Nice. When I wake up in the morning, I open Snipd and I, you know, see what's new. And I think in terms of time spent or usage on my phone, I think it's number one or number two. Nice. Nice. So I really had to talk about it also because I think people interested in AI want to think about like, how can we, we're an AI podcast, we have to talk about the AI podcast app. But before we get there, we just finished.
We just finished the AI Engineer Summit and you came for the two days. How was it? Kevin [00:01:07]: It was quite incredible. I mean, for me, the most valuable was just being in the same room with like-minded people who are building the future and who are seeing the future. You know, especially when it comes to AI agents, it's so often I have conversations with friends who are not in the AI world. And it's like so quickly it happens that you, it sounds like you're talking in science fiction. And it's just crazy talk. It was, you know, it's so refreshing to talk with so many other people who already see these things and yeah, be inspired then by them and not always feel like, like, okay, I think I'm just crazy. And like, this will never happen. It really is happening. And for me, it was very valuable. So day two, more relevant, more relevant for you than day one. Yeah. Day two. So day two was the engineering track. Yeah. That was definitely the most valuable for me. Like also as a, as a practitioner myself, especially there were one or two talks that had to do with voice AI and AI agents with voice. Okay. So that was quite fascinating. Also spoke with the speakers afterwards. Yeah. And yeah, they were also very open and, and, you know, this, this sharing attitude that's, I think in general, quite prevalent in the AI community. I also learned a lot, like really practical things that I can now take away with me. Yeah. swyx [00:02:25]: I mean, on my side, I, I think I watched only like half of the talks. Cause I was running around and I think people saw me like towards the end, I was kind of collapsing. I was on the floor, like, uh, towards the end because I, I needed to get, to get a rest, but yeah, I'm excited to watch the voice AI talks myself. Kevin [00:02:43]: Yeah. Yeah. Do that. And I mean, from my side, thanks a lot for organizing this conference for bringing everyone together. Do you have anything like this in Switzerland? The short answer is no. Um, I mean, I have to say the AI community in, especially Zurich, where, yeah, where we're, where we're based, yeah, it is quite good. And it's growing, uh, especially driven by ETH, the, the technical university there and all of the big companies, they have AI teams there. Google, like Google has the biggest tech hub outside of the US in Zurich. Yeah. Facebook is doing a lot in Reality Labs. Uh, Apple has a secret AI team, OpenAI, and then SwapBit just announced that they're coming to Zurich. Yeah. Um, so there's a lot happening. Yeah. swyx [00:03:23]: So, yeah, uh, I think the most recent notable move, I think the entire vision team from Google, uh, Lucas Beyer, um, and, and all the other authors of SigLIP left Google to join OpenAI, which I thought was like, it's like a big move for a whole team to move all at once at the same time. So I've been to Zurich and it just feels expensive. Like it's a great city. Yeah. It's a great university, but I don't see it as like a business hub. Is it a business hub? I guess it is. Right. Kevin [00:03:51]: Like it's kind of, well, historically it's, uh, it's a finance hub, finance hub. Yeah. I mean, there are some, some large banks there, right? Especially UBS, uh, the, the largest wealth manager in the world, but it's really becoming more of a tech hub now with all of the big, uh, tech companies there. swyx [00:04:08]: I guess. Yeah. Yeah. And, but we, and research wise, it's all ETH. Yeah. There's some other things. Yeah. Yeah. Yeah. Kevin [00:04:13]: It's all driven by ETH.
And then, uh, its sister university EPFL, which is in Lausanne. Okay. Um, which they're also doing a lot, but, uh, it's, it's, it's really ETH. Uh, and otherwise, no, I mean, it's a beautiful, really beautiful city. I can recommend to anyone to come, uh, visit Zurich, uh, uh, let me know, happy to show you around and of course, you know, you, you have the nature so close, you have the mountains so close, you have so, so beautiful lakes. Yeah. Um, I think that's what makes it such a livable city. Yeah. swyx [00:04:42]: Um, and the cost is not, it's not cheap, but I mean, we're in New York City right now and, uh, I don't know, I paid $8 for a coffee this morning, so, uh, the coffee is cheaper in Zurich than in New York City. Okay. Okay. Let's talk about Snipd. What is Snipd and, you know, then we'll talk about your origin story, but I just, let's, let's get a crisp, what is Snipd? Yeah. Kevin [00:05:03]: I always see two definitions of Snipd, so I'll give you one really simple, straightforward one, and then a second more nuanced, um, which I think will be valuable for the rest of our conversation. So the most simple one is just to say, look, we're an AI-powered podcast app. So if you listen to podcasts, we're now providing this AI-enhanced experience. But if you look at the more nuanced, uh, perspective, it's actually, we, we have a very big focus on people who, like your audience, listen to podcasts to learn something new. Like your audience, you want, they want to learn about AI, what's happening, what's, what's, what's the latest research, what's going on. And we want to provide a, a spoken audio platform where you can do that most effectively. And AI is basically the way that we can achieve that. Yeah. swyx [00:05:53]: Means to an end. Yeah, exactly. When you started, was it always meant to be AI or is it, was it more about the social sharing? Kevin [00:05:59]: So the first version that we ever released was like three and a half years ago. Okay. Yeah. So this was before ChatGPT. Before Whisper. Yeah. Before Whisper. Yeah. So I think a lot of the features that we now have in the app, they weren't really possible yet back then. But we already from the beginning, we always had the focus on knowledge. That's the reason why, you know, we in our team, why we listen to podcasts, but we did have a bit of a different approach. Like the idea in the very beginning was, so the name is Snipd and you can create these, what we call snips, which is basically a small snippet, like a clip from a, from a podcast. And we did envision sort of like a, like a social TikTok platform where some people would listen to full episodes and they would snip certain, like the best parts of it. And they would post that in a feed and other users would consume this feed of snips. And use that as a discovery tool or just as a means to an end. And yeah, so you would
While everyone is now repeating that 2025 is the “Year of the Agent”, OpenAI is heads down building towards it. In the first 2 months of the year they released Operator and Deep Research (arguably the most successful agent archetype so far), and today they are bringing a lot of those capabilities to the API: * Responses API * Web Search Tool * Computer Use Tool * File Search Tool * A new open source Agents SDK with integrated Observability Tools We cover all this and more in today’s lightning pod on YouTube! More details here: Responses API In our Michelle Pokrass episode we talked about the Assistants API needing a redesign. Today OpenAI is launching the Responses API, “a more flexible foundation for developers building agentic applications”. It’s a superset of the chat completions API, and the suggested starting point for developers working with OpenAI models. One of the big upgrades is the new set of built-in tools for the Responses API: Web Search, Computer Use, and Files. Web Search Tool We previously had Exa AI on the podcast to talk about web search for AI. OpenAI is also now joining the race; the Web Search API is actually a new “model” that exposes two 4o fine-tunes: gpt-4o-search-preview and gpt-4o-mini-search-preview. These are the same models that power ChatGPT Search, and are priced at $30/1000 queries and $25/1000 queries respectively. The killer feature is inline citations: you not only get a link to a page, but also a deep link to exactly where your query was answered in the result page. Computer Use Tool The model that powers Operator, called Computer-Using-Agent (CUA), is also now available in the API. The computer-use-preview model is SOTA on most benchmarks, achieving 38.1% success on OSWorld for full computer use tasks, 58.1% on WebArena, and 87% on WebVoyager for web-based interactions. As you will notice in the docs, `computer-use-preview` is both a model and a tool through which you can specify the environment. Usage is priced at $3/1M input tokens and $12/1M output tokens, and it’s currently only available to users in tiers 3-5. File Search Tool File Search was also available in the Assistants API, and it’s now coming to Responses too. OpenAI is bringing search + RAG all under one umbrella, and we’ll definitely see more people trying to find new ways to build all-in-one apps on OpenAI. Usage is priced at $2.50 per thousand queries and file storage at $0.10/GB/day, with the first GB free. Agent SDK: Swarms++! https://github.com/openai/openai-agents-python To bring it all together, after the viral reception to Swarm, OpenAI is releasing an officially supported agents framework (which was previewed at our AI Engineer Summit) with 4 core pieces: * Agents: Easily configurable LLMs with clear instructions and built-in tools. * Handoffs: Intelligently transfer control between agents. * Guardrails: Configurable safety checks for input and output validation. * Tracing & Observability: Visualize agent execution traces to debug and optimize performance. Multi-agent workflows are here to stay! OpenAI now explicitly designs for a set of common agentic patterns: Workflows, Handoffs, Agents-as-Tools, LLM-as-a-Judge, Parallelization, and Guardrails. OpenAI previewed this in part 2 of their talk at NYC: Further coverage of the launch from Kevin Weil, WSJ, and OpenAIDevs, AMA here.
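For a feel of the new surface, here is a minimal sketch of a Responses API call with the built-in web search tool, per the launch docs (the prompt is ours):

```python
# pip install openai  (Responses API + built-in web search, launch-day shape)
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # the model decides whether to search
    input="What did OpenAI ship for agent builders this week? Cite sources.",
)
print(response.output_text)  # answer text, with inline citations when search ran
```

Note the design choice: unlike chat completions, the tool is declared once and the multi-step search-then-answer loop happens server-side in a single request.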
Show Notes * Assistants API * Swarm (OpenAI) * Fine-Tuning in AI * 2024 OpenAI DevDay Recap with Romain * Michelle Pokrass episode (API lead) Timestamps * 00:00 Intros * 02:31 Responses API * 08:34 Web Search API * 17:14 Files Search API * 18:46 Files API vs RAG * 20:06 Computer Use / Operator API * 22:30 Agents SDK And of course you can catch up with the full livestream here: Transcript Alessio [00:00:03]: Hey, everyone. Welcome back to another Latent Space Lightning episode. This is Alessio, partner and CTO at Decibel, and I'm joined by Swyx, founder of Small AI. swyx [00:00:11]: Hi, and today we have a super special episode because we're talking with our old friend Romain. Hi, welcome. Romain [00:00:19]: Thank you. Thank you for having me. swyx [00:00:20]: And Nikunj, who is most famously, if anyone has ever tried to get any access to anything on the API, Nikunj is the guy. So I know your emails because I look forward to them. Nikunj [00:00:30]: Yeah, nice to meet all of you. swyx [00:00:32]: I think that we're basically convening today to talk about the new API. So perhaps you guys want to just kick off. What is OpenAI launching today? Nikunj [00:00:40]: Yeah, so I can kick it off. We're launching a bunch of new things today. We're going to do three new built-in tools. So we're launching the web search tool. This is basically ChatGPT Search, but available in the API. We're launching an improved file search tool. So this is you bringing your data to OpenAI. You upload it. We, you know, take care of parsing it, chunking it. We're embedding it, making it searchable, give you this like ready vector store that you can use. So that's the file search tool. And then we're also launching our computer use tool. So this is the tool behind the Operator product in ChatGPT. So that's coming to developers today. And to support all of these tools, we're going to have a new API. So, you know, we launched chat completions, like I think March 2023 or so. It's been a while. So we're looking for an update over here to support all the new things that the models can do. And so we're launching this new API. It is, you know, it works with tools. We think it'll be like a great option for all the future agentic products that we build. And so that is also launching today. Actually, the last thing we're launching is the Agents SDK. We launched this thing called Swarm last year where, you know, it was an experimental SDK for people to do multi-agent orchestration and stuff like that. It was supposed to be like educational, experimental, but like people, people really loved it. They like ate it up. And so we are like, all right, let's, let's upgrade this thing. Let's give it a new name. And so we're calling it the Agents SDK. It's going to have built-in tracing in the OpenAI dashboard. So lots of cool stuff going out. So, yeah. Romain [00:02:14]: That's a lot, but we said 2025 was the year of agents. So there you have it, like a lot of new tools to build these agents for developers. swyx [00:02:20]: Okay. I guess, I guess we'll just kind of go one by one and we'll leave the Agents SDK towards the end. So Responses API, I think the sort of primary concern that people have and something I think I've voiced to you guys when, when, when I was talking with you in the, in the planning process was, is chat completions going away? So I just wanted to let it, let you guys respond to the concerns that people might have.
Romain [00:02:41]: Chat Completions is definitely like here to stay, you know, it's a bare metal API we've had for quite some time. Lots of tools built around it. So we want to make sure that it's maintained and people can confidently keep on building on it. At the same time, it was kind of optimized for a different world, right? It was optimized for a pre-multi-modality world. We also optimized for kind of single-turn problems: it takes prompt in, it takes response out. And now with these agentic workflows, we, we noticed that like developers and companies want to build longer horizon tasks, you know, like things that require multiple turns to get the task accomplished. And computer use is one of those, for instance. And so that's why the Responses API came to life to kind of support these new agentic workflows. But Chat Completions is definitely here to stay. swyx [00:03:27]: And the Assistants API, uh, has a target sunset date of the first half of 2026. So this is kind of like, in my mind, there was a kind of very poetic mirroring of the API with the models. I kind of view this as like kind of the merging of the Assistants API and Chat Completions, right, into one unified Responses. So it's kind of like how GPT and the o-series models are also unifying. Romain [00:03:48]: Yeah, that's exactly the right, uh, that's the right framing, right? Like, I think we took the best of what we learned from the Assistants API, especially like being able to access tools very, uh, very like conveniently, but at the same time, like simplifying the way you have to integrate, like, you no longer have to think about six different objects to kind of get access to these tools with the Responses API. You just get one API request and suddenly you can weave in those tools, right? Nikunj [00:04:12]: Yeah, absolutely. And I think we're going to make it really easy and straightforward for Assistants API users to migrate over to the Responses API without any loss of functionality or data. So our plan is absolutely to add, you know, assistant-like objects and thread-like objects that work really well with the Responses API. We'll also add like the code interpreter tool, which is not launching today, but it'll come soon. And, uh, we'll add async mode to the Responses API, because that's another difference with, with, uh, Assistants. It will have webhooks and stuff like that, but I think it's going to be like a pretty smooth transition, uh, once we have all of that in place. And there'll be, like, a full year to migrate and, and we'll help them through any issues they, they, they face. So overall, I feel like Assistants users are really going to benefit from this longer term, uh, with this more flexible primitive. Alessio [00:05:01]: How should people think about when to use each type of API? So I know that in the past, the Assistants API was maybe more stateful, kind of like long running, many tool use, kind of like file based things. And Chat Completions is more stateless, you know, kind of like a traditional completion API. Is that still the mental model that people should have? Or like, should yo
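Stepping out of the transcript for a second: to make the Agents SDK primitives Nikunj mentions concrete (agents, handoffs, built-in tracing), here is a minimal sketch using the openai-agents package. The triage/billing setup is our toy example, not OpenAI's demo:

```python
# pip install openai-agents  (minimal agents + handoff sketch; toy example)
from agents import Agent, Runner

billing = Agent(
    name="Billing",
    instructions="Resolve billing questions politely.",
)
triage = Agent(
    name="Triage",
    instructions="Decide who should handle the user and hand off.",
    handoffs=[billing],  # control can transfer between agents mid-run
)

# Runs the agent loop; traces appear in the OpenAI dashboard automatically.
result = Runner.run_sync(triage, "I was double-charged this month.")
print(result.final_output)
```

The Swarm lineage shows: an "agent" is just a model plus instructions plus tools, and a "handoff" is the structured way one agent delegates to another.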
Special lightning pod with David Hershey from Anthropic, the person behind Claude Plays Pokémon. Sonnet 3.7 is currently trying to complete Pokémon Red live on Twitch thanks to a special harness that David built so that it can see the screen, navigate through it, remember facts about the game, and more. (Since recording, it has successfully escaped Mt Moon! You can follow along on Twitch: https://www.twitch.tv/claudeplayspokemon) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
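David's harness isn't public in full, but the core loop of any such setup is simple to sketch: read a frame from an emulator, ask a model for the next button, press it, repeat. A toy version below uses the PyBoy Game Boy emulator; the decide_button function is a placeholder where the real harness would call Claude with the screenshot plus its accumulated memory of the game:

```python
# pip install pyboy  (toy game-playing loop; NOT David's actual harness)
from pyboy import PyBoy

def decide_button(frame) -> str:
    """Placeholder for the model call: send the frame (and any remembered
    game state) to a vision model and get back one button name."""
    return "a"  # one of: a, b, up, down, left, right, start, select

pyboy = PyBoy("pokemon_red.gb")  # you must supply your own ROM
for _ in range(1000):
    frame = pyboy.screen.image        # PIL image of the current screen
    pyboy.button(decide_button(frame))
    pyboy.tick()                      # advance the emulator one frame
pyboy.stop()
```

The hard parts David solved live outside this loop: persistent memory of facts about the game, navigation aids, and recovering when the model gets stuck (see: Mt Moon).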
Today's episode is with Paul Klein, founder of Browserbase. We talked about building browser infrastructure for AI agents, the future of agent authentication, and their open source framework Stagehand. * [00:00:00] Introductions * [00:04:46] AI-specific challenges in browser infrastructure * [00:07:05] Multimodality in AI-Powered Browsing * [00:12:26] Running headless browsers at scale * [00:18:46] Geolocation when proxying * [00:21:25] CAPTCHAs and Agent Auth * [00:28:21] Building “User take over” functionality * [00:33:43] Stagehand: AI web browsing framework * [00:38:58] OpenAI's Operator and computer use agents * [00:44:44] Surprising use cases of Browserbase * [00:47:18] Future of browser automation and market competition * [00:53:11] Being a solo founder Transcript Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. swyx [00:00:12]: Hey, and today we are very blessed to have our friend, Paul Klein the Fourth, CEO of Browserbase. Welcome. Paul [00:00:21]: Thanks guys. Yeah, I'm happy to be here. I've been lucky to know both of you for like a couple of years now, I think. So it's just like we're hanging out, you know, with three ginormous microphones in front of our faces. It's a totally normal hangout. swyx [00:00:34]: Yeah. We've actually mentioned you on the podcast, I think, more often than any other Solaris tenant. Just because like you're one of the, you know, best performing, I think, LLM tool companies that have started up in the last couple of years. Paul [00:00:50]: Yeah, I mean, it's been a whirlwind of a year, like Browserbase is actually pretty close to our first birthday. So we are one year old. And going from, you know, starting a company as a solo founder to... To, you know, having a team of 20 people, you know, a series A, but also being able to support hundreds of AI companies that are building AI applications that go out and automate the web. It's just been like, really cool. It's been happening a little too fast. I think like collectively as an AI industry, let's just take a week off together. I took my first vacation actually two weeks ago, and Operator came out on the first day, and then a week later, DeepSeek came out. And I'm like on vacation trying to chill. I'm like, we got to build with this stuff, right? So it's been a breakneck year. But I'm super happy to be here and like talk more about all the stuff we're seeing. And I'd love to hear kind of what you guys are excited about too, and share with it, you know? swyx [00:01:39]: Where to start? So people, you've done a bunch of podcasts. I think I strongly recommend Jack Bridger's Scaling DevTools, as well as Turner Novak's The Peel. And, you know, I'm sure there's others. So you covered your Twilio story in the past, talked about StreamClub, you got acquired by Mux, and then you left to start Browserbase. So maybe we just start with what is Browserbase? Yeah. Paul [00:02:02]: Browserbase is the web browser for your AI. We're building headless browser infrastructure, which are browsers that run in a server environment that's accessible to developers via APIs and SDKs. It's really hard to run a web browser in the cloud. You guys are probably running Chrome on your computers, and that's using a lot of resources, right? So if you want to run a web browser or thousands of web browsers, you can't just spin up a bunch of lambdas.
You actually need to use a secure containerized environment. You have to scale it up and down. It's a stateful system. And that infrastructure is, like, super painful. And I know that firsthand, because at my last company, StreamClub, I was CTO, and I was building our own internal headless browser infrastructure. That's actually why we sold the company, is because Mux really wanted to buy our headless browser infrastructure that we'd built. And it's just a super hard problem. And I actually told my co-founders, I would never start another company unless it was a browser infrastructure company. And it turns out that's really necessary in the age of AI, when AI can actually go out and interact with websites, click on buttons, fill in forms. You need AI to do all of that work in an actual browser running somewhere on a server. And Browserbase powers that. swyx [00:03:08]: While you're talking about it, it occurred to me, not that you're going to be acquired or anything, but it occurred to me that it would be really funny if you became the Nikita Bier of headless browser companies. You just have one trick, and you make browser companies that get acquired. Paul [00:03:23]: I truly do only have one trick. I'm screwed if it's not for headless browsers. I'm not a Go programmer. You know, I'm in AI Grant. You know, Browserbase is in AI Grant. But we were the only company in that AI Grant batch that used zero dollars on AI spend. You know, we're purely an infrastructure company. So as much as people want to ask me about reinforcement learning, I might not be the best guy to talk about that. But if you want to ask about headless browser infrastructure at scale, I can talk your ear off. So that's really my area of expertise. And it's a pretty niche thing. Like, nobody has done what we're doing at scale before. So we're happy to be the experts. swyx [00:03:59]: You do have an AI thing, Stagehand. We can talk about the sort of core of Browserbase first, and then maybe Stagehand. Yeah, Stagehand is kind of the web browsing framework. Yeah. What is Browserbase? Headless Browser Infrastructure Explained Alessio [00:04:10]: Yeah. Yeah. And maybe how you got to Browserbase and what problems you saw. So one of the first things I worked on as a software engineer was integration testing. Sauce Labs was kind of like the main thing at the time. And then we had Selenium, we had Playwright, we had all these different browser things. But it's always been super hard to do. So obviously you've worked on this before. When you started Browserbase, what were the challenges? What were the AI-specific challenges that you saw versus, there's kind of like all the usual running browsers at scale in the cloud, which has been a problem for years. What are like the AI-unique things that you saw that like traditional approaches just didn't cover? Yeah. AI-specific challenges in browser infrastructure Paul [00:04:46]: First and foremost, I think back to like the first thing I did as a developer, like as a kid when I was writing code, I wanted to write code that did stuff for me. You know, I wanted to write code to automate my life. And I did that probably by using curl or Beautiful Soup to fetch data from a web server. And I think I still do that now that I'm in the cloud. And the other thing that I think is a huge challenge for me is that you can't just curl a website and parse that data. And we all know that now like, you know, taking HTML and plugging that into an LLM, you can extract insights, you can summarize.
So it was very clear that now like dynamic web scraping became very possible with the rise of large language models, or a lot easier. And that was like a clear reason why there's been more usage of headless browsers, which are necessary because a lot of modern websites don't expose all of their page content via a simple HTTP request. You know, they actually do require you to run JavaScript on the page to hydrate this. Airbnb is a great example. You go to airbnb.com. A lot of that content on the page isn't there until after they run the initial hydration. So you can't just scrape it with a curl. You need to have some JavaScript run. And a browser is that JavaScript engine that's going to actually run all those requests on the page. So web data retrieval was definitely one driver of starting Browserbase, and the rise of being able to summarize that within an LLM. Also, I was familiar with, if I wanted to automate a website, I could write one script and that would work for one website. It was very static and deterministic. But the web is non-deterministic. The web is always changing. And until we had LLMs, there was no way to write scripts that you could write once that would run on any website, that would change with the structure of the website. Click the login button. It could mean something different on many different websites. And LLMs allow us to generate code on the fly to actually control that. So I think that rise of writing the generic automation scripts that can work on many different websites, to me, made it clear that browsers are going to be a lot more useful, because now you can automate a lot more things without writing code. If you wanted to write a script to book a demo call on 100 websites, previously, you had to write 100 scripts. Now you write one script that uses LLMs to generate that script. That's why we built our web browsing framework, Stagehand, which does a lot of that work for you. But those two things, web data collection and then enhanced automation of many different websites, it just felt like big drivers for more browser infrastructure that would be required to power these kinds of features. Alessio [00:07:05]: And was multimodality also a big thing? Paul [00:07:08]: Now you can use the LLMs to look, even though the text in the DOM might not be as friendly. Maybe my hot take is I was always kind of like, I didn't think vision would be as big of a driver. For UI automation, I felt like, you know, HTML is structured text and large language models are good with structured text. But it's clear that these computer use models are often vision driven, and they've been really pushing things forward. So definitely being multimodal, like rendering the page is required to take a screenshot to give that to a computer use model to take actions on a website. And it's just another win for browsers. But I'll be honest, that wasn't what I was thinkin
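Paul's hydration point is easy to verify yourself: a raw HTTP fetch returns the pre-hydration shell, while a headless browser returns the DOM after JavaScript runs. A minimal sketch with Playwright in Python (not Browserbase's stack, just the generic version of what he's describing; any JS-heavy URL works):

```python
# pip install playwright && playwright install chromium
import asyncio
from playwright.async_api import async_playwright

async def rendered_html(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")  # let hydration finish
        html = await page.content()  # the post-JavaScript DOM, unlike curl
        await browser.close()
        return html

if __name__ == "__main__":
    print(len(asyncio.run(rendered_html("https://www.airbnb.com"))))
```

Running one of these per request is easy; running thousands concurrently, statefully, with proxies and auth, is the infrastructure problem Browserbase sells.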
While “LLM-powered Search” is as old as Perplexity and SearchGPT, and open source projects like GPTResearcher and clones like OpenDeepResearch exist, the difference with “Deep Research” products is they are both “agentic” (loosely meaning that an LLM decides the next step in a workflow, usually involving tools) and bundling custom-tuned frontier models (custom tuned o3 and Gemini 1.5 Flash). The reception to OpenAI’s Deep Research agent has been nothing short of breathless: "Deep Research is the best public-facing AI product Google has ever released. It's like having a college-educated researcher in your pocket." - Jason Calacanis “I have had [Deep Research] write a number of ten-page papers for me, each of them outstanding. I think of the quality as comparable to having a good PhD-level research assistant, and sending that person away with a task for a week or two, or maybe more. Except Deep Research does the work in five or six minutes.” - Tyler Cowen “Deep Research is one of the best bargains in technology.” - Ben Thompson “my very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone.” - sama “Using Deep Research over the past few weeks has been my own personal AGI moment. It takes 10 mins to generate accurate and thorough competitive and market research (with sources) that previously used to take me at least 3 hours.” - OAI employee “It's like a bazooka for the curious mind” - Dan Shipper “Deep research can be seen as a new interface for the internet, in addition to being an incredible agent… This paradigm will be so powerful that in the future, navigating the internet manually via a browser will be "old-school", like performing arithmetic calculations by hand.” - Jason Wei “One notable characteristic of Deep Research is its extreme patience. I think this is rapidly approaching “superhuman patience”. One realization working on this project was that intelligence and patience go really well together.” - HyungWon “I asked it to write a reference Interaction Calculus evaluator in Haskell. A few exchanges later, it gave me a complete file, including a parser, an evaluator, O(1) interactions and everything. The file compiled, and worked on my test inputs. There are some minor issues, but it is mostly correct. So, in about 30 minutes, o3 performed a job that would take me a day or so.” - Victor Taelin “Can confirm OpenAI Deep Research is quite strong. In a few minutes it did what used to take a dozen hours. The implications to knowledge work is going to be quite profound when you just ask an AI Agent to perform full tasks for you and come back with a finished result.” - Aaron Levie “Deep Research is genuinely useful” - Gary Marcus With the advent of “Deep Research” agents, we are now routinely asking models to go through 100+ websites and generate in-depth reports on any topic. The Deep Research revolution has hit the AI scene in the last 2 weeks: * Dec 11th: Gemini Deep Research (today’s guest!) rolls out with Gemini Advanced * Feb 2nd: OpenAI releases Deep Research * Feb 3rd: a dozen “Open Deep Research” clones launch * Feb 5th: Gemini 2.0 Flash GA * Feb 15th: Perplexity launches Deep Research * Feb 17th: xAI launches Deep Search In today’s episode, we welcome Aarush Selvan and Mukund Sridhar, the lead PM and tech lead for Gemini Deep Research, the originators of the entire category. 
We asked detailed questions from inspiration to implementation, why they had to finetune a special model for it instead of using the standard Gemini model, how to run evals for them, and how to think about the distribution of use cases. (We also have an upcoming Gemini 2 episode with our returning first guest Logan Kilpatrick so stay tuned 👀) Two Kinds of Inference Time Compute In just ~2 months since NeurIPS, we’ve moved from “scaling has hit a wall, LLMs might be over” to “is this AGI already?” thanks to the releases of o1, o3, and DeepSeek R1 (see our o3 post and R1 distillation lightning pod). This new jump in capabilities is now accelerating many other applications; you might remember how “needle in a haystack” was one of the benchmarks people often referenced when looking at models’ capabilities over long context (see our 1M Llama context window ep for more). It seems that we have broken through the “wall” by scaling “inference time” in two meaningful ways — one with more time spent in the model, and the other with more tool calls. Both help build better agents which are clearly more intelligent. But as we discuss on the podcast, we are currently in a “honeymoon” period of agent products where taking more time (or tool calls, or search results) is considered good, because 1) quality is hard to evaluate and 2) we don’t know the realistic upper bound to quality. We know that they’re correlated, but we don’t know to what extent and if the correlation breaks down over extended research periods (they may not). It doesn’t take a PhD to spot the perverse incentives here. Agent UX: From Sync to Async to Hybrid We also discussed the technical challenges in moving from a synchronous “chat” paradigm to the “async” world where every agent builder needs to handroll their own orchestration framework in the background. For now, most simple, first-cut implementations including Gemini and OpenAI and Bolt tend to make “locking” async experiences — while the report is generating or the plan is being executed, you can’t continue chatting with the model or editing the plan. In this case we think the OG Agent here is Devin (now GA), which has gotten it right from the beginning. Full Episode on YouTube with demo! Show Notes * Deep Research * Aarush Selvan * Mukund Sridhar * NotebookLM episode (Raiza / Usama) * Bolt * Bret Taylor Chapters * [00:00:00] Introductions * [00:00:22] Overview + Demo of Deep Research * [00:04:31] Editable chain of thought * [00:08:18] Search ranking for sources * [00:09:31] Can you DIY Deep Research? * [00:15:52] UX and research plan editing * [00:16:21] Follow-up queries and context retention * [00:21:06] Evaluating Deep Research * [00:28:06] Ontology of use cases and research patterns * [00:32:56] User perceptions of latency in Deep Research * [00:40:59] Lessons from other AI products * [00:42:12] Multimodal capabilities * [00:45:02] Technical challenges in Deep Research * [00:51:56] Can Deep Research discover new insights? * [00:54:11] Open challenges in agents * [00:57:04] Wrap up Transcript Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:13]: Hey, and today we're very honored to have in our studio Aarush and Mukund from the Deep Research team, the OG Deep Research team. Welcome. Aarush [00:00:20]: Thanks for having us. Swyx [00:00:22]: Yeah, thanks for making the trip up.
I was fortunate enough to be one of the early beta testers of Deep Research when it came out. I would say I was very keen on, I think even at the end of last year, people were already saying it was one of the most exciting agents that was coming out of Google. You know that previously we had on Raiza and Usama from the NotebookLM team. And I think this is an increasing trend that Gemini and Google are shipping interesting user-facing products that use AI. So congrats on your success so far. Yeah, it's been great. Thanks so much for having us here. Yeah. Yeah, thanks for making the trip up. And I'm also excited for your talk that is happening next week. Obviously, we have to talk about what exactly it is, but I'll ask you towards the end. So basically, okay, you know, we have the screen up. Maybe we just start at a high level for people who don't yet know. Like, what is Deep Research? Sure. Aarush [00:01:10]: So Deep Research is a feature where Gemini can act as your personal research assistant to help you learn about any topic that you want more deeply. It's really helpful for those queries where you want to go from zero to 50 really fast on a new thing. And the way it works is it takes your query, browses the web for about five minutes, and then outputs a research report for you to review and ask follow-up questions. This is one of the first times, you know, something takes about five, six minutes trying to perform your research. So there are a few challenges that brings. Like, you want to make sure you're spending that time in the computer doing what the user wants. So there are some aspects of the UX design that we can talk about as we go through an example. And then there are also challenges in browsing: the web is super fragmented, and being able to plan iteratively as, as you parse through this noisy information is a challenge by itself. Swyx [00:02:11]: Yeah. This is like the first time, sort of, Google automating yourselves searching. Like, you know, you're supposed to be the experts at search, but now you're like meta-searching and like determining the search strategy. Aarush [00:02:22]: Yeah, I think, at least we see it as two different use cases. There are things where, you know, you know exactly what you're looking for, and there search is still probably, you know, a very, you know, probably one of the best places to go. I think where Deep Research really shines is when there are like multiple facets to your question and you'd spend like a weekend, you know, just opening like 50, 60 tabs, and many times I'd just give up. And we wanted to solve that problem and, and give a great starting point for those kinds of journeys. Alessio [00:02:53]: Do we want to start a query so that it runs in the meantime and then we can chat over it? Swyx [00:02:58]: Okay, here's one query that, that we like, we love to test like super niche, random things, like things where there's like no Wikipedia page already
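A note on what "agentic" means mechanically here: the model, not the developer, decides the next step. Stripped of all product polish, the loop under every Deep Research-style agent looks roughly like this sketch; the search_web helper and the SEARCH/DONE convention are stand-ins we made up, not Gemini's implementation:

```python
# pip install openai  (stripped-down "deep research" loop; illustrative only)
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> str:
    """Stand-in for a real search tool (search API, headless browser, etc.)."""
    raise NotImplementedError

def deep_research(topic: str, max_steps: int = 10) -> str:
    notes: list[str] = []
    for _ in range(max_steps):
        # The model, not the code, picks the next action.
        decision = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content":
                    "You are researching a topic. Reply 'SEARCH: <query>' "
                    "to gather more, or 'DONE' when you have enough."},
                {"role": "user", "content": f"Topic: {topic}\nNotes so far: {notes}"},
            ],
        ).choices[0].message.content
        if decision.startswith("SEARCH:"):
            notes.append(search_web(decision.removeprefix("SEARCH:").strip()))
        else:
            break
    # Final synthesis pass over the accumulated notes.
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Write a cited report on {topic} from these notes:\n" + "\n".join(notes)}],
    ).choices[0].message.content
```

The "honeymoon" incentive problem from the writeup is visible even here: nothing in the loop rewards stopping early, so more steps always look like more diligence.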
Bundle tickets for AIE Summit NYC have now sold out. You can now sign up for the livestream — where we will be making a big announcement soon. NYC-based readers and Summit attendees should check out the meetups happening around the Summit. 2024 was a very challenging year for AI Hardware. After the buzz of CES last January, 2024 was marked by the meteoric rise and even harder fall of AI Wearables companies like Rabbit and Humane, with an assist from a pre-wallpaper-app MKBHD. Even Friend.com, the first to launch in the AI pendant category, and which spurred Rewind AI to rebrand to Limitless and follow in their footsteps, ended up delaying their wearable ship date and launching an experimental website chatbot version. We have been cautiously excited about this category, keeping tabs on most of the top entrants, including Omi and Compass. However, to date the biggest winner still standing from the AI Wearable wars is Bee AI, founded by today's guests Maria and Ethan. Bee is an always on hardware device with beamforming microphones, 7 day battery life and a mute button, that can be worn as a wristwatch or a clip-on pin, backed by an incredible transcription, diarization and very long context memory processing pipeline that helps you to remember your day, your todos, and even perform actions by operating a virtual cloud phone. This is one of the most advanced, production ready, personal AI agents we've ever seen, so we were excited to be their first podcast appearance. We met Bee when we ran the world's first Personal AI meetup in April last year. As a user of Bee (and not an investor! just a friend!) it’s genuinely been a joy to use, and we were glad to take advantage of the opportunity to ask hard questions about the privacy and legal/ethical side of things as much as the AI and Hardware engineering side of Bee. We hope you enjoy the episode and tune in next Friday for Bee’s first conference talk: Building Perfect Memory. Full YouTube Video Version Watch this for the live demo! Show Notes * Bee Website * Ethan Sutin, Maria de Lourdes Zollo * Bee @ Personal AI Meetup * Buy Bee with Listener Discount Code! Timestamps * 00:00:00 Introductions and overview of Bee Computer * 00:01:58 Personal context and use cases for Bee * 00:03:02 Origin story of Bee and the founders' background * 00:06:56 Evolution from app to hardware device * 00:09:54 Short-term value proposition for users * 00:12:17 Demo of Bee's functionality * 00:17:54 Hardware form factor considerations * 00:22:22 Privacy concerns and legal considerations * 00:30:57 User adoption and reactions to wearing Bee * 00:35:56 CES experience and hardware manufacturing challenges * 00:41:40 Software pipeline and inference costs * 00:53:38 Technical challenges in real-time processing * 00:57:46 Memory and personal context modeling * 01:02:45 Social aspects and agent-to-agent interactions * 01:04:34 Location sharing and personal data exchange * 01:05:11 Personality analysis capabilities * 01:06:29 Hiring and future of always-on AI Transcript Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of SmallAI. swyx [00:00:12]: Hey, and today we are very honored to have in the studio Maria and Ethan from Bee. Maria [00:00:16]: Hi, thank you for having us. swyx [00:00:20]: And you are, I think, the first hardware founders we've had on the podcast. 
I've been looking to have a hardware founder, like a wearable hardware founder, for a while. I think we're going to have two or three of them this year. And you're the one that I wear every day. So thank you for making Bee. Thank you for all the feedback and the usage. Yeah, you know, I've been a big fan. You are the speaker gift for the AI Engineer World's Fair. And let's start from the beginning. What is Bee Computer? Ethan [00:00:52]: Bee Computer is a personal AI system. So you can think of it as AI living alongside you in first person, so it can kind of capture you in real life, and with that understanding it can help you in significant ways. You know, the obvious one is memory, but that's really just the base use case. So recalling and reflecting. I know, Swyx, that you like the idea of journaling, but you don't journal, but you can still have some kind of reflective summary of what you experienced in real life. But it's also about just having like the whole context of a human being and understanding, you know, giving the machine the ability to understand, like, what's going on in your life. Your attitudes, your desires, specifics about your preferences, so that not only can it help you with recall, but then anything that you need it to do, it already knows. Like, if you think about somebody who you've worked with or lived with for a long time, they just kind of know without having to ask you what you would want. It's clear that that is the future of personal AI; the AI is just so much more valuable with personal context. Maria [00:01:58]: I will say that one of the things that we are really passionate about is really understanding this personal context, because it'll make the AI more useful. Think about like a best friend that knows you so well. That's one of the things that we are seeing from the users. They're using it from a companion standpoint or for professional use cases. There are many ways to use Bee, but companionship and professional are the ones that we are seeing more now. swyx [00:02:22]: Yeah. It feels so dry to talk about use cases. Yeah. Yeah. Maria [00:02:26]: It's really like an investor question. Like, what kind of use case? Ethan [00:02:28]: We're just like, we've been so broken and trained. But I mean, on the base case, it's just like, don't you want your AI to know everything you've said and like everywhere you've been, like, wouldn't you want that? Maria [00:02:40]: Yeah. And not to sit there and repeat every time, like, oh, this is what I like. You already know that. And you do things for me based on that. That, I think, is really cool. swyx [00:02:50]: Great. Do you want to jump into a demo? Do you have any other questions? Alessio [00:02:54]: I want to maybe just cover the origin story. Just how did you two meet? What was the, was this the first idea you started working on? Was there something else before? Maria [00:03:02]: I can start. So Ethan and I, we've known each other for six years now. He had a company called Squad, and before that it was called Olabot and was a personal AI. Maybe you should tell this one. But yeah, that's how I know Ethan. He was pivoting from personal AI to Squad, which was a co-watching with friends product. I had experience working with TikTok and video content, so I helped with the pivot, and we launched Squad and it was really successful. And at the end, the founders decided to sell that to Twitter, now X. So both of us joined X.
We launched Twitter Spaces. We launched many other products. And yeah, since then, we basically continued to work together up to the start of Bee. Ethan [00:03:46]: The interesting thing is like this isn't the first attempt at personal AI. In 2016, when I started my first company, it started out as a personal AI company. This was before Transformers, no BERT even, just RNNs. You couldn't really do any convincing dialogue at all. I met Esther, who was my previous co-founder. We were both really interested in the idea of like having a machine kind of model or understand a dynamic human. We wanted to make personal AI. Because we obviously had much more limited tools, this was more geared towards like younger people. So I don't know if you remember, in 2016, there was like a brief chatbot boom. It was way premature, but it was when Zuckerberg went up on F8 with M and the Messenger platform, and people were like, oh, bots are going to replace apps. That lasted for about six months. And then everybody realized, man, these things are terrible and they're not replacing apps. But it was at that time that we got excited and we were like, we tried to make this like, oh, teach the AI about you. So it was just an app that you kind of chatted with and it would ask you questions and then give you some feedback. Maria [00:04:53]: But Hugging Face's first version was launched at the same time. Yeah, we started it. Ethan [00:04:56]: We started out in the same office as Hugging Face because Betaworks was our investor. They had a thing called Bot Camp. Betaworks is like a really cool VC because they invest in out-there things. They're like way ahead of everybody else. And back then they had this thing called Bot Camp. They took six companies and it was us and Hugging Face. And then I think the other four, I'm pretty sure, are dead. But Hugging Face was the one that really made it, and you know, I mean, a 30% success rate is pretty good. Yeah. But yeah, back then it was just the two founders. Yeah, they were kind of like an AI company in the beginning. It was a chat app for teenagers. A lot of people don't know that Hugging Face was like, hey, friend, how was school? Let's trade selfies. But then, you know, they built the Transformers library, I believe, to help them make their chat app better. And then they open sourced it and it blew up. And they were like, oh, maybe this is the opportunity. And now they're Hugging Face. But anyway, like we were obsessed with it at that time. But then it was clear that there are some people who really love chatting and like answering questions. But it's like a lot of work, like just to kind of manually. Maria [00:06:00]: Yeah. Ethan [00:06:01]: Teach like all these things about you to an AI. Maria [00:06:04]: Yeah, there were some people that were super passionate, for example, teenagers. They really like, for example, to speak ab
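The episode intro above describes Bee's pipeline as transcription, diarization, and very long context memory processing. Bee has not published that stack, so what follows is only a minimal sketch of the general shape such a pipeline could take; every name and helper below is hypothetical:

```python
# Minimal sketch of an always-on audio -> memory pipeline of the kind the
# Bee intro describes. Bee's actual stack isn't public; everything here is
# a hypothetical stand-in for illustration.

from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str    # from diarization, e.g. "speaker_0"
    text: str       # from transcription
    t_start: float  # seconds into the day

def transcribe_and_diarize(audio_chunk: bytes) -> list[Utterance]:
    # Stand-in for an ASR model plus a speaker diarizer.
    return [Utterance("speaker_0", "stub transcript", 0.0)]

def update_memory(memory: list[str], utterances: list[Utterance]) -> list[str]:
    # Compress raw transcript into summaries, todos, and durable facts so a
    # whole day can later fit in an LLM context for recall queries.
    day_so_far = " ".join(f"{u.speaker}: {u.text}" for u in utterances)
    memory.append(f"summary stub of: {day_so_far[:80]}")
    return memory

memory: list[str] = []
for chunk in [b"morning audio", b"afternoon audio"]:  # streamed from the device
    memory = update_memory(memory, transcribe_and_diarize(chunk))
print(memory)
```

The interesting engineering lives in the update step: an always-on device produces far more transcript than fits in any context window, so the pipeline has to continuously distill the day into artifacts that later queries can retrieve.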
If you’re in SF, join us tomorrow for a fun meetup at CodeGen Night! If you’re in NYC, join us for AI Engineer Summit! The Agent Engineering track is now sold out, but 25 tickets remain for AI Leadership and 5 tickets for the workshops. You can see the full schedule of speakers and workshops at https://ai.engineer! It’s exceedingly hard to introduce someone like Bret Taylor. We could recite his Wikipedia page, or his extensive work history through Silicon Valley’s greatest companies, but everyone else already does that. As a podcast by AI engineers for AI engineers, we had the opportunity to do something a little different. We wanted to dig into what Bret sees from his vantage point at the top of our industry for the last 2 decades, and how that explains the rise of the AI Architect at Sierra, the leading conversational AI/CX platform. “Across our customer base, we are seeing a new role emerge - the role of the AI architect. These leaders are responsible for helping define, manage and evolve their company's AI agent over time. They come from a variety of both technical and business backgrounds, and we think that every company will have one or many AI architects managing their AI agent and related experience.” In our conversation, Bret Taylor confirms the Paul Buchheit legend that he rewrote Google Maps in a weekend, armed with only the help of a then-nascent Google Closure Compiler and no other modern tooling. But what we find remarkable is that he was the PM of Maps, not an engineer, though of course he still identifies as one. We find this theme recurring throughout Bret’s career and worldview. We think it is plain as day that AI leadership will have to be hands-on and technical, especially when the ground is shifting as quickly as it is today: “There's a lot of power in combining product and engineering into as few people as possible… few great things have been created by committee.” “If engineering is an order taking organization for product you can sometimes make meaningful things, but rarely will you create extremely well crafted breakthrough products. Those tend to be small teams who deeply understand the customer need that they're solving, who have a maniacal focus on outcomes.” “And I think the reason why is if you look at like software as a service five years ago, maybe you can have a separation of product and engineering because most software as a service created five years ago. I wouldn't say there's like a lot of technological breakthroughs required for most business applications. And if you're making expense reporting software or whatever, it's useful… You kind of know how databases work, how to build auto scaling with your AWS cluster, whatever, you know, it's just, you're just applying best practices to yet another problem. "When you have areas like the early days of mobile development or the early days of interactive web applications, which I think Google Maps and Gmail represent, or now AI agents, you're in this constant conversation with what the requirements of your customers and stakeholders are and all the different people interacting with it and the capabilities of the technology. And it's almost impossible to specify the requirements of a product when you're not sure of the limitations of the technology itself.” This is the first time the difference between technical leadership for “normal” software and for “AI” software was articulated this clearly for us, and we’ll be thinking a lot about this going forward. 
We left a lot of nuggets in the conversation, so we hope you’ll just dive in with us (and thank Bret for joining the pod!) Full YouTube Please Like and Subscribe :) Timestamps * 00:00:02 Introductions and Bret Taylor's background * 00:01:23 Bret's experience at Stanford and the dot-com era * 00:04:04 The story of rewriting Google Maps backend * 00:11:06 Early days of interactive web applications at Google * 00:15:26 Discussion on product management and engineering roles * 00:21:00 AI and the future of software development * 00:26:42 Bret's approach to identifying customer needs and building AI companies * 00:32:09 The evolution of business models in the AI era * 00:41:00 The future of programming languages and software development * 00:49:38 Challenges in precisely communicating human intent to machines * 00:56:44 Discussion on Artificial General Intelligence (AGI) and its impact * 01:08:51 The future of agent-to-agent communication * 01:14:03 Bret's involvement in the OpenAI leadership crisis * 01:22:11 OpenAI's relationship with Microsoft * 01:23:23 OpenAI's mission and priorities * 01:27:40 Bret's guiding principles for career choices * 01:29:12 Brief discussion on pasta-making * 01:30:47 How Bret keeps up with AI developments * 01:32:15 Exciting research directions in AI * 01:35:19 Closing remarks and hiring at Sierra Transcript [00:02:05] Introduction and Guest Welcome [00:02:05] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co host swyx, founder of smol.ai. [00:02:17] swyx: Hey, and today we're super excited to have Bret Taylor join us. Welcome. Thanks for having me. It's a little unreal to have you in the studio. [00:02:25] swyx: I've read about you so much over the years, like even before. Open AI effectively. I mean, I use Google Maps to get here. So like, thank you for everything that you've done. Like, like your story history, like, you know, I think people can find out what your greatest hits have been. [00:02:40] Bret Taylor's Early Career and Education [00:02:40] swyx: How do you usually like to introduce yourself when, you know, you talk about, you summarize your career, like, how do you look at yourself? [00:02:47] Bret: Yeah, it's a great question. You know, we, before we went on the mics here, we're talking about the audience for this podcast being more engineering. And I do think depending on the audience, I'll introduce myself differently because I've had a lot of [00:03:00] corporate and board roles. I probably self identify as an engineer more than anything else though. [00:03:04] Bret: So even when I was. Salesforce, I was coding on the weekends. So I think of myself as an engineer and then all the roles that I do in my career sort of start with that just because I do feel like engineering is sort of a mindset and how I approach most of my life. So I'm an engineer first and that's how I describe myself. [00:03:24] Bret: You majored in computer [00:03:25] swyx: science, like 1998. And, and I was high [00:03:28] Bret: school, actually my, my college degree was Oh, two undergrad. Oh, three masters. Right. That old. [00:03:33] swyx: Yeah. I mean, no, I was going, I was going like 1998 to 2003, but like engineering wasn't as, wasn't a thing back then. Like we didn't have the title of senior engineer, you know, kind of like, it was just. [00:03:44] swyx: You were a programmer, you were a developer, maybe. What was it like in Stanford? Like, what was that feeling like? 
You know, was it, were you feeling like on the cusp of a great computer revolution? Or was it just like a niche, you know, interest at the time? [00:03:57] Stanford and the Dot-Com Bubble [00:03:57] Bret: Well, I was at Stanford, as you said, from 1998 to [00:04:00] 2002. [00:04:02] Bret: 1998 was near the peak of the dot com bubble. So this is back in the day where most people were coding in the computer lab, just because there were these Sun Microsystems Unix boxes there that most of us had to do our assignments on. And every single day there was a .com buying pizza for everybody. [00:04:20] Bret: I got free food, like, my first two years of university, and then the dot com bubble burst in the middle of my college career. And so by the end there was like tumbleweed going to the job fair. It was hard to describe unless you were there at the time, the level of hype: being a computer science major at Stanford was like a thousand opportunities. [00:04:45] Bret: And then, and then when I left, it was like Microsoft, IBM. [00:04:49] Joining Google and Early Projects [00:04:49] Bret: And then the two startups that I applied to were VMware and Google. And I ended up going to Google in large part because of a woman named Marissa Mayer, who had been a teaching [00:05:00] assistant when I was what was called a section leader, which was like a junior teaching assistant, kind of, for one of the big intro classes. She had gone there. [00:05:05] Bret: And she was recruiting me and I knew her and it sort of felt safe, you know. I don't know that I thought about it much, but it turned out to be a real blessing. I realized, like, you know, you always want to think you'd pick Google if given the option, but no one knew at the time. [00:05:20] Bret: And I wonder if I'd graduated in like 1999 whether I'd have been like, mom, I just got a job at pets.com. It's good. But you know, at the end I just didn't have any options. So I was like, do I want to go make kernel software at VMware? Do I want to go build search at Google? And I chose Google. 50/50 ball. It wasn't really a 50/50 ball. [00:05:36] Bret: So I feel very fortunate in retrospect that the economy collapsed because in some ways it forced me into like one of the greatest companies of all time, but I kind of lucked into it, I think. [00:05:47] The Google Maps Rewrite Story [00:05:47] Alessio: So the famous story about Google is that you rewrote the Google Maps backend in one week after the MapQuest acquisition. What was the story there? [00:05:57] Alessio: Is it actually true? Is it [00:06:00] being glorified? Like how did that come to be? And is there any detail that maybe Paul hasn't shared before? [00:06:06] Bret: It's largely true, but I'll give the color commentary. So it was actually the front end, not the back end, bu
Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focused on CodeGen use cases, co-hosted with E2B and Edge AGI; watch E2B’s new workshop and RSVP here! We’re happy to announce that today’s guest Samuel Colvin will be teaching his very first Pydantic AI workshop at the newly announced AI Engineer NYC Workshops day on Feb 22! 25 tickets left. If you’re a Python developer, it’s very likely that you’ve heard of Pydantic. Every month, it’s downloaded >300,000,000 times, making it one of the top 25 PyPi packages. OpenAI uses it in its SDK for structured outputs, it’s at the core of FastAPI, and if you’ve followed our AI Engineer Summit conference, Jason Liu of Instructor has given two great talks about it: “Pydantic is all you need” and “Pydantic is STILL all you need”. Now, Samuel Colvin has raised $17M from Sequoia to turn Pydantic from an open source project to a full stack AI engineer platform with Logfire, their observability platform, and PydanticAI, their new agent framework. Logfire: bringing OTEL to AI OpenTelemetry recently merged Semantic Conventions for LLM workloads which provides standard definitions to track performance like gen_ai.server.time_per_output_token. In Sam’s view at least 80% of new apps being built today have some sort of LLM usage in them, and just like web observability platform got replaced by cloud-first ones in the 2010s, Logfire wants to do the same for AI-first apps. If you’re interested in the technical details, Logfire migrated away from Clickhouse to Datafusion for their backend. We spent some time on the importance of picking open source tools you understand and that you can actually contribute to upstream, rather than the more popular ones; listen in ~43:19 for that part. Agents are the killer app for graphs Pydantic AI is their attempt at taking a lot of the learnings that LangChain and the other early LLM frameworks had, and putting Python best practices into it. At an API level, it’s very similar to the other libraries: you can call LLMs, create agents, do function calling, do evals, etc. They define an “Agent” as a container with a system prompt, tools, structured result, and an LLM. Under the hood, each Agent is now a graph of function calls that can orchestrate multi-step LLM interactions. You can start simple, then move toward fully dynamic graph-based control flow if needed. “We were compelled enough by graphs once we got them right that our agent implementation [...] is now actually a graph under the hood.” Why Graphs? * More natural for complex or multi-step AI workflows. * Easy to visualize and debug with mermaid diagrams. * Potential for distributed runs, or “waiting days” between steps in certain flows. In parallel, you see folks like Emil Eifrem of Neo4j talk about GraphRAG as another place where graphs fit really well in the AI stack, so it might be time for more people to take them seriously. Full Video Episode Like and subscribe! Chapters * 00:00:00 Introductions * 00:00:24 Origins of Pydantic * 00:05:28 Pydantic's AI moment * 00:08:05 Why build a new agents framework? * 00:10:17 Overview of Pydantic AI * 00:12:33 Becoming a believer in graphs * 00:24:02 God Model vs Compound AI Systems * 00:28:13 Why not build an LLM gateway? 
* 00:31:39 Programmatic testing vs live evals * 00:35:51 Using OpenTelemetry for AI traces * 00:43:19 Why they don't use Clickhouse * 00:48:34 Competing in the observability space * 00:50:41 Licensing decisions for Pydantic and LogFire * 00:51:48 Building Pydantic.run * 00:55:24 Marimo and the future of Jupyter notebooks * 00:57:44 London's AI scene Show Notes * Sam Colvin * Pydantic * Pydantic AI * Logfire * Pydantic.run * Zod * E2B * Arize * Langsmith * Marimo * Prefect * GLA (Google Generative Language API) * OpenTelemetry * Jason Liu * Sebastian Ramirez * Bogomil Balkansky * Hood Chatham * Jeremy Howard * Andrew Lamb Transcript Alessio [00:00:03]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:12]: Good morning. And today we're very excited to have Sam Colvin join us from Pydantic AI. Welcome. Sam, I heard that Pydantic is all we need. Is that true? Samuel [00:00:24]: I would say you might need Pydantic AI and Logfire as well, but it gets you a long way, that's for sure. Swyx [00:00:29]: Pydantic almost basically needs no introduction. It's almost 300 million downloads in December. And obviously, in the previous podcasts and discussions we've had with Jason Liu, he's been a big fan and promoter of Pydantic and AI. Samuel [00:00:45]: Yeah, it's weird because obviously I didn't create Pydantic originally for uses in AI, it predates LLMs. But it's like we've been lucky that it's been picked up by that community and used so widely. Swyx [00:00:58]: Actually, maybe we'll hear it. Right from you, what is Pydantic and maybe a little bit of the origin story? Samuel [00:01:04]: The best name for it, which is not quite right, is a validation library. And we get some tension around that name because it doesn't just do validation, it will do coercion by default. We now have strict mode, so you can disable that coercion. But by default, if you say you want an integer field and you get in a string of 1, 2, 3, it will convert it to 123 and a bunch of other sensible conversions. And as you can imagine, the semantics around it. Exactly when you convert and when you don't, it's complicated, but because of that, it's more than just validation. Back in 2017, when I first started it, the different thing it was doing was using type hints to define your schema. That was controversial at the time. It was genuinely disapproved of by some people. I think the success of Pydantic and libraries like FastAPI that build on top of it means that today that's no longer controversial in Python. And indeed, lots of other people have copied that route, but yeah, it's a data validation library. It uses type hints for the for the most part and obviously does all the other stuff you want, like serialization on top of that. But yeah, that's the core. Alessio [00:02:06]: Do you have any fun stories on how JSON schemas ended up being kind of like the structure output standard for LLMs? And were you involved in any of these discussions? Because I know OpenAI was, you know, one of the early adopters. So did they reach out to you? Was there kind of like a structure output console in open source that people were talking about or was it just a random? Samuel [00:02:26]: No, very much not. So I originally. Didn't implement JSON schema inside Pydantic and then Sebastian, Sebastian Ramirez, FastAPI came along and like the first I ever heard of him was over a weekend. 
I got like 50 emails from him, or 50 like emails, as he was committing to Pydantic, adding JSON schema long pre version one. So the reason it was added was for OpenAPI, which is obviously closely akin to JSON schema. And then, yeah, I don't know why it was JSON schema that got picked up and used by OpenAI. It was obviously very convenient for us, because it meant that not only can you do the validation, but because Pydantic will generate you the JSON schema, it can kind of be one source of truth for structured outputs and tools. Swyx [00:03:09]: Before we dive in further on the AI side of things, something I'm mildly curious about: obviously, there's Zod in JavaScript land. Every now and then there is a new sort of in-vogue validation library that takes over for quite a few years, and then maybe something else comes along. Is Pydantic done, like the core Pydantic? Samuel [00:03:30]: I've just come off a call where we were redesigning some of the internal bits. There will be a v3 at some point, which will not break people's code half as much as v2, as in v2 was the massive rewrite into Rust, but also fixing all the stuff that was broken back from like version zero point something that we didn't fix in v1 because it was a side project. We have plans to basically store the data in Rust types after validation. Not completely. So we're still working to design the Pythonic version of it, in order for it to be able to convert into Python types. So then if you were doing like validation and then serialization, you would never have to go via a Python type. We reckon that can give us another three to five times speed up. That's probably the biggest thing. Also, changing how easy it is to basically extend Pydantic and define how particular types, like for example NumPy arrays, are validated and serialized. But there's also stuff going on. For example, jiter, the JSON library in Rust that does the JSON parsing, has a SIMD implementation at the moment only for AMD64. So we can add that. We need to go and add SIMD for other instruction sets. So there's a bunch more we can do on performance. I don't think we're going to go and revolutionize Pydantic, but it's going to continue to get faster, continue, hopefully, to allow people to do more advanced things. We might add a binary format like CBOR for serialization for when you just want to put the data into a database and probably load it again from Pydantic. So there are some things that will come along, but for the most part, it should just get faster and cleaner. Alessio [00:05:04]: From a focus perspective, I guess, as a founder too, how did you think about the AI interest rising? And then how do you kind of prioritize, okay, this is worth going into more, and we'll talk about Pydantic AI and all of that. What was maybe your early experience with LLMs, and when did you figure out, okay, this is something we should take seriously and focus more resources on it? Samuel [00:05:28]: I
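Two quick illustrations of what is being described here. First, the coercion-by-default and strict-mode behavior Samuel explains, plus the JSON schema generation that made Pydantic a natural one-source-of-truth for structured outputs (standard Pydantic v2 API):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int      # type hints define the schema
    name: str

print(User(id="123", name="Sam").id)      # the string "123" is coerced to the int 123
print(User.model_json_schema()["type"])   # the same model emits JSON schema: "object"

try:
    # Strict mode disables the default coercion Samuel mentions.
    User.model_validate({"id": "123", "name": "Sam"}, strict=True)
except ValidationError as err:
    print(err.errors()[0]["type"])        # "int_type"
```

Second, a sketch of the PydanticAI "Agent" shape from the intro: a container with a model, system prompt, tools, and a structured result type. This is written against the early-2025 API discussed in this episode, and the model string and question are just illustrative, so treat it as a sketch rather than a stable reference:

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class CityInfo(BaseModel):
    city: str
    country: str

# An Agent bundles a model, a system prompt, tools, and a structured result type.
agent = Agent(
    "openai:gpt-4o",
    result_type=CityInfo,
    system_prompt="Answer with a single city.",
)

result = agent.run_sync("Which city hosted the 2012 Summer Olympics?")
print(result.data)  # validated CityInfo instance, not free text
```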
Sponsorships and tickets for the AI Engineer Summit are selling fast! See the new website with speakers and schedules live! If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you, this Feb 20-22nd in NYC. We’re pleased to share that Karina will be presenting OpenAI’s closing keynote at the AI Engineer Summit. We were fortunate to get some time with her today to introduce some of her work, and hope this serves as nice background for her talk! There are very few early AI careers that have been as impactful as Karina Nguyen’s. After stints at Notion, Square, Dropbox, Primer, the New York Times, and UC Berkeley, She joined Anthropic as employee ~60 and worked on a wide range of research/product roles for Claude 1, 2, and 3. We’ll just let her LinkedIn speak for itself: Now, as Research manager and Post-training lead in Model Behavior at OpenAI, she creates new interaction paradigms for reasoning interfaces and capabilities, like ChatGPT Canvas, Tasks, SimpleQA, streaming chain-of-thought for o1 models, and more via novel synthetic model training. Ideal AI Research+Product Process In the podcast we got a sense of what Karina has found works for her and her team to be as productive as they have been: * Write PRD (Define what you want) * Funding (Get resources) * Prototype Prompted Baseline (See what’s possible) * Write and Run Evals (Get failures to hillclimb) * Model training (Exceed baseline without overfitting) * Bugbash (Find bugs and solve them) * Ship (Get users!) We could turn this into a snazzy viral graphic but really this is all it is. Simple to say, difficult to do well. Hopefully it helps you define your process if you do similar product-research work. Show Notes * Our Reasoning Price War post * Karina LinkedIn, Website, Twitter * OSINT visualization work * Ukraine 3D storytelling * Karina on Claude Artifacts * Karina on Claude 3 Benchmarks * Inspiration for Artifacts / Canvas from early UX work she did on GPT-3 * “i really believe that things like canvas and tasks should and could have happened like 2 yrs ago, idk why we are lagging in the form factors” (tweet) * Our article on prompting o1 vs Karina’s Claude prompting principles * Canvas: https://openai.com/index/introducing-canvas/ * We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, make targeted edits, and fully rewrite. It also understands broader context to provide precise feedback and suggestions. To support this, our research team developed the following core behaviors: * Triggering the canvas for writing and coding * Generating diverse content types * Making targeted edits * Rewriting documents * Providing inline critique We measured progress with over 20 automated internal evaluations. We used novel synthetic data generation techniques, such as distilling outputs from OpenAI o1-preview, to post-train the model for its core behaviors. This approach allowed us to rapidly address writing quality and new user interactions, all without relying on human-generated data. * Tasks: https://www.theverge.com/2025/1/14/24343528/openai-chatgpt-repeating-tasks-agent-ai * * Agents and Operator * What are agents? “Agents are a gradual progression of tasks: starting with one-off actions, moving to collaboration, and ultimately fully trustworthy long-horizon delegation in complex envs like multi-player/multiagents.” (tweet) * tasks and canvas fall within the first two, and we are def. 
marching towards the third—though the form factor for 3 will take time to develop * Operator/Computer Use Agents * https://openai.com/index/introducing-operator/ * Misc: * Andrew Ng * Prediction: Personal AI Consumer playbook * ChatGPT as generative OS Timestamps * 00:00 Welcome to the Latent Space Podcast * 00:11 Introducing Karina Nguyen * 02:21 Karina's Journey to OpenAI * 04:45 Early Prototypes and Projects * 05:25 Joining Anthropic and Early Work * 07:16 Challenges and Innovations at Anthropic * 11:30 Launching Claude 3 * 21:57 Behavioral Design and Model Personality * 27:37 The Making of ChatGPT Canvas * 34:34 Canvas Update and Initial Impressions * 34:46 Differences Between Canvas and API Outputs * 35:50 Core Use Cases of Canvas * 36:35 Canvas as a Writing Partner * 36:55 Canvas vs. Google Docs and Future Improvements * 37:35 Canvas for Coding and Executing Code * 38:50 Challenges in Developing Canvas * 41:45 Introduction to Tasks * 41:53 Developing and Iterating on Tasks * 46:27 Future Vision for Tasks and Proactive Models * 52:23 Computer Use Agents and Their Potential * 01:00:21 Cultural Differences Between OpenAI and Anthropic * 01:03:46 Call to Action and Final Thoughts Transcript Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my usual co-host, Swyx. swyx [00:00:11]: Hey, and today we're very, very blessed to have Karina Nguyen in the studio. Welcome. Karina [00:00:15]: Nice to meet you. swyx [00:00:16]: We finally made it happen. We finally made it happen. First time we tried this, you were working at a different company, and now we're here. Fortunately, you had some time, so thank you so much for joining us. Yeah, thank you for inviting me. Karina, your website says you lead a research team in OpenAI, creating new interaction paradigms for reasoning interfaces and capabilities like ChatGPT Canvas, and most recently, ChatGPT TAS. I don't know, is that what we're calling it? Streaming chain of thought for O1 models and more via novel synthetic model training. What is this research team? Karina [00:00:45]: Yeah, I need to clarify this a little bit more. I think it changed a lot since the last time we launched. So we launched Canvas, and it was the first project. I was a tech lead, basically, and then I think over time I was trying to refine what my team is, and I feel like it's at the intersection of human-computer interaction, defining what the next interaction paradigms might look like with some of the most recent reasoning models, as well as actually trying to come up with novel methods, how to improve those models for certain tasks if you want to. So for Canvas, for example, one of the most common use cases is basically writing and coding. And we're continually working on, okay, how do we make Canvas coding to go beyond what is possible right now? And that requires us to actually do our own training and coming up with new methods of synthetic data generation. The way I'm thinking about it is that my team is going from very full stack, from training models all the way up to deployment and making sure that we create novel product features that is coherent to what you're doing. So we're really working on that. swyx [00:02:08]: So it's, it's a lot of work to do right now. And I think that's why I think it's such a great opportunity. You know, how could something this big work in like an industrial space and in the things that we're doing, you know, it's a really exciting time for us. 
And it's just, you know, it's a lot of work, but what I really like about working in digital space is the, you know, the visual space is always the best place to stay. It's not just the skill sets that need to be done. Alessio [00:02:17]: Like we have, like, a lot of things to be done, but like, we've got a lot of different, you know, things to come up with. I know you have some early UX prototypes with GPT-3 as well, and kind of like maybe how that has informed the way you build products. Karina [00:02:32]: I think my background was mostly like working on computer vision applications for like investigative journalism. Back when I was at school at Berkeley, I was working a lot with the Human Rights Center and like investigative journalists from various media. And that's how I learned more about like AI, like with vision transformers. And at that time, I was working with some of the professors at Berkeley AI Research. swyx [00:03:00]: There are some Pulitzer Prize winning professors, right, that teach there? Karina [00:03:04]: No, so it was mostly like reporting for teams like the New York Times, like the AP, Associated Press. So it was like all in the context of the Human Rights Center. Got it. Yeah. So that was like in computer vision. And then I saw Chris Olah's work around, you know, like interpretability from Google. And that's how I found out about like Anthropic. And at that time, I think it was like the year when like the Ukraine war happened. And I was like trying to find a full-time job. And it kind of like all got distracted. It was like kind of like spring. And I was like very focused on like figuring out like what to do. And then my best option at that time was just to like continue my internship at the New York Times and convert to like full-time. At the New York Times, it was just like working on mostly like product engineering work around like R&D prototypes, kind of like storytelling features on the mobile experience. And like at that time, we were like thinking about like how do we employ like NLP techniques to like scrape some of the archives from the New York Times or something. But then I always wanted to like get into like AI. And like I knew OpenAI for a while. So I kind of like applied to Anthropic just on the website. And I was rejected the first time. But then at that time, they were not hiring for like anything like product engineering or front-end engineering, which was something I was like interested in at that time. And then there was like a new opening at Anthropic that was like kind of like a front-end engineer role. And so I applied. And that's how my journey began. But like the earlier prototypes was mostly like I used like CLIP. swyx [00:05:13]: We'l
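Karina's process above leans hard on step 4, write and run evals, and the Canvas launch notes mention over 20 automated internal evaluations. Here is a minimal sketch of what that prompted-baseline-then-evals loop can look like; call_model and the two cases are hypothetical stand-ins, not OpenAI's internal harness:

```python
# Minimal sketch of the "prototype prompted baseline -> write and run evals ->
# hillclimb on failures" loop from the process above. Everything here is a
# hypothetical stand-in for illustration.

def call_model(prompt: str) -> str:
    return "stub completion"  # swap in a real chat-completion client

EVAL_CASES = [
    # (prompt, check): each check encodes a target behavior, in the spirit of
    # "trigger the canvas for writing" or "make a targeted edit, not a rewrite".
    ("Draft a blog post about pelicans", lambda out: "pelican" in out.lower()),
    ("Fix only the typo in: 'teh cat'", lambda out: "the cat" in out),
]

def run_evals() -> float:
    failures = []
    for prompt, check in EVAL_CASES:
        out = call_model(prompt)
        if not check(out):
            failures.append(prompt)  # failures are what you hillclimb on
    for prompt in failures:
        print("FAIL:", prompt)
    return 1 - len(failures) / len(EVAL_CASES)

print(f"pass rate: {run_evals():.0%}")
```

The pass rate is secondary; the printed failures are the point, since those are the cases you then attack with prompting, synthetic data, or post-training.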
One last Gold sponsor slot is available for the AI Engineer Summit in NYC. Our last round of invites is going out soon - apply here - If you are building AI agents or AI eng teams, this will be the single highest-signal conference of the year for you! While the world melts down over DeepSeek, few are talking about the OTHER notable group of former hedge fund traders who pivoted into AI and built a remarkably profitable consumer AI business with a tiny team with incredibly cracked engineering team — Chai Research. In short order they have: * Started a Chat AI company well before Noam Shazeer started Character AI, and outlasted his departure. * Crossed 1m DAU in 2.5 years - William updates us on the pod that they’ve hit 1.4m DAU now, another +40% from a few months ago. Revenue crossed >$22m. * Launched the Chaiverse model crowdsourcing platform - taking 3-4 week A/B testing cycles down to 3-4 hours, and deploying >100 models a week. While they’re not paying million dollar salaries, you can tell they’re doing pretty well for an 11 person startup: The Chai Recipe: Building infra for rapid evals Remember how the central thesis of LMarena (formerly LMsys) is that the only comprehensive way to evaluate LLMs is to let users try them out and pick winners? At the core of Chai is a mobile app that looks like Character AI, but is actually the largest LLM A/B testing arena in the world, specialized on retaining chat users for Chai’s usecases (therapy, assistant, roleplay, etc). It’s basically what LMArena would be if taken very, very seriously at one company (with $1m in prizes to boot): Chai publishes occasional research on how they think about this, including talks at their Palo Alto office: William expands upon this in today’s podcast (34 mins in): Fundamentally, the way I would describe it is when you're building anything in life, you need to be able to evaluate it. And through evaluation, you can iterate, we can look at benchmarks, and we can say the issues with benchmarks and why they may not generalize as well as one would hope in the challenges of working with them. But something that works incredibly well is getting feedback from humans. And so we built this thing where anyone can submit a model to our developer backend, and it gets put in front of 5000 users, and the users can rate it. And we can then have a really accurate ranking of like which model, or users finding more engaging or more entertaining. And it gets, you know, it's at this point now, where every day we're able to, I mean, we evaluate between 20 and 50 models, LLMs, every single day, right. So even though we've got only got a team of, say, five AI researchers, they're able to iterate a huge quantity of LLMs, right. So our team ships, let's just say minimum 100 LLMs a week is what we're able to iterate through. Now, before that moment in time, we might iterate through three a week, we might, you know, there was a time when even doing like five a month was a challenge, right? By being able to change the feedback loops to the point where it's not, let's launch these three models, let's do an A-B test, let's assign, let's do different cohorts, let's wait 30 days to see what the day 30 retention is, which is the kind of the, if you're doing an app, that's like A-B testing 101 would be, do a 30-day retention test, assign different treatments to different cohorts and come back in 30 days. So that's insanely slow. That's just, it's too slow. 
And so we were able to get that 30-day feedback loop all the way down to something like three hours. In Crowdsourcing the leap to Ten Trillion-Parameter AGI, William describes Chai's routing as a recommender system, which makes a lot more sense to us than previous pitches for model routing startups: William is notably counter-consensus in a lot of his AI product principles: * No streaming: Chats appear all at once to allow rejection sampling * No voice: Chai actually beat Character AI to introducing voice - but removed it after finding that it was far from a killer feature. * Blending: “Something that we love to do at Chai is blending, which is, you know, it's the simplest way to think about it is you're going to end up, and you're going to pretty quickly see you've got one model that's really smart, one model that's really funny. How do you get the user an experience that is both smart and funny? Well, just 50% of the requests, you can serve them the smart model, 50% of the requests, you serve them the funny model.” (that’s it!) But chief above all is the recommender system. We also referenced Exa CEO Will Bryk’s concept of SuperKnowledge: Full Video Version On YouTube. Please like and subscribe! Timestamps * 00:00:04 Introductions and background of William Beauchamp * 00:01:19 Origin story of Chai AI * 00:04:40 Transition from finance to AI * 00:11:36 Initial product development and idea maze for Chai * 00:16:29 User psychology and engagement with AI companions * 00:20:00 Origin of the Chai name * 00:22:01 Comparison with Character AI and funding challenges * 00:25:59 Chai's growth and user numbers * 00:34:53 Key inflection points in Chai's growth * 00:42:10 Multi-modality in AI companions and focus on user-generated content * 00:46:49 Chaiverse developer platform and model evaluation * 00:51:58 Views on AGI and the nature of AI intelligence * 00:57:14 Evaluation methods and human feedback in AI development * 01:02:01 Content creation and user experience in Chai * 01:04:49 Chai Grant program and company culture * 01:07:20 Inference optimization and compute costs * 01:09:37 Rejection sampling and reward models in AI generation * 01:11:48 Closing thoughts and recruitment Transcript Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and today we're in the Chai AI office with my usual co-host, Swyx. swyx [00:00:14]: Hey, thanks for having us. It's rare that we get to get out of the office, so thanks for inviting us to your home. We're in the office of Chai with William Beauchamp. Yeah, that's right. You're the founder of Chai AI, but previously, I think, you were concurrently also running your fund? William [00:00:29]: Yep, so I was simultaneously running an algorithmic trading company, but I fortunately was able to kind of exit from that, I think just in Q3 last year. Yeah, congrats. Yeah, thanks. swyx [00:00:43]: So Chai has always been on my radar because, well, first of all, you do a lot of advertising, I guess, in the Bay Area, so it's working. Yep. And second of all, the reason I reached out to a mutual friend, Joyce, was because I'm just generally interested in the consumer AI space, chat platforms in general. I think there's a lot of inference insights that we can get from that, as well as human psychology insights, kind of a weird blend of the two. And we also share a bit of a history as former finance people crossing over. I guess we can just kind of start it off with the origin story of Chai.
William [00:01:19]: Why decide working on a consumer AI platform rather than B2B SaaS? So just quickly touching on the background in finance. Sure. Originally, I'm from... I'm from the UK, born in London. And I was fortunate enough to go study economics at Cambridge. And I graduated in 2012. And at that time, for everyone in the UK and everyone on my course, HFT, quant trading was really the big thing. It was like the big wave that was happening. So there was a lot of opportunity in that space. And throughout college, I'd sort of played poker. So, you know, I dabbled as a professional poker player. And I was able to accumulate this sort of, you know, say $100,000 through playing poker. And at the time, as my friends would go work at companies like Jane Street or Citadel, I kind of did the maths. And I just thought, well, maybe if I traded my own capital, I'd probably come out ahead. I'd make more money than just going to work at Jane Street. swyx [00:02:20]: With 100k base as capital? William [00:02:22]: Yes, yes. That's not a lot. Well, it depends what strategies you're doing. And, you know, there is an advantage. There's an advantage to being small, right? Because there are... Strategies that don't work in size. Exactly, exactly. So if you have a fund of $10 million, if you find a little anomaly in the market that you might be able to make 100k a year from, that's a 1% return on your 10 million fund. If your fund is 100k, that's a 100% return, right? So being small, in some sense, was an advantage. So I started off, and I taught myself Python, and machine learning was like the big thing as well. Machine learning had really... it was the first time, you know, big-time machine learning was being used for image recognition. Neural networks come out, you get dropout. And, you know, so this was the big thing that was going on at the time. So I probably spent my first three years out of Cambridge just building neural networks, building random forests to try and predict asset prices, right, and then trade that using my own money. And that went well. And, you know, if you start something and it goes well, you try and hire more people. And the first people that came to mind were the talented people I went to college with. And so I hired some friends. And that went well and hired some more. And eventually, I kind of ran out of friends to hire. And so that was when I formed the company. And from that point on, we had our ups and we had our downs. And that was a whole long story and journey in itself. But after doing that for about eight or nine years, on my 30th birthday, which was four years ago now, I kind of took a step back to just evaluate my life, right? This is what one does when one turns 30. You know, I just heard it. I hear you. And, you know, I looked at my 20s and I loved it. It was a really special time. I was really lucky and fortunate to have worked with this amazing team, been successful,
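William's blending principle quoted above is almost literally implementable as per-request random routing, which is part of why it is so striking. A toy sketch, with hypothetical model names and a stub inference call:

```python
# Toy sketch of the "blending" William describes: serve different models on
# different requests so one conversation feels both smart and funny.
# Model names and call_model are hypothetical stand-ins.

import random

BLEND = [("smart-model", 0.5), ("funny-model", 0.5)]  # models and traffic shares

def call_model(model: str, message: str) -> str:
    return f"[{model}] stub reply"  # swap in real inference

def blended_reply(message: str) -> str:
    models, weights = zip(*BLEND)
    model = random.choices(models, weights=weights, k=1)[0]
    return call_model(model, message)

for _ in range(3):
    print(blended_reply("tell me about your day"))
```

Because the sampling happens per message rather than per user, a single chat interleaves both models' styles, which is the whole trick; the crowdsourced ratings described earlier are what tell you which blend weights actually retain users.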
Sponsorships and applications for the AI Engineer Summit in NYC are live! (Speaker CFPs have closed) If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you. Right after Christmas, the Chinese Whale Bros ended 2024 by dropping the last big model launch of the year: DeepSeek v3. Right now on LM Arena, DeepSeek v3 has a score of 1319, right under the full o1 model, Gemini 2, and 4o latest. This makes it the best open weights model in the world in January 2025. There has been a big recent trend in Chinese labs releasing very large open weights models, with TenCent releasing Hunyuan-Large in November and Hailuo releasing MiniMax-Text this week, both over 400B in size. However these extra-large language models are very difficult to serve. Baseten was the first of the Inference neocloud startups to get DeepSeek V3 online, because of their H200 clusters, their close collaboration with the DeepSeek team and early support of SGLang, a relatively new VLLM alternative that is also used at frontier labs like X.ai. Each H200 has 141 GB of VRAM with 4.8 TB per second of bandwidth, meaning that you can use 8 H200's in a node to inference DeepSeek v3 in FP8, taking into account KV Cache needs. We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx, and they supported the very first Latent Space Demo Day in San Francisco, which was effectively the trial run for swyx and Alessio to work together! Since then, Philip Kiely also led a well attended workshop on TensorRT LLM at the 2024 World's Fair. We worked with him to get two of their best representatives, Amir and Lead Model Performance Engineer Yineng Zhang, to discuss DeepSeek, SGLang, and everything they have learned running Mission Critical Inference workloads at scale for some of the largest AI products in the world. The Three Pillars of Mission Critical Inference We initially planned to focus the conversation on SGLang, but Amir and Yineng were quick to correct us that the choice of inference framework is only the simplest, first choice of 3 things you need for production inference at scale: “I think it takes three things, and each of them individually is necessary but not sufficient: * Performance at the model level: how fast are you running this one model running on a single GPU, let's say. The framework that you use there can, can matter. The techniques that you use there can matter. The MLA technique, for example, that Yineng mentioned, or the CUDA kernels that are being used. But there's also techniques being used at a higher level, things like speculative decoding with draft models or with Medusa heads. And these are implemented in the different frameworks, or you can even implement it yourself, but they're not necessarily tied to a single framework. But using speculative decoding gets you massive upside when it comes to being able to handle high throughput. But that's not enough. Invariably, that one model running on a single GPU, let's say, is going to get too much traffic that it cannot handle. * Horizontal scaling at the cluster/region level: And at that point, you need to horizontally scale it. That's not an ML problem. That's not a PyTorch problem. That's an infrastructure problem. How quickly do you go from, a single replica of that model to 5, to 10, to 100. And so that's the second, that's the second pillar that is necessary for running these machine critical inference workloads. And what does it take to do that? 
It takes, some people are like, Oh, You just need Kubernetes and Kubernetes has an autoscaler and that just works. That doesn't work for, for these kinds of mission critical inference workloads. And you end up catching yourself wanting to bit by bit to rebuild those infrastructure pieces from scratch. This has been our experience. * And then going even a layer beyond that, Kubernetes runs in a single. cluster. It's a single cluster. It's a single region tied to a single region. And when it comes to inference workloads and needing GPUs more and more, you know, we're seeing this that you cannot meet the demand inside of a single region. A single cloud's a single region. In other words, a single model might want to horizontally scale up to 200 replicas, each of which is, let's say, 2H100s or 4H100s or even a full node, you run into limits of the capacity inside of that one region. And what we had to build to get around that was the ability to have a single model have replicas across different regions. So, you know, there are models on Baseten today that have 50 replicas in GCP East and, 80 replicas in AWS West and Oracle in London, etc. * Developer experience for Compound AI Systems: The final one is wrapping the power of the first two pillars in a very good developer experience to be able to afford certain workflows like the ones that I mentioned, around multi step, multi model inference workloads, because more and more we're seeing that the market is moving towards those that the needs are generally in these sort of more complex workflows. We think they said it very well. Show Notes * Amir Haghighat, Co-Founder, Baseten * Yineng Zhang, Lead Software Engineer, Model Performance, Baseten Full YouTube Episode Please like and subscribe! Timestamps * 00:00 Introduction and Latest AI Model Launch * 00:11 DeepSeek v3: Specifications and Achievements * 03:10 Latent Space Podcast: Special Guests Introduction * 04:12 DeepSeek v3: Technical Insights * 11:14 Quantization and Model Performance * 16:19 MOE Models: Trends and Challenges * 18:53 Baseten's Inference Service and Pricing * 31:13 Optimization for DeepSeek * 31:45 Three Pillars of Mission Critical Inference Workloads * 32:39 Scaling Beyond Single GPU * 33:09 Challenges with Kubernetes and Infrastructure * 33:40 Multi-Region Scaling Solutions * 35:34 SG Lang: A New Framework * 38:52 Key Techniques Behind SG Lang * 48:27 Speculative Decoding and Performance * 49:54 Future of Fine-Tuning and RLHF * 01:00:28 Baseten's V3 and Industry Trends Baseten’s previous TensorRT LLM workshop: This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
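The H200 sizing claim in the intro is easy to sanity-check with back-of-envelope arithmetic, assuming DeepSeek v3's 671B total parameters at one byte per weight in FP8 (real deployments also need activation memory, so this is only a rough check, not a capacity plan):

```python
# Back-of-envelope check: DeepSeek v3 in FP8 on one 8xH200 node.
# Assumes 671B total parameters (MoE, all experts resident) at 1 byte/weight.

params = 671e9            # DeepSeek v3 total parameter count
fp8_bytes = 1.0           # one byte per FP8 weight
weights_gb = params * fp8_bytes / 1e9          # ~671 GB of weights

node_vram_gb = 8 * 141    # 8 H200s x 141 GB each = 1128 GB
kv_headroom_gb = node_vram_gb - weights_gb     # ~457 GB left for KV cache etc.

print(f"weights: {weights_gb:.0f} GB, node VRAM: {node_vram_gb} GB, "
      f"KV-cache headroom: {kv_headroom_gb:.0f} GB")
```

The weights alone would overflow a node of H100s (8 x 80 GB = 640 GB), which is why the 141 GB H200s, plus the KV cache headroom they leave, were the practical unlock for serving it.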
Due to overwhelming demand (>15x applications:slots), we are closing CFPs for AI Engineer Summit NYC today. Last call! Thanks, we’ll be reaching out to all shortly! The world’s top AI blogger and friend of every pod, Simon Willison, dropped a monster 2024 recap: Things we learned about LLMs in 2024. Brian of the excellent TechMeme Ride Home pinged us for a connection and a special crossover episode, our first in 2025. The target audience for this podcast is a tech-literate, but non-technical one. You can see Simon’s notes for AI Engineers in his World’s Fair Keynote. Timestamps * 00:00 Introduction and Guest Welcome * 01:06 State of AI in 2025 * 01:43 Advancements in AI Models * 03:59 Cost Efficiency in AI * 06:16 Challenges and Competition in AI * 17:15 AI Agents and Their Limitations * 26:12 Multimodal AI and Future Prospects * 35:29 Exploring Video Avatar Companies * 36:24 AI Influencers and Their Future * 37:12 Simplifying Content Creation with AI * 38:30 The Importance of Credibility in AI * 41:36 The Future of LLM User Interfaces * 48:58 Local LLMs: A Growing Interest * 01:07:22 AI Wearables: The Next Big Thing * 01:10:16 Wrapping Up and Final Thoughts Transcript [00:00:00] Introduction and Guest Welcome [00:00:00] Brian: Welcome to the first bonus episode of the Techmeme Ride Home for the year 2025. I'm your host as always, Brian McCullough. Listeners to the pod over the last year know that I have made a habit of quoting from Simon Willison when new stuff happens in AI from his blog. Simon has become a go-to for many folks in terms of, you know, analyzing things, criticizing things in the AI space. [00:00:33] Brian: I've wanted to talk to you for a long time, Simon. So thank you for coming on the show. No, it's a privilege to be here. And the person that made this connection happen is our friend Swyx, who has been on the show, even going back to the Twitter Spaces days, but who is also an AI guru in their own right. Swyx, thanks for coming on the show also. [00:00:54] swyx (2): Thanks. I'm happy to be on and have been a regular listener, so just happy to [00:01:00] contribute as well. [00:01:00] Brian: And a good friend of the pod, as they say. Alright, let's go right into it. [00:01:06] State of AI in 2025 [00:01:06] Brian: Simon, I'm going to do the most unfair, broad question first, so let's get it out of the way. The year 2025. Broadly, what is the state of AI as we begin this year? [00:01:20] Brian: Whatever you want to say, I don't want to lead the witness. [00:01:22] Simon: Wow. So many things, right? I mean, the big thing is everything's got really good and fast and cheap. Like, that was the trend throughout all of 2024. The good models got so much cheaper, they got so much faster, they got multimodal, right? The image stuff isn't even a surprise anymore. [00:01:39] Simon: They're growing video, all of that kind of stuff. So that's all really exciting. [00:01:43] Advancements in AI Models [00:01:43] Simon: At the same time, they didn't get massively better than GPT-4, which was a bit of a surprise. So that's sort of one of the open questions: are we going to see huge... but I kind of feel like that's a bit of a distraction, because GPT-4, but way cheaper, much larger context lengths, and it [00:02:00] can do multimodal, [00:02:01] Simon: is better, right? That's a better model, even if it's not. [00:02:05] Brian: What people were expecting or hoping, maybe not expecting is not the right word, but hoping that we would see another step change, right? Right.
From like GPT 2 to 3 to 4, we were expecting or hoping that maybe we were going to see the next evolution in that sort of, yeah. [00:02:21] Brian: We [00:02:21] Simon: did see that, but not in the way we expected. We thought the model was just going to get smarter, and instead we got. Massive drops in, drops in price. We got all of these new capabilities. You can talk to the things now, right? They can do simulated audio input, all of that kind of stuff. And so it's kind of, it's interesting to me that the models improved in all of these ways we weren't necessarily expecting. [00:02:43] Simon: I didn't know it would be able to do an impersonation of Santa Claus, like a, you know, Talked to it through my phone and show it what I was seeing by the end of 2024. But yeah, we didn't get that GPT 5 step. And that's one of the big open questions is, is that actually just around the corner and we'll have a bunch of GPT 5 class models drop in the [00:03:00] next few months? [00:03:00] Simon: Or is there a limit? [00:03:03] Brian: If you were a betting man and wanted to put money on it, do you expect to see a phase change, step change in 2025? [00:03:11] Simon: I don't particularly for that, like, the models, but smarter. I think all of the trends we're seeing right now are going to keep on going, especially the inference time compute, right? [00:03:21] Simon: The trick that O1 and O3 are doing, which means that you can solve harder problems, but they cost more and it churns away for longer. I think that's going to happen because that's already proven to work. I don't know. I don't know. Maybe there will be a step change to a GPT 5 level, but honestly, I'd be completely happy if we got what we've got right now. [00:03:41] Simon: But cheaper and faster and more capabilities and longer contexts and so forth. That would be thrilling to me. [00:03:46] Brian: Digging into what you've just said one of the things that, by the way, I hope to link in the show notes to Simon's year end post about what, what things we learned about LLMs in 2024. Look for that in the show notes. [00:03:59] Cost Efficiency in AI [00:03:59] Brian: One of the things that you [00:04:00] did say that you alluded to even right there was that in the last year, you felt like the GPT 4 barrier was broken, like IE. Other models, even open source ones are now regularly matching sort of the state of the art. [00:04:13] Simon: Well, it's interesting, right? So the GPT 4 barrier was a year ago, the best available model was OpenAI's GPT 4 and nobody else had even come close to it. [00:04:22] Simon: And they'd been at the, in the lead for like nine months, right? That thing came out in what, February, March of, of 2023. And for the rest of 2023, nobody else came close. And so at the start of last year, like a year ago, the big question was, Why has nobody beaten them yet? Like, what do they know that the rest of the industry doesn't know? [00:04:40] Simon: And today, that I've counted 18 organizations other than GPT 4 who've put out a model which clearly beats that GPT 4 from a year ago thing. Like, maybe they're not better than GPT 4. 0, but that's, that, that, that barrier got completely smashed. And yeah, a few of those I've run on my laptop, which is wild to me. [00:04:59] Simon: Like, [00:05:00] it was very, very wild. It felt very clear to me a year ago that if you want GPT 4, you need a rack of 40, 000 GPUs just to run the thing. And that turned out not to be true. 
Like the, the, this is that big trend from last year of the models getting more efficient, cheaper to run, just as capable with smaller weights and so forth. [00:05:20] Simon: And I ran another GPT 4 model on my laptop this morning, right? Microsoft's Phi-4 just came out. And that, if you look at the benchmarks, it's definitely, it's up there with GPT-4o. It's probably not as good when you actually get into the vibes of the thing, but it, it runs on my, it's a 14 gigabyte download and I can run it on a MacBook Pro. [00:05:38] Simon: Like who saw that coming? The most exciting, like the close of the year on Christmas day, just a few weeks ago, was when DeepSeek dropped their DeepSeek v3 model on Hugging Face without even a readme file. It was just like a giant binary blob that I can't run on my laptop. It's too big. But in all of the benchmarks, it's now by far the best available [00:06:00] open, open weights model. [00:06:01] Simon: Like it's, it's, it's beating the, the Meta Llamas and so forth. And that was trained for five and a half million dollars, which is a tenth of the price that people thought it costs to train these things. So everything's trending smaller and faster and more efficient. [00:06:15] Brian: Well, okay. [00:06:16] Challenges and Competition in AI [00:06:16] Brian: I, I kind of was going to get to that later, but let's, let's combine this with what I was going to ask you next, which is, you know, you're talking, you know, also in the piece about the LLM prices crashing, which I've even seen in projects that I'm working on, but explain, explain that to a general audience, because we hear all the time that LLMs are eye wateringly expensive to run, but what we're suggesting, and we'll come back to the cheap Chinese LLM, but first of all, for the end user, what you're suggesting is that we're starting to see the cost come down sort of in the traditional technology way of, of costs coming down over time, [00:06:49] Simon: yes, but very aggressively. [00:06:51] Simon: I mean, my favorite thing, the example here is if you look at GPT-3, so OpenAI's GPT-3, which was the best available model in [00:07:00] 2022 and through most of 2023. That, the models that we have today, the OpenAI models are a hundred times cheaper. So there was a 100x drop in price for OpenAI from their best available model, like two and a half years ago to today. [00:07:13] Simon: And [00:07:14] Brian: just to be clear, not to train the model, but for the use of tokens and things. Exactly, [00:07:20] Simon: for running prompts through them. And then when you look at the, the really, the top tier model providers right now, I think, are OpenAI, Anthropic, Google, and Meta. And there are a bunch of others that I could list there as well. [00:07:32] Simon: Mistral are very good. The, the DeepSeek and Qwen models have got great. There's a whole bunch of providers serving really good models. But even if you just look a
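Simon's "GPT-4-class model on a MacBook Pro" point is easy to try yourself. Below is a hedged sketch using llama-cpp-python; the Hugging Face repo id and quant filename are assumptions, so substitute whichever Phi-4 GGUF build you actually find:

```python
# Hedged sketch: chat with a local Phi-4-class model on a laptop.
# Requires `pip install llama-cpp-python`; repo_id/filename are assumptions,
# so point them at any Phi-4 GGUF build on Hugging Face.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="microsoft/phi-4-gguf",   # hypothetical repo id, check Hugging Face
    filename="*q4_k_m.gguf",          # a ~9GB 4-bit quant fits in laptop RAM
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one line: what was the GPT-4 barrier?"}],
)
print(out["choices"][0]["message"]["content"])
```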
Applications close Monday for the NYC AI Engineer Summit focusing on AI Leadership and Agent Engineering! If you applied, invites should be rolling out shortly. The search landscape is experiencing a fundamental shift. Google built a >$2T company with the “10 blue links” experience, driven by PageRank as the core innovation for ranking. This was a big improvement from the previous directory-based experiences of AltaVista and Yahoo. Almost three decades later, Google is now stuck in this links-based experience, especially from a business model perspective. This legacy architecture creates fundamental constraints: * Must return results in ~400 milliseconds * Required to maintain comprehensive web coverage * Tied to keyword-based matching algorithms * Cost structures optimized for traditional indexing As we move from the era of links to the era of answers, the way search works is changing. You’re not showing a user links, but the goal is to provide context to an LLM. This means moving from keyword-based search to a more semantic understanding of the content: The link prediction objective can be seen as like a neural PageRank because what you're doing is you're predicting the links people share... but it's more powerful than PageRank. It's strictly more powerful because people might refer to that Paul Graham fundraising essay in like a thousand different ways. And so our model learns all the different ways. All of this is now powered by a $5M cluster with 144 H200s. This architectural choice enables entirely new search capabilities: * Comprehensive result sets instead of approximations * Deep semantic understanding of queries * Ability to process complex, natural language requests As search becomes more complex, time to results becomes a variable: People think of searches as like, oh, it takes 500 milliseconds because we've been conditioned... But what if searches can take like a minute or 10 minutes or a whole day, what can you then do? Unlike traditional search engines’ fixed-cost indexing, Exa employs a hybrid approach: * Front-loaded compute for indexing and embeddings * Variable inference costs based on query complexity * Mix of owned infrastructure ($5M H200 cluster) and cloud resources Exa sees a lot of competition from products like Perplexity and ChatGPT Search, which layer AI on top of traditional search backends, but Exa is betting that true innovation requires rethinking search from the ground up. For example, the recently launched Websets, a way to turn searches into structured output in grid format, allowing you to create lists and databases out of web pages. The company raised a $17M Series A to build towards this mission, so keep an eye out for them in 2025. Chapters * 00:00:00 Introductions * 00:01:12 ExaAI's initial pitch and concept * 00:02:33 Will's background at SpaceX and Zoox * 00:03:45 Evolution of ExaAI (formerly Metaphor Systems) * 00:05:38 Exa's link prediction technology * 00:09:20 Meaning of the name "Exa" * 00:10:36 ExaAI's new product launch and capabilities * 00:13:33 Compute budgets and variable compute products * 00:14:43 Websets as a B2B offering * 00:19:28 How do you build a search engine? * 00:22:43 What is Neural PageRank? * 00:27:58 Exa use cases * 00:35:00 Auto-prompting * 00:38:42 Building agentic search * 00:44:19 Is o1 on the path to AGI? * 00:49:59 Company culture and nap pods * 00:54:52 Economics of AI search and the future of search technology Full YouTube Transcript Please like and subscribe! 
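To make the "neural PageRank" intuition concrete, here is a toy sketch of embedding-based retrieval (emphatically not Exa's actual stack; the encoder is just a small open model chosen for illustration), showing how many phrasings of one intent can land on the same document where keyword matching would miss:

```python
# Toy sketch of semantic retrieval (not Exa's system): embed docs and queries
# into one vector space, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open encoder, for illustration
docs = [
    "How to Raise Money - Paul Graham's essay on startup fundraising",
    "Summer collection: striped shirts in five colors",
    "Plain solid-color tees, no stripes or patterns",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# A thousand ways to ask for the same thing should land on the same doc.
for query in ["that pg fundraising essay", "how do startups raise a round"]:
    q_vec = model.encode(query, normalize_embeddings=True)
    best = int(np.argmax(doc_vecs @ q_vec))  # cosine similarity (vectors normalized)
    print(f"{query!r} -> {docs[best]}")
```

Exa's link-prediction training is a different and much larger-scale objective, but the retrieval-by-similarity mechanics have the same shape.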
Show Notes * ExaAI * Web Search Product * Websets * Series A Announcement * Exa Nap Pods * Perplexity AI * Character.AI Transcript Alessio [00:00:00]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:10]: Hey, and today we're in the studio with my good friend and former landlord, Will Bryk. Roommate. How you doing? Will, you're now CEO co-founder of ExaAI, used to be Metaphor Systems. What's your background, your story? Will [00:00:30]: Yeah, sure. So, yeah, I'm CEO of Exa. I've been doing it for three years. I guess I've always been interested in search, whether I knew it or not. Like, since I was a kid, I've always been interested in, like, high-quality information. And, like, you know, even in high school, wanted to improve the way we get information from news. And then in college, built a mini search engine. And then with Exa, like, you know, it's kind of like fulfilling the dream of actually being able to solve all the information needs I wanted as a kid. Yeah, I guess. I would say my entire life has kind of been rotating around this problem, which is pretty cool. Yeah. Swyx [00:00:50]: What'd you enter YC with? Will [00:00:53]: We entered YC with, uh, we are better than Google. Like, Google 2.0. Swyx [00:01:12]: What makes you say that? Like, that's so audacious to come out of the box with. Will [00:01:16]: Yeah, okay, so you have to remember the time. This was summer 2021. And, uh, GPT-3 had come out. Like, here was this magical thing that you could talk to, you could enter a whole paragraph, and it understands what you mean, understands the subtlety of your language. And then there was Google. Uh, which felt like it hadn't changed in a decade, uh, because it really hadn't. And it, like, you would give it a simple query, like, I don't know, uh, shirts without stripes, and it would give you a bunch of results for the shirts with stripes. And so, like, Google could barely understand you, and GPT-3 could. And the theory was, what if you could make a search engine that actually understood you? What if you could apply the insights from LLMs to a search engine? And it's really been the same idea ever since. And we're actually a lot closer now, uh, to doing that. Yeah. Alessio [00:01:55]: Did you have any trouble making people believe? Obviously, there's the same element. I mean, YC overlap, was YC pretty AI forward, even 2021, or? Will [00:02:03]: It's nothing like it is today. But, um, uh, there were a few AI companies, but, uh, we were definitely, like, bold. And I think people, VCs generally like boldness, and we definitely had some AI background, and we had a working demo. So there was evidence that we could build something that was going to work. But yeah, I think, like, the fundamentals were there. I think people at the time were talking about how, you know, Google was failing in a lot of ways. And so there was a bit of conversation about it, but AI was not a big, big thing at the time. Yeah. Yeah. Alessio [00:02:33]: Before we jump into Exa, any fun background stories? I know you interned at SpaceX, any Elon, uh, stories? I know you were at Zoox as well, you know, kind of like robotics at Harvard. Any stuff that you saw early that you thought was going to get solved that maybe it's not solved today? Will [00:02:48]: Oh yeah. I mean, lots of things like that. 
Like, uh, I never really learned how to drive because I believed Elon that self-driving cars would happen. It did happen. And I take them every night to get home. But it took like 10 more years than I thought. Do you still not know how to drive? I know how to drive now. I learned it like two years ago. That would have been great to like, just, you know, Yeah, yeah, yeah. You know? Um, I was obsessed with Elon. Yeah. I mean, I worked at SpaceX because I really just wanted to work at one of his companies. And I remember they had a rule, like interns cannot touch Elon. And, um, that rule actually influenced my actions. Swyx [00:03:18]: Is it, can Elon touch interns? Ooh, like physically? Will [00:03:22]: Or like talk? Physically, physically, yeah, yeah, yeah, yeah. Okay, interesting. He's changed a lot, but, um, I mean, his companies are amazing. Um, Swyx [00:03:28]: What if you beat him at Diablo 2, Diablo 4, you know, like, Ah, maybe. Alessio [00:03:34]: I want to jump into, I know there's a lot of backstory, you used to be called Metaphor Systems. So, um, and it, you've always been kind of like a prominent company, maybe at least in AI circles in SF. Swyx [00:03:45]: I'm actually curious how Metaphor got its initial aura. You launched with like, very little. We launched very little. Like there was, there was this like big splash image of like, this is Aurora or something. Yeah. Right. And then I was like, okay, what is this thing, like the vibes are good, but I don't know what it is. And I think, I think it was much more sort of maybe consumer facing than what you are today. Would you say that's true? Will [00:04:06]: No, it's always been about building a better search algorithm, like search, like, just like the vision has always been perfect search. And if you do that, uh, we will figure out the downstream use cases later. It started on this fundamental belief that you could have perfect search over the web and we could talk about what that means. And like the initial thing we released was really just like our first search engine, like trying to get it out there. Kind of like, you know, an open source. So when OpenAI released, uh, ChatGPT, like they didn't, I don't know how, how much of a game plan they had. They kind of just wanted to get something out there. Swyx [00:04:33]: Low-key research preview. Will [00:04:34]: Yeah, exactly. And it kind of morphed from a research company to a product company at that point. And I think similarly for us, like we were research, we started as a research endeavor with a, you know, clear eyes that like, if we succeed, it will be a massive business to make out of it. And that's kind of basically what happened. I think there are actually a lot of parallels between Exa and OpenAI. I often say we're the OpenAI of search. Um, because. Because we're a research company, we're a research startup that does like fundamental research into, uh, making like AGI for search in a, in a way. Uh, and then we have all these like, uh, business products that come out of
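Since the episode frames search as context for an LLM, it is worth showing what that looks like from the developer side. A minimal sketch using Exa's Python client, exa-py; the method names follow its public docs, and the query is purely illustrative:

```python
# Minimal sketch: fetch LLM-ready context from Exa's search API.
# Assumes `pip install exa-py` and an API key; per exa-py's public docs.
from exa_py import Exa

exa = Exa("YOUR_EXA_API_KEY")
results = exa.search_and_contents(
    "that Paul Graham essay about how to raise money",
    num_results=3,
    text=True,  # also return page text, ready to drop into an LLM prompt
)
context = "\n\n".join(r.text for r in results.results)
print(context[:500])
```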
Applications for the NYC AI Engineer Summit, focused on Agents at Work, are open! When we first started Latent Space, in the lightning round we’d always ask guests: “What’s your favorite AI product?”. The majority would say Midjourney. The simple UI of prompt → very aesthetic image turned it into a $300M+ ARR bootstrapped business as it rode the first wave of AI image generation. In open source land, StableDiffusion was congregating around AUTOMATIC1111 as the de-facto web UI. Unlike Midjourney, which offered some flags but was mostly prompt-driven, A1111 let users play with a lot more parameters, supported additional modalities like img2img, and allowed users to load in custom models. If you’re interested in some of the SD history, you can look at our episodes with Lexica, Replicate, and Playground. One of the people involved with that community was comfyanonymous, who was also part of the Stability team in 2023 and decided to build an alternative called ComfyUI. It is now one of the fastest growing open source projects in generative images, and the preferred partner for folks like Black Forest Labs, supporting their Flux Tools on Day 1. The idea behind it was simple: “Everyone is trying to make easy to use interfaces. Let me try to make a powerful interface that's not easy to use.” Unlike its predecessors, ComfyUI does not have an input text box. Everything is based around the idea of a node: there’s a text input node, a CLIP node, a checkpoint loader node, a KSampler node, a VAE node, etc. While daunting for simple image generation, the tool is amazing for more complex workflows since you can break down every step of the process, and then chain many of them together rather than manually switching between tools. You can also re-start execution halfway instead of from the beginning, which can save a lot of time when using larger models. To give you an idea of some of the new use cases that this type of UI enables: * Sketch something → Generate an image with SD from sketch → feed it into SD Video to animate * Generate an image of an object → Turn into a 3D asset → Feed into interactive experiences * Input audio → Generate audio-reactive videos Their Examples page also includes some of the more common use cases like AnimateDiff, etc. They recently launched the Comfy Registry, an online library of different nodes that users can pull from rather than having to build everything from scratch. The project has >60,000 GitHub stars, and as the community grows, some of the projects that people build have gotten quite complex: The most interesting thing about Comfy is that it’s not a UI, it’s a runtime. You can build full applications on top of image models simply by using Comfy. You can expose Comfy workflows as an endpoint and chain them together just like you chain a single node. We’re seeing the rise of AI Engineering applied to art. Major Tom’s ComfyUI Resources from the Latent Space Discord Major shoutouts to Major Tom on the LS Discord, who is an image generation expert, who offered these pointers: * “best thing about comfy is the fact it supports almost immediately every new thing that comes out - unlike A1111 or forge, which still don't support flux cnet for instance. 
It will be perfect tool when conflicting nodes will be resolved” * AP Workflows from Alessandro Perilli are a nice example of an all-in-one train-evaluate-generate system built atop Comfy * ComfyUI YouTubers to learn from: * @sebastiankamph * @NerdyRodent * @OlivioSarikas * @sedetweiler * @pixaroma * ComfyUI Nodes to check out: * https://github.com/kijai/ComfyUI-IC-Light * https://github.com/MrForExample/ComfyUI-3D-Pack * https://github.com/PowerHouseMan/ComfyUI-AdvancedLivePortrait * https://github.com/pydn/ComfyUI-to-Python-Extension * https://github.com/THtianhao/ComfyUI-Portrait-Maker * https://github.com/ssitu/ComfyUI_NestedNodeBuilder * https://github.com/longgui0318/comfyui-magic-clothing * https://github.com/atmaranto/ComfyUI-SaveAsScript * https://github.com/ZHO-ZHO-ZHO/ComfyUI-InstantID * https://github.com/AIFSH/ComfyUI-FishSpeech * https://github.com/coolzilj/ComfyUI-Photopea * https://github.com/lks-ai/anynode * Sarav: https://www.youtube.com/@mickmumpitz/videos (applied stuff) * Sarav: https://www.youtube.com/@latentvision (technical, but infrequent) * look for the ComfyUI node for https://github.com/magic-quill/MagicQuill * “Comfy for Video” resources * Kijai (https://github.com/kijai) pushing out support for Mochi, CogVideoX, AnimateDiff, LivePortrait etc * ComfyUI node support like LTX https://github.com/Lightricks/ComfyUI-LTXVideo , and HunyuanVideo * FloraFauna AI and Krea.ai * Communities: https://www.reddit.com/r/StableDiffusion/, https://www.reddit.com/r/comfyui/ Full YouTube Episode As usual, you can find the full video episode on our YouTube (and don’t forget to like and subscribe!) Timestamps * 00:00:04 Introduction of hosts and anonymous guest * 00:00:35 Origins of Comfy UI and early Stable Diffusion landscape * 00:02:58 Comfy's background and development of high-res fix * 00:05:37 Area conditioning and compositing in image generation * 00:07:20 Discussion on different AI image models (SD, Flux, etc.) * 00:11:10 Closed source model APIs and community discussions on SD versions * 00:14:41 LoRAs and textual inversion in image generation * 00:18:43 Evaluation methods in the Comfy community * 00:20:05 CLIP models and text encoders in image generation * 00:23:05 Prompt weighting and negative prompting * 00:26:22 Comfy UI's unique features and design choices * 00:31:00 Memory management in Comfy UI * 00:33:50 GPU market share and compatibility issues * 00:35:40 Node design and parameter settings in Comfy UI * 00:38:44 Custom nodes and community contributions * 00:41:40 Video generation models and capabilities * 00:44:47 Comfy UI's development timeline and rise to popularity * 00:48:13 Current state of Comfy UI team and future plans * 00:50:11 Discussion on other Comfy startups and potential text generation support Transcript Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Small AI. swyx [00:00:12]: Hey everyone, we are in the Chroma Studio again, but with our first ever anonymous guest, Comfy Anonymous, welcome. Comfy [00:00:19]: Hello. swyx [00:00:21]: I feel like that's your full name, you just go by Comfy, right? Comfy [00:00:24]: Yeah, well, a lot of people just call me Comfy, even when they know my real name. Hey, Comfy. Alessio [00:00:32]: Swyx is the same. You know, not a lot of people call you Shawn. swyx [00:00:35]: Yeah, you have a professional name, right, that people know you by, and then you have a legal name. Yeah, it's fine. 
How do I phrase this? I think people who are in the know, know that Comfy is like the tool for image generation and now other multimodality stuff. I would say that when I first got started with Stable Diffusion, the star of the show was Automatic 1111, right? And I actually looked back at my notes from 2022-ish, like Comfy was already getting started back then, but it was kind of like the up and comer, and your main feature was the flowchart. Can you just kind of rewind to that moment, that year and like, you know, how you looked at the landscape there and decided to start Comfy? Comfy [00:01:10]: Yeah, I discovered Stable Diffusion in 2022, in October 2022. And, well, I kind of started playing around with it. Yes, I, and back then I was using Automatic, which was what everyone was using back then. And so I started with that because I had, it was when I started, I had no idea like how Diffusion works. I didn't know how Diffusion models work, how any of this works, so. swyx [00:01:36]: Oh, yeah. What was your prior background as an engineer? Comfy [00:01:39]: Just a software engineer. Yeah. Boring software engineer. swyx [00:01:44]: But like any, any image stuff, any orchestration, distributed systems, GPUs? Comfy [00:01:49]: No, I was doing basically nothing interesting. CRUD, web development? Yeah, a lot of web development, just, yeah, some basic, maybe some basic like automation stuff. Okay. Just. Yeah, no, like, no big companies or anything. swyx [00:02:08]: Yeah, but like already some interest in automations, probably a lot of Python. Comfy [00:02:12]: Yeah, yeah, of course, Python. But I wasn't actually used to like the Node graph interface before I started Comfy UI. It was just, I just thought it was like, oh, like, what's the best way to represent the Diffusion process in the user interface? And then like, oh, well. Well, like, naturally, oh, this is the best way I've found. And this was like with the Node interface. So how I got started was, yeah, so basically October 2022, just like I hadn't written a line of PyTorch before that. So it's completely new. What happened was I kind of got addicted to generating images. Alessio [00:02:58]: As we all did. Yeah. Comfy [00:03:00]: And then I started. I started experimenting with like the high-res fix in Auto, which, for those that don't know, the high-res fix is just, since the Diffusion models back then could only generate at low resolution, what you would do, you would generate a low-resolution image, then upscale, then refine it again. And that was kind of the hack to generate high-resolution images. I really liked generating. Like higher resolution images. So I was experimenting with that. And so I modified the code a bit. Okay. What happens if I, if I use different samplers on the second pass? I edited the code of Auto. So what happens if I use a different sampler? What happens if I use a different, like a different settings, different number of steps? And because back then the, the high-res fix was very basic, just, so. Yeah. swyx [00:04:05]: Now there's a whole library of just, uh, the upsamplers. Comfy [00:04:08]: I think, I think they added a bunch of, uh, of options to the h
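The two-pass "high-res fix" Comfy describes is easy to sketch with today's diffusers library: generate low-res, upscale, then refine with img2img, optionally swapping the sampler on the second pass, which is exactly the experiment in the transcript. A hedged illustration, not Auto1111's actual code; the model id, sizes, and strength are assumptions:

```python
# Hedged sketch of the two-pass "high-res fix": low-res generation, upscale,
# then an img2img refine pass (here with a different sampler on pass 2).
import torch
from diffusers import (StableDiffusionPipeline, StableDiffusionImg2ImgPipeline,
                       DPMSolverMultistepScheduler)

base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
prompt = "a lighthouse at dusk, detailed oil painting"

low = base(prompt, height=512, width=512).images[0]   # pass 1: native resolution

refiner = StableDiffusionImg2ImgPipeline(**base.components)  # reuse loaded weights
refiner.scheduler = DPMSolverMultistepScheduler.from_config(base.scheduler.config)
high = refiner(prompt=prompt, image=low.resize((1024, 1024)),
               strength=0.4).images[0]                # pass 2: upscale + refine
high.save("highres.png")
```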
Applications for the 2025 AI Engineer Summit are up, and you can save the date for AIE Singapore in April and AIE World’s Fair 2025 in June. Happy new year, and thanks for 100 great episodes! Please let us know what you want to see/hear for the next 100! Full YouTube Episode with Slides/Charts Like and subscribe and hit that bell to get notifs! Timestamps * 00:00 Welcome to the 100th Episode! * 00:19 Reflecting on the Journey * 00:47 AI Engineering: The Rise and Impact * 03:15 Latent Space Live and AI Conferences * 09:44 The Competitive AI Landscape * 21:45 Synthetic Data and Future Trends * 35:53 Creative Writing with AI * 36:12 Legal and Ethical Issues in AI * 38:18 The Data War: GPU Poor vs. GPU Rich * 39:12 The Rise of GPU Ultra Rich * 40:47 Emerging Trends in AI Models * 45:31 The Multi-Modality War * 01:05:31 The Future of AI Benchmarks * 01:13:17 Pionote and Frontier Models * 01:13:47 Niche Models and Base Models * 01:14:30 State Space Models and RWKV * 01:15:48 Inference Race and Price Wars * 01:22:16 Major AI Themes of the Year * 01:22:48 AI Rewind: January to March * 01:26:42 AI Rewind: April to June * 01:33:12 AI Rewind: July to September * 01:34:59 AI Rewind: October to December * 01:39:53 Year-End Reflections and Predictions Transcript [00:00:00] Welcome to the 100th Episode! [00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co host Swyx for the 100th time today. [00:00:12] swyx: Yay, um, and we're so glad that, yeah, you know, everyone has, uh, followed us in this journey. How do you feel about it? 100 episodes. [00:00:19] Alessio: Yeah, I know. [00:00:19] Reflecting on the Journey [00:00:19] Alessio: Almost two years that we've been doing this. We've had four different studios. Uh, we've had a lot of changes. You know, we used to do this lightning round. When we first started that we didn't like, and we tried to change the question. The answer [00:00:32] swyx: was Cursor and Perplexity. [00:00:34] Alessio: Yeah, I love Midjourney. It's like, do you really not like anything else? [00:00:38] Alessio: Like what's, what's the unique thing? And I think, yeah, we, we've also had a lot more research driven content. You know, we had like Tri Dao, we had, you know, Jeremy Howard, we had more folks like that. [00:00:47] AI Engineering: The Rise and Impact [00:00:47] Alessio: I think we want to do more of that too in the new year, like having, uh, some of the Gemini folks, both on the research and the applied side. [00:00:54] Alessio: Yeah, but it's been a ton of fun. I think we both started, I wouldn't say as a joke, we were kind of like, Oh, we [00:01:00] should do a podcast. And I think we kind of caught the right wave, obviously. And I think your Rise of the AI Engineer post just kind of got people somewhere to congregate, and then the AI Engineer Summit. [00:01:11] Alessio: And that's why when I look at our growth chart, it's kind of like a proxy for like the AI engineering industry as a whole, which is almost like, like, even if we don't do that much, we keep growing just because there's so many more AI engineers. So did you expect that growth or did you expect that would take longer for like the AI engineer thing to kind of like become, you know, everybody talks about it today. [00:01:32] swyx: So, the sign of that, that we have won is that Gartner puts it at the top of the hype curve right now. So Gartner has called the peak in AI engineering. I did not expect, um, to what level. 
I knew that I was correct when I called it because I did like two months of work going into that. But I didn't know, you know, how quickly it could happen, and obviously there's a chance that I could be wrong. [00:01:52] swyx: But I think, like, most people have come around to that concept. Hacker News hates it, which is a good sign. But there's enough people that have defined it, you know, GitHub, when [00:02:00] they launched GitHub Models, which is the Hugging Face clone, they put AI engineers in the banner, like, above the fold, like, in big. So I think it's like kind of arrived as a meaningful and useful definition. [00:02:12] swyx: I think people are trying to figure out where the boundaries are. I think that was a lot of the quote unquote drama that happens behind the scenes at the World's Fair in June. Because I think there's a lot of doubt or questions about where ML engineering stops and AI engineering starts. That's a useful debate to be had. [00:02:29] swyx: In some sense, I actually anticipated that as well. So I intentionally did not put a firm definition there, because most of the successful definitions are necessarily underspecified and it's actually useful to have different perspectives and you don't have to specify everything from the outset. [00:02:45] Alessio: Yeah, I was at um, AWS re:Invent and the line to get into like the AI engineering talk, so to speak, which is, you know, applied AI and whatnot, was like, there are like hundreds of people just in line to go in. [00:02:56] Alessio: I think that's kind of what enabled more people, right? Which is what [00:03:00] you kind of talked about. It's like, Hey, look, you don't actually need a PhD, just, yeah, just use the model. And then maybe we'll talk about some of the blind spots that you get as an engineer with the earlier posts that we also had on, on the Substack. [00:03:11] Alessio: But yeah, it's been a heck of a heck of a two years. [00:03:14] swyx: Yeah. [00:03:15] Latent Space Live and AI Conferences [00:03:15] swyx: You know, I was, I was trying to view the conference as like, so NeurIPS is I think like 16, 17,000 people. And the Latent Space Live event that we held there was 950 signups. I think. The AI world, the ML world is still very much research heavy. And that's as it should be, because ML is very much in a research phase. [00:03:34] swyx: But as we move this entire field into production, I think that ratio inverts into becoming more engineering heavy. So at least I think engineering should be on the same level, even if it's never as prestigious, like it'll always be low status because at the end of the day, you're manipulating APIs or whatever. [00:03:51] swyx: But Yeah, wrapping GPTs, but there's going to be an increasing stack and an art to doing these, these things well. And I, you know, I [00:04:00] think that's what we're focusing on for the podcast, the conference and basically everything I do seems to make sense. And I think we'll, we'll talk about the trends here that apply. [00:04:09] swyx: It's, it's just very strange. So, like, there's a mix of, like, keeping on top of research while not being a researcher and then putting that research into production. So, like, people always ask me, like, why are you covering NeurIPS? Like, this is a ML research conference and I'm like, well, yeah, I mean, we're not going to, to like, understand everything or reproduce every single paper, but the stuff that is being found here is going to make it through into production at some point, you hope. 
[00:04:32] swyx: And then actually like when I talk to the researchers, they actually get very excited because they're like, oh, you guys are actually caring about how this goes into production and that's what they really, really want. The measure of success is previously just peer review, right? Getting 7s and 8s on their, um, academic review conferences and stuff like citations is one metric, but money is a better metric. [00:04:51] Alessio: Money is a better metric. Yeah, and there were about 2,200 people on the live stream or something like that. Yeah, yeah. Twenty-two hundred on the live stream. So [00:05:00] I try my best to moderate, but it was a lot spicier in person with Jonathan and, and Dylan. Yeah, than it was in the chat on YouTube. [00:05:06] swyx: I would say that I actually also created [00:05:09] swyx: Latent Space Live in order to address flaws that are perceived in academic conferences. This is not NeurIPS specific, it's ICML, NeurIPS. Basically, it's very sort of oriented towards the PhD student, uh, market, job market, right? Like literally all, basically everyone's there to advertise their research and skills and get jobs. [00:05:28] swyx: And then obviously all the, the companies go there to hire them. And I think that's great for the individual researchers, but for people going there to get info is not great because you have to read between the lines, bring a ton of context in order to understand every single paper. So what is missing is effectively what I ended up doing, which is domain by domain, go through and recap the best of the year. [00:05:48] swyx: Survey the field. And there are, like NeurIPS had a, uh, I think ICML had a like a position paper track, NeurIPS added a benchmarks, uh, datasets track. These are ways in which to address that [00:06:00] issue. Uh, there's always workshops as well. Every, every conference has, you know, a last day of workshops and stuff that provide more of an overview. [00:06:06] swyx: But they're not specifically prompted to do so. And I think really, uh, organizing a conference is just about getting good speakers and giving them the correct prompts. And then they will just go and do that thing and they do a very good job of it. So I think Sarah did a fantastic job with the startups prompt. [00:06:21] swyx: I can't list everybody, but we did best of 2024 in startups, vision, open models, post transformers, synthetic data, small models, and agents. And then the last one was the, uh, and then we also did a quick one on reasoning with Nathan Lambert. And then the last one, obviously, was the debate that people were very hyped about. [00:06:39] swyx: It was very awkward. And I'm really, really thankful for Jonathan Frankle, basically, who stepped up to challenge Dylan. Because Dylan was like, yeah, I'll do it. But he was pro scaling. And I think everyone who is like in AI is pro scaling, right? So you need someb
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all, all our LS supporters who helped fund the gorgeous venue and A/V production! For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver. Our next keynote covers The State of LLM Agents, with the triumphant return of Professor Graham Neubig to the pod (his ICLR episode here!). OpenDevin is now a startup known as AllHands! The renamed OpenHands has done extremely well this year, as they end the year sitting comfortably at number 1 on the hardest SWE-Bench Full leaderboard at 29%, though on the smaller SWE-Bench Verified, they are at 53%, behind Amazon Q, devlo, and OpenAI's self reported o3 results at 71.7%. Many are saying that 2025 is going to be the year of agents, with OpenAI, DeepMind and Anthropic setting their sights on consumer and coding agents, vision based computer-using agents and multi agent systems. There has been so much progress on the practical reliability and applications of agents in all domains, from the huge launch of Cognition AI's Devin this year, to the sleeper hit of Cursor Composer and Codeium's Windsurf Cascade in the IDE arena, to the explosive revenue growth of Stackblitz's Bolt, Lovable, and Vercel's v0, and the unicorn rounds and high profile movements of customer support agents like Sierra (now worth $4 billion) and search agents like Perplexity (now worth $9 billion). We wanted to take a little step back to understand the most notable papers of the year in Agents, and Graham indulged with his list of 8 perennial problems in building agents in 2024. Must-Read Papers for the 8 Problems of Agents * The agent-computer interface: CodeAct: Executable Code Actions Elicit Better LLM Agents (a minimal sketch of the pattern follows after this list). Minimal viable tools: Execution Sandbox, File Editor, Web Browsing * The human-agent interface: Chat UI, GitHub Plugin, Remote runtime, …? * Choosing an LLM: See Evaluation of LLMs as Coding Agents on SWE-Bench at 30x - must understand instructions, tools, code, environment, error recovery * Planning: Single Agent Systems vs Multi Agent (CoAct: A Global-Local Hierarchy for Autonomous Agent Collaboration) - Explicit vs Implicit, Curated vs Generated * Reusable common workflows: SteP: Stacked LLM Policies for Web Actions and Agent Workflow Memory - Manual prompting vs Learning from Experience * Exploration: Agentless: Demystifying LLM-based Software Engineering Agents and BAGEL: Bootstrapping Agents by Guiding Exploration with Language * Search: Tree Search for Language Model Agents - explore paths and rewind * Evaluation: Fast Sanity Checks (miniWoB and Aider) and Highly Realistic (WebArena, SWE-Bench) and SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers Full Talk on YouTube Please like and subscribe! 
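As referenced in the CodeAct entry above, here is a hedged sketch of the pattern: the model's action is an executable code block, and its output becomes the next observation. This is illustrative of the idea only, not OpenHands' implementation; it assumes an OpenAI-compatible client, and the exec-based "sandbox" is a toy (real systems isolate execution):

```python
# Hedged sketch of a CodeAct-style loop: the agent's action is Python code.
import contextlib, io, traceback
from openai import OpenAI  # any OpenAI-compatible chat client works here

client = OpenAI()
FENCE = "`" * 3  # built this way to avoid a literal code fence inside the example
SYSTEM = (f"Solve the task by replying with one {FENCE}python code block. "
          "You will be shown its output. Reply DONE when finished.")

def execute(code: str, env: dict) -> str:
    """Run the action and capture stdout/errors as the next observation."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # toy sandbox; real agents use Docker/remote runtimes
    except Exception:
        buf.write(traceback.format_exc())
    return buf.getvalue() or "(no output)"

def codeact(task: str, max_turns: int = 5) -> None:
    env: dict = {}
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages).choices[0].message.content
        if "DONE" in reply:
            break
        marker = FENCE + "python"
        code = reply.split(marker)[1].split(FENCE)[0] if marker in reply else reply
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": "Output:\n" + execute(code, env)}]

codeact("Compute the 20th Fibonacci number and print it.")
```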
Timestamps * 00:00 Welcome to Latent Space Live at NeurIPS 2024 * 00:29 State of LLM Agents in 2024 * 02:20 Professor Graham Neubig's Insights on Agents * 03:57 Live Demo: Coding Agents in Action * 08:20 Designing Effective Agents * 14:13 Choosing the Right Language Model for Agents * 16:24 Planning and Workflow for Agents * 22:21 Evaluation and Future Predictions for Agents * 25:31 Future of Agent Development * 25:56 Human-Agent Interaction Challenges * 26:48 Expanding Agent Use Beyond Programming * 27:25 Redesigning Systems for Agent Efficiency * 28:03 Accelerating Progress with Agent Technology * 28:28 Call to Action for Open Source Contributions * 30:36 Q&A: Agent Performance and Benchmarks * 33:23 Q&A: Web Agents and Interaction Methods * 37:16 Q&A: Agent Architectures and Improvements * 43:09 Q&A: Self-Improving Agents and Authentication * 47:31 Live Demonstration and Closing Remarks Transcript [00:00:29] State of LLM Agents in 2024 [00:00:29] Speaker 9: Our next keynote covers the state of LLM agents. With the triumphant return of Professor Graham Neubig of CMU and OpenDevin, now a startup known as AllHands. The renamed OpenHands has done extremely well this year, as they end the year sitting comfortably at number one on the hardest SWE-Bench Full leaderboard at 29%. [00:00:53] Speaker 9: Though, on the smaller SWE-Bench Verified, they are at 53 percent, behind Amazon Q, [00:01:00] Devlo, and OpenAI's self reported o3 results at 71.7%. Many are saying that 2025 is going to be the year of agents, with OpenAI, DeepMind, and Anthropic setting their sights on consumer and coding agents. Vision based computer using agents and multi agent systems. [00:01:22] Speaker 9: There has been so much progress on the practical reliability and applications of agents in all domains, from the huge launch of Cognition AI's Devin this year, to the sleeper hit of Cursor Composer and recent guest Codeium's Windsurf Cascade in the IDE arena. To the explosive revenue growth of recent guests StackBlitz's Bolt, Lovable, and Vercel's v0. [00:01:44] Speaker 9: And the unicorn rounds and high profile movements of customer support agents like Sierra, now worth 4 billion, and search agents like Perplexity, now worth 9 billion. We wanted to take a little step back to understand the most notable papers of the year in [00:02:00] agents, and Graham indulged with his list of eight perennial problems in building agents. [00:02:06] Speaker 9: As always, don't forget to check our show notes for all the selected best papers of 2024, and for the YouTube link to their talk. Graham's slides were especially popular online, and we are honoured to have him. Watch out and take care! [00:02:20] Professor Graham Neubig's Insights on Agents [00:02:20] Speaker: Okay hi everyone. So I was given the task of talking about agents in 2024, and this is an impossible task because there are so many agents, so many agents in 2024. So this is going to be strongly covered by like my personal experience and what I think is interesting and important, but I think it's an important topic. [00:02:41] Speaker: So let's go ahead. So the first thing I'd like to think about is let's say I gave you, you know, a highly competent human, some tools. Let's say I gave you a web browser and a terminal or a file system. And the ability to [00:03:00] edit text or code. What could you do with that? Everything. Yeah. [00:03:07] Speaker: Probably a lot of things. This is like 99 percent of my, you know, daily daily life, I guess. When I'm, when I'm working. 
So, I think this is a pretty powerful tool set, and I am trying to do, and what I think some other people are trying to do, is come up with agents that are able to, you know, manipulate these things. [00:03:26] Speaker: Web browsing, coding, running code in successful ways. So there was a little bit about my profile. I'm a professor at CMU, chief scientist at All Hands AI, building open source coding agents. I'm maintainer of OpenHands, which is an open source coding agent framework. And I'm also a software developer and I, I like doing lots of coding and, and, you know, shipping new features and stuff like this. [00:03:51] Speaker: So building agents that help me to do this, you know, is kind of an interesting thing, very close to me. [00:03:57] Live Demo: Coding Agents in Action [00:03:57] Speaker: So the first thing I'd like to do is I'd like to try [00:04:00] some things that I haven't actually tried before. If anybody has, you know, tried to give a live demo, you know, this is, you know very, very scary whenever you do it and it might not work. [00:04:09] Speaker: So it might not work this time either. But I want to show you like three things that I typically do with coding agents in my everyday work. I use coding agents maybe five to 10 times a day to help me solve my own problems. And so this is a first one. This is a data science task. Which says I want to create scatter plots that show the increase of the SWE-Bench score over time. [00:04:34] Speaker: And so I, I wrote a kind of concrete prompt about this. Agents work better with like somewhat concrete prompts. And I'm gonna throw this into OpenHands and let it work. And I'll, I'll go back to that in a second. Another thing that I do is I create new software. And I, I've been using a [00:05:00] service, a particular service. [00:05:01] Speaker: I won't name it, for sending emails and I'm not very happy with it. So I want to switch over to this new service called resend.com, which makes it easier to send emails. And so I'm going to ask it to read the docs for the resend.com API and come up with a script that allows me to send emails. The input to the script should be a CSV file and the subject and body should be provided in Jinja2 templates. [00:05:24] Speaker: So I'll start another agent and and try to get it to do that for me. [00:05:35] Speaker: And let's go with the last one. The last one I do is. This is improving existing software and in order, you know, once you write software, you usually don't throw it away. You go in and, like, actually improve it iteratively. This software that I have is something I created without writing any code. [00:05:52] Speaker: It's basically software to monitor how much our, our agents are contributing to the OpenHands repository. [00:06:00] And on the, let me make that a little bit bigger, on the left side, I have the number of issues where it like sent a pull request, whether it was merged in purple, closed in red, or is still open in green. And so these are like, you know, it's helping us monitor, but one th
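For readers curious what the second demo is asking for, here is roughly the script the agent should produce: a hedged sketch assuming Resend's public REST endpoint, with made-up CSV column names:

```python
# Hedged sketch of the requested script: CSV in, templated emails out via resend.com.
# Endpoint per Resend's public REST docs; column names (email, name) are assumptions.
import csv
import requests
from jinja2 import Template

SUBJECT = Template("Hello {{ name }}!")
BODY = Template("<p>Hi {{ name }}, thanks for joining the beta.</p>")

def send_all(csv_path: str, api_key: str) -> None:
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):  # each row supplies the template variables
            resp = requests.post(
                "https://api.resend.com/emails",
                headers={"Authorization": f"Bearer {api_key}"},
                json={"from": "me@example.com",
                      "to": [row["email"]],
                      "subject": SUBJECT.render(**row),
                      "html": BODY.render(**row)},
            )
            resp.raise_for_status()

send_all("recipients.csv", "YOUR_RESEND_API_KEY")
```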
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all, all our LS supporters who helped fund the gorgeous venue and A/V production! For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver. Today, we’re proud to share Loubna’s highly anticipated talk (slides here)! Synthetic Data We called out the Synthetic Data debate at last year’s NeurIPS, and no surprise that 2024 was dominated by the rise of synthetic data everywhere: * Apple’s Rephrasing the Web, Microsoft’s Phi 2-4 and Orca/AgentInstruct, Tencent’s Billion Persona dataset, DCLM, and HuggingFace’s FineWeb-Edu, and Loubna’s own Cosmopedia extended the ideas of synthetic textbook and agent generation to improve raw web scrape dataset quality * This year we also talked to the IDEFICS/OBELICS team at HuggingFace who released WebSight this year, the first work on code-vs-images synthetic data. * We called Llama 3.1 the Synthetic Data Model for its extensive use (and documentation!) of synthetic data in its pipeline, as well as its permissive license. * Nemotron CC and Nemotron-4-340B also made a big splash this year for how they used 20k items of human data to synthesize over 98% of the data used for SFT/PFT. * Cohere introduced Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress, observing gains of up to 56.5% improvement in win rates when comparing multiple teachers vs the single best teacher model * In post training, AI2’s Tülu3 (discussed by Luca in our Open Models talk) and Loubna’s Smol Talk were also notable open releases this year. This comes in the face of a lot of scrutiny and criticism, with Scale AI as one of the leading voices publishing AI models collapse when trained on recursively generated data in Nature magazine, bringing mainstream concerns to the potential downsides of poor-quality synthetic data. Part of the concerns we highlighted last year on low-background tokens are coming to bear: ChatGPT-contaminated data is spiking in every possible metric. But perhaps, if Sakana’s AI Scientist pans out this year, we will have mostly-AI AI researchers publishing AI research anyway, so do we really care as long as the ideas can be verified to be correct? Smol Models Meta surprised many folks this year by not just aggressively updating Llama 3 and adding multimodality, but also adding a new series of “small” 1B and 3B “on device” models this year, even working on quantized numerics collaborations with Qualcomm, MediaTek, and Arm. It is near unbelievable that a 1B model today can qualitatively match a 13B model of last year, and the minimum size to hit a given MMLU bar has come down roughly 10x in the last year. 
We have been tracking this, proxied by LMSYS Elo and inference price. The key reads this year are: * MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases * Apple Intelligence Foundation Language Models * Hymba: A Hybrid-head Architecture for Small Language Models * Loubna’s SmolLM and SmolLM2: a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters on the Pareto efficiency frontier. * and Moondream, which we already covered in the 2024 in Vision talk Full Talk on YouTube please like and subscribe! Timestamps * [00:00:05] Loubna Intro * [00:00:33] The Rise of Synthetic Data Everywhere * [00:02:57] Model Collapse * [00:05:14] Phi, FineWeb, Cosmopedia - Synthetic Textbooks * [00:12:36] DCLM, Nemotron-CC * [00:13:28] Post Training - AI2 Tulu, Smol Talk, Cohere Multilingual Arbitrage * [00:16:17] Smol Models * [00:18:24] On Device Models * [00:22:45] Smol Vision Models * [00:25:14] What's Next Transcript 2024 in Synthetic Data and Smol Models [00:00:00] [00:00:05] Loubna Intro [00:00:05] Speaker: I'm very happy to be here. Thank you for the invitation. So I'm going to be talking about synthetic data in 2024. And then I'm going to be talking about small on device models. So I think the most interesting thing about synthetic data this year is that like now we have it everywhere in the large language models pipeline. [00:00:33] The Rise of Synthetic Data Everywhere [00:00:33] Speaker: I think initially, synthetic data was mainly used just for post training, because naturally that's the part where we needed human annotators. And then after that, we realized that we don't really have good benchmarks to [00:01:00] measure if models follow instructions well, if they are creative enough, or if they are chatty enough, so we also started using LLMs as judges. [00:01:08] Speaker: Thank you. And I think this year and towards the end of last year, we also went to the pre training parts and we started generating synthetic data for pre training to kind of replace some parts of the web. And the motivation behind that is that you have a lot of control over synthetic data. You can control your prompt and basically also the kind of data that you generate. [00:01:28] Speaker: So instead of just trying to filter the web, you could try to get the LLM to generate what you think the best web pages could look like and then train your models on that. So this is how we went from not having synthetic data at all in the LLM pipeline to having it everywhere. And so the cool thing is like today you can train an LLM with like an entirely synthetic pipeline. [00:01:49] Speaker: For example, you can use our Cosmopedia datasets and you can train a 1B model on like 150 billion tokens that are 100 percent synthetic. And those are also of good quality. And then you can [00:02:00] instruction tune the model on a synthetic SFT dataset. You can also do DPO on a synthetic dataset. And then to evaluate if the model is good, you can use a benchmark that uses LLMs as a judge, for example, MT-Bench or AlpacaEval. So I think this is like really mind-blowing because like just a few years ago, we wouldn't think this is possible. And I think there's a lot of concerns about model collapse, and I'm going to talk about that later. But we'll see that like, if we use synthetic data properly and we curate it carefully, that shouldn't happen. [00:02:29] Speaker: And the reason synthetic data is very popular right now is that we have really strong models, both open and closed. 
It is really cheap and fast to use compared to human annotations, which cost a lot and take a lot of time. And also for open models right now, we have some really good inference frameworks. [00:02:47] Speaker: So if you have enough GPUs, it's really easy to spawn these GPUs and generate like a lot of synthetic data. Some examples are vLLM, TGI, and TensorRT-LLM. [00:02:57] Model Collapse [00:02:57] Speaker: Now let's talk about the elephant in the room, model [00:03:00] collapse. Is this the end? If you look at the media and all of like, for example, some papers in Nature, it's really scary because there's a lot of synthetic data out there in the web. [00:03:09] Speaker: And naturally we train on the web. So we're going to be training a lot of synthetic data. And if model collapse is going to happen, we should really try to take that seriously. And the other issue is that, as I said, we think, a lot of people think the web is polluted because there's a lot of synthetic data. [00:03:24] Speaker: And for example, when we were building the FineWeb datasets here with Guilherme and Hynek, we were interested in like, how much synthetic data is there in the web? So there isn't really a method to properly measure the amount of synthetic data or to say if a webpage is synthetic or not. But one thing we can do is to try to look for like proxy words, for example, expressions like as a large language model or words like delve that we know are actually generated by ChatGPT. [00:03:49] Speaker: We could try to measure the amount of these words in our datasets and compare them to the previous years. For example, here, we measured like these words' ratio in different dumps of Common Crawl. [00:04:00] And we can see that like the ratio really increased after ChatGPT's release. So if we were to say that synthetic data amount didn't change, you would expect this ratio to stay constant, which is not the case. [00:04:11] Speaker: So there's a lot of synthetic data probably on the web, but does this really make models worse? So what we did is we trained different models on these different dumps. And we then computed their performance on popular, like, NLP benchmarks, and then we computed the aggregated score. And surprisingly, you can see that the latest dumps are actually even better than the dumps that came before. [00:04:31] Speaker: So if there's some synthetic data there, at least it did not make the models worse. Yeah, which is really encouraging. So personally, I wouldn't say the web is polluted with synthetic data. Maybe it's even making it more rich. And the issue with like model collapse is that, for example, those studies, they were done at like a small scale, and you would ask the model to complete, for example, a Wikipedia paragraph, and then you would train it on these new generations, and you would do that every day, iteratively. I think if you do that approach, it's normal to [00:05:00] observe this kind of behavior because the quality is going to be worse because the model is already small. And then if you train it just on its generations, you shouldn't expect it to become better. But what we're really doing here is that we take a mo
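Loubna's point about inference frameworks is the practical unlock here: with something like vLLM, bulk synthetic generation is a few lines. A hedged sketch, where the model id and seed prompts are illustrative rather than from her actual pipeline:

```python
# Hedged sketch: batched synthetic-textbook generation with vLLM.
# Model id and prompts are illustrative; any open instruct model works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

seeds = [
    "Write a textbook-style explanation of photosynthesis for a curious 10-year-old.",
    "Write a short tutorial on binary search with one worked example.",
]
outputs = llm.generate(seeds, params)          # batched across however many GPUs you have
corpus = [o.outputs[0].text for o in outputs]  # 100% synthetic training text
print(corpus[0][:300])
```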
Happy holidays! We’ll be sharing snippets from Latent Space LIVE! through the break bringing you the best of 2024! We want to express our deepest appreciation to event sponsors AWS, Daylight Computer, Thoth.ai, StrongCompute, Notable Capital, and most of all, all our LS supporters who helped fund the gorgeous venue and A/V production! Update: see followup discussion on HN and also the YouTube discussion. For NeurIPS last year we did our standard conference podcast coverage interviewing selected papers (that we have now also done for ICLR and ICML), however we felt that we could be doing more to help AI Engineers 1) get more industry-relevant content, and 2) recap 2024 year in review from experts. As a result, we organized the first Latent Space LIVE!, our first in person miniconference, at NeurIPS 2024 in Vancouver. Of perennial interest, particularly at academic conferences, is scaled-up architecture research as people hunt for the next Attention Is All You Need. We have many names for them: “efficient models”, “retentive networks”, “subquadratic attention” or “linear attention” but some of them don’t even have any lineage with attention - one of the best papers of this NeurIPS was Sepp Hochreiter’s xLSTM, which has a particularly poetic significance as one of the creators of the LSTM returning to update and challenge the OG language model architecture: So, for lack of a better term, we decided to call this segment “the State of Post-Transformers” and fortunately everyone rolled with it. We are fortunate to have two powerful friends of the pod to give us an update here: * Together AI: with CEO Vipul Ved Prakash and CTO Ce Zhang joining us to talk about how they are building Together together as a quote unquote full stack AI startup, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms, with notable industry contributions from RedPajama v2, Flash Attention 3, Mamba 2, Mixture of Agents, BASED, Sequoia, Evo, Dragonfly, Dan Fu's ThunderKittens and many more research projects this year * Recursal AI: with CEO Eugene Cheah who has helped lead the independent RWKV project while also running Featherless AI. This year, the team has shipped RWKV v5, codenamed Eagle, to 1.5 billion Windows 10 and Windows 11 machines worldwide, to support Microsoft's on-device, energy-usage-sensitive Windows Copilot use cases, and has launched the first updates on RWKV v6, codenamed Finch and GoldFinch. On the morning of Latent Space Live, they also announced QRWKV6, a Qwen 32B model modified with RWKV linear attention layers. We were looking to host a debate between our speakers, but given that both of them were working on post-transformers alternatives, we ended up with two updates instead. Full Talk on YouTube Please like and subscribe! Links All the models and papers they picked: * Earlier Cited Work * Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention * Hungry hungry hippos: Towards language modeling with state space models * Hyena hierarchy: Towards larger convolutional language models * Mamba: Linear-Time Sequence Modeling with Selective State Spaces * S4: Efficiently Modeling Long Sequences with Structured State Spaces * Just Read Twice (Arora et al) * Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. 
However, due to the limited memory, recurrent LMs cannot recall and use all the information in long contexts leading to brittle in-context learning (ICL) quality. A key challenge for efficient LMs is selecting what information to store versus discard. In this work, we observe the order in which information is shown to the LM impacts the selection difficulty. * To formalize this, we show that the hardness of information recall reduces to the hardness of a problem called set disjointness (SD), a quintessential problem in communication complexity that requires a streaming algorithm (e.g., recurrent model) to decide whether inputted sets are disjoint. We empirically and theoretically show that the recurrent memory required to solve SD changes with set order, i.e., whether the smaller set appears first in-context. * Our analysis suggests, to mitigate the reliance on data order, we can put information in the right order in-context or process prompts non-causally. Towards that end, we propose: (1) JRT-Prompt, where context gets repeated multiple times in the prompt, effectively showing the model all data orders. This gives 11.0±1.3 points of improvement, averaged across 16 recurrent LMs and the 6 ICL tasks, with 11.9× higher throughput than FlashAttention-2 for generation prefill (length 32k, batch size 16, NVidia H100). We then propose (2) JRT-RNN, which uses non-causal prefix-linear-attention to process prompts and provides 99% of Transformer quality at 360M params., 30B tokens and 96% at 1.3B params., 50B tokens on average across the tasks, with 19.2× higher throughput for prefill than FA2. * Jamba: A 52B Hybrid Transformer-Mamba Language Model * We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. * Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. * This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. * Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. * We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license. * SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers * We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. 
Core designs include: * (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. * (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. * (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. * (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. * As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. * RWKV: Reinventing RNNs for the Transformer Era * Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. * We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. * Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. * We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks. * LoLCATs: On Low-Rank Linearizing of Large Language Models * Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. * We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory
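Since several of the picks above (Transformers are RNNs, RWKV, LoLCATs) revolve around replacing softmax attention with a linear-time, constant-memory variant, a minimal sketch may help make the abstracts concrete. The following NumPy toy follows the causal linear attention of Katharopoulos et al.; the ELU(x)+1 feature map is from that paper, while the array shapes and the 1e-6 stabilizer are our own illustrative choices:

```python
import numpy as np

def feature_map(x):
    # phi(x) = ELU(x) + 1, the positive feature map from
    # "Transformers are RNNs" (Katharopoulos et al., 2020)
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    Qp, Kp = feature_map(Q), feature_map(K)
    S = np.zeros((Q.shape[1], V.shape[1]))  # running sum of phi(k_t) v_t^T
    z = np.zeros(Q.shape[1])                # running sum of phi(k_t)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):             # O(seq_len) time, fixed-size state
        S += np.outer(Kp[t], V[t])
        z += Kp[t]
        out[t] = (Qp[t] @ S) / (Qp[t] @ z + 1e-6)
    return out

# Example: 8 tokens, 4-dim keys/values
rng = np.random.default_rng(0)
x = causal_linear_attention(rng.standard_normal((8, 4)),
                            rng.standard_normal((8, 4)),
                            rng.standard_normal((8, 4)))
print(x.shape)  # (8, 4)
```

The running state (S, z) has the same size no matter how long the sequence is, which is exactly the constant-memory-inference property the Just Read Twice abstract highlights, and also the reason long-context recall gets hard: everything the model wants to remember must fit into that fixed-size state.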
Since Nathan Lambert (Interconnects) joined us for the hit RLHF 201 episode at the start of this year, it is hard to overstate how much Open Models have exploded this past year. In 2023 only five names were playing in the top LLM ranks: Mistral, Mosaic's MPT, TII UAE's Falcon, Yi from Kai-Fu Lee's 01.ai, and of course Meta's Llama 1 and 2. This year a whole cast of new open models have burst on the scene, from Google's Gemma and Cohere's Command R, to Alibaba's Qwen and DeepSeek models, to LLM360 and DCLM, and of course to the Allen Institute's OLMo, OLMoE, Pixmo, Molmo, and OLMo 2 models. We were honored to host Luca Soldaini, one of the research leads on the OLMo series of models at AI2. Pursuing Open Model research comes with a lot of challenges beyond just funding and access to GPUs and datasets, particularly the regulatory debates this year across Europe, California and the White House. We were also honored to hear from Sophia Yang, head of devrel at Mistral, who also presented a great session at the AI Engineer World's Fair Open Models track! Full Talk on YouTube. Please like and subscribe! Timestamps * 00:00 Welcome to Latent Space Live * 00:12 Recap of 2024: Best Moments and Keynotes * 01:22 Explosive Growth of Open Models in 2024 * 02:04 Challenges in Open Model Research * 02:38 Keynote by Luca Soldaini: State of Open Models * 07:23 Significance of Open Source AI Licenses * 11:31 Research Constraints and Compute Challenges * 13:46 Fully Open Models: A New Trend * 27:46 Mistral's Journey and Innovations * 32:57 Interactive Demo: Le Chat Capabilities * 36:50 Closing Remarks and Networking Transcript [00:00:00] AI Charlie: Welcome to Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space network to cover each field. [00:00:28] AI Charlie: 200 of you joined us in person throughout the day, with over 2,200 watching live online. Our next keynote covers the state of open models in 2024, with Luca Soldaini and Nathan Lambert of the Allen Institute for AI, with a special appearance from Dr. Sophia Yang of Mistral. Our first hit episode of 2024 was with Nathan Lambert on RLHF 201 back in January, [00:00:57] AI Charlie: where he discussed both reinforcement learning for language [00:01:00] models and the growing post-training and mid-training stack, with hot takes on everything from constitutional AI to DPO to rejection sampling, and also previewed the sea change coming to the Allen Institute. 
And to Interconnects, his incredible substack on the technical aspects of state of the art AI training. [00:01:18] AI Charlie: We highly recommend subscribing to get access to his Discord as well. It is hard to overstate how much open models have exploded this past year. In 2023, only five names were playing in the top LLM ranks: Mistral, Mosaic's MPT, TII UAE's Falcon, Yi from Kai-Fu Lee's 01.ai, and of course, Meta's Llama 1 and 2. [00:01:43] AI Charlie: This year, a whole cast of new open models have burst on the scene, from Google's Gemma and Cohere's Command R, to Alibaba's Qwen and DeepSeek models, to LLM360 and DCLM, and of course, to the Allen Institute's OLMo, [00:02:00] OLMoE, Pixmo, Molmo, and OLMo 2 models. Pursuing open model research comes with a lot of challenges beyond just funding and access to GPUs and datasets, particularly the regulatory debates this year across Europe, [00:02:14] AI Charlie: California, and the White House. We also were honored to hear from Mistral, who also presented a great session at the AI Engineer World's Fair Open Models track. As always, don't forget to check the show notes for the YouTube link to their talk, as well as their slides. Watch out and take care. [00:02:35] Luca Intro [00:02:35] Luca Soldaini: Cool. Yeah, thanks for having me over. I'm Luca. I'm a research scientist at the Allen Institute for AI. I threw together a few slides on sort of a recap of interesting themes in open models for 2024. I have about maybe 20, 25 minutes of slides, and then we can chat if there are any questions. [00:02:57] Luca Soldaini: If I can advance to the next slide. [00:03:00] Okay, cool. So I did a quick check to sort of get a sense of how much 2024 was different from 2023. So I went on Hugging Face and tried to get a picture of what kind of models were released in 2023 and what we got in 2024. [00:03:16] Luca Soldaini: In 2023 we got things like both Llama 1 and 2, we got Mistral, we got MPT, Falcon models, and I think the Yi model came in at the tail end of the year. It was a pretty good year. But then I did the same for 2024, and it's actually quite a stark difference. You have models that are, you know, rivaling frontier-level [00:03:38] Luca Soldaini: performance of what you can get from closed models, from like Qwen, from DeepSeek. We got Llama 3. We got all sorts of different models. I added our own OLMo at the bottom. There's this growing group of fully open models that I'm going to touch on a little bit later. But you know, just looking at the slides, it feels like 2024 [00:04:00] was just smooth sailing, much better than the previous year. [00:04:04] Luca Soldaini: And you know, you can pick your favorite benchmark, or least favorite, I don't know, depending on what point you're trying to make, and plot, you know, your closed model and your open model, and sort of spin it in ways that show that, oh, you know, open models are much closer to where closed models are today versus last year, where the gap was fairly significant. [00:04:29] Luca Soldaini: So, one thing that I don't know if I have to convince people in this room of, but usually when I give these talks about open models, there is always this background question in people's minds: why should we use open models? The APIs argument, you know: it's just an HTTP request to get output from one of the best models out there. 
[00:04:53] Luca Soldaini: Why do I have to set up infra and use local models? And there are really two answers. There is the more [00:05:00] researchy answer, which is where my background lies, which is just research. If you want to do research on language models, research thrives on open models. There is a large swath of research on modeling, on how these models behave, on evaluation and inference, on mechanistic interpretability, that could not happen at all if you didn't have open models. And then for AI builders, there are also [00:05:30] Luca Soldaini: good use cases for using local models. You know, this is a very non-comprehensive slide, but you have things like: there are some applications where local models just blow closed models out of the water, so retrieval is a very clear example. We might have constraints like edge AI applications where it makes sense. [00:05:51] Luca Soldaini: But even just in terms of stability, being able to say this model is not changing under the hood. There's plenty of good cases for [00:06:00] open models. And the community is not just models. I stole this slide from one of the Qwen2 announcement blog posts, but it's super cool to see how much tech exists around open models, on serving them, on making them efficient, and on hosting them. [00:06:18] Luca Soldaini: It's pretty cool. And so, if you think about where the term "open" comes from, it comes from open source. Really, open models meet the core tenets of open source, specifically when it comes to collaboration: there is truly a spirit that, through these open models, you can build on top of other people's [00:06:41] Luca Soldaini: innovation. We see a lot of this even in our own work: as we iterate on the various versions of OLMo, it's not like every time we collect all the data from scratch. No, the first step is, okay, what are the cool data sources and datasets people have put [00:07:00] together for language model training? [00:07:01] Luca Soldaini: Or when it comes to our post-training pipeline, one of the steps is you want to do some DPO, and you use a lot of outputs of other models to improve your preference model. So having an open ecosystem really benefits and accelerates the development of open models. [00:07:23] The Definition of Open Models [00:07:23] Luca Soldaini: One thing that we got in 2024, which is not a specific model, but I thought it was really significant, is we got our first open source AI definition. This is from the Open Source Initiative; they've generally been the steward of a lot of the open source licenses when it comes to software, and so they embarked on this journey in trying to figure out, okay, what does an open source license for a model look like? [00:07:52] Luca Soldaini: The majority of the work is very dry because
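Stepping out of the transcript for a moment: since Luca cites DPO as a standard step in an open post-training pipeline, here is a minimal sketch of the DPO objective for a single preference pair. The helper name, the beta value, and the use of summed token log-probs are illustrative assumptions, not AI2's actual training code:

```python
import math

def dpo_loss(pi_logp_chosen: float, pi_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # Implicit rewards are beta-scaled log-ratios against the frozen reference.
    r_chosen = beta * (pi_logp_chosen - ref_logp_chosen)
    r_rejected = beta * (pi_logp_rejected - ref_logp_rejected)
    # DPO loss = -log sigmoid(r_chosen - r_rejected)
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

# Example: the policy already slightly prefers the chosen completion.
print(round(dpo_loss(-12.0, -15.0, -13.0, -14.0), 4))
```

The point Luca is making is that the (chosen, rejected) pairs themselves can be built from other open models' outputs, which is one concrete way the open ecosystem compounds.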
The single most requested domain was computer vision, and we could think of no one better to help us recap 2024 than our friends at Roboflow, who were among our earliest guests in 2023 and again had one of this year's top episodes in 2024. Roboflow has since raised a $40m Series B! Links Their slides are here: All the trends and papers they picked: * Isaac Robinson * Sora (see our Video Diffusion pod) - extending diffusion from images to video * SAM 2: Segment Anything in Images and Videos (see our SAM2 pod) - extending prompted masks to full video object segmentation * DETR Dominancy: DETRs show Pareto improvement over YOLOs * RT-DETR: DETRs Beat YOLOs on Real-time Object Detection * LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection * D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement * Peter Robicheaux * MMVP (Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs) * Florence-2 (Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks) * PaliGemma / PaliGemma 2 * PaliGemma: A versatile 3B VLM for transfer * PaliGemma 2: A Family of Versatile VLMs for Transfer * AIMv2 (Multimodal Autoregressive Pre-training of Large Vision Encoders) * Vik Korrapati - Moondream Full Talk on YouTube Want more content like this? Like and subscribe to stay updated on our latest talks, interviews, and podcasts. Transcript/Timestamps [00:00:00] Intro [00:00:05] AI Charlie: Welcome to Latent Space Live, our first mini conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host. When we were thinking of ways to add value to our academic conference coverage, we realized that there was a lack of good talks just recapping the best of 2024, going domain by domain. [00:00:36] AI Charlie: We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space Network to cover each field. 200 of you joined us in person throughout the day, with over 2,200 watching live online. Our second featured keynote is The Best of Vision 2024, with Peter Robicheaux and Isaac [00:01:00] Robinson of Roboflow, with a special appearance from Vik Korrapati of Moondream. [00:01:05] AI Charlie: When we did a poll of our attendees, the highest interest domain of the year was vision. And so our first port of call was our friends at Roboflow. Joseph Nelson helped us kickstart our vision coverage in episode 7 last year, and this year came back as a guest host with Nikhila Ravi of Meta to cover Segment Anything 2. [00:01:25] AI Charlie: Roboflow have consistently been the leaders in open source vision models and tooling, with their Supervision library recently eclipsing PyTorch's vision library. 
And Roboflow Universe hosting hundreds of thousands of open source vision datasets and models. They have since announced a $40 million Series B led by Google Ventures. [00:01:46] AI Charlie: Woohoo. [00:01:48] Isaac's picks [00:01:48] Isaac Robinson: Hi, we're Isaac and Peter from Roboflow, and we're going to talk about the best papers of 2024 in computer vision. So, for us, we defined best as what made [00:02:00] the biggest shifts in the space. And to determine that, we looked at what are some major trends that happened and what papers most contributed to those trends. [00:02:09] Isaac Robinson: So I'm going to talk about a couple trends, Peter's going to talk about a trend, and then we're going to hand it off to Moondream. So, the trends that I'm interested in talking about are these: a major transition from models that run on a per-image basis to models that run, using the same basic ideas, on video, and then also how DETRs are starting to take over the real-time object detection scene from the YOLOs, which have been dominant for years. [00:02:37] Sora, OpenSora and Video Vision vs Generation [00:02:37] Isaac Robinson: So as a highlight, we're going to talk about Sora, which from my perspective is the biggest paper of 2024, even though it came out in February. Is the what? [00:02:48] Isaac Robinson: Yeah, so Sora is just a post. So I'm going to fill it in with details from replication efforts, including OpenSora and related work, such as Stable [00:03:00] Video Diffusion. And then we're also going to talk about SAM 2, which applies the SAM strategy to video. And then the DETRs: these are the improvements in 2024 to DETRs that are making them a Pareto improvement over YOLO-based models. [00:03:15] Isaac Robinson: So to start this off, we're going to talk about the state of the art of video generation at the end of 2023: MagViT. MagViT is a discrete-token video tokenizer akin to VQGAN, but applied to video sequences. And it actually outperforms state-of-the-art handcrafted video compression frameworks [00:03:38] Isaac Robinson: in terms of bit rate versus human preference for quality, and videos generated by autoregressing on these discrete tokens are some pretty nice stuff, but up to like five seconds length and, you know, not super detailed. And then suddenly a few months later we have this, which when I saw it was totally mind-blowing to me. [00:03:59] Isaac Robinson: 1080p, [00:04:00] a whole minute long. We've got light reflecting in puddles. Reminds me of those RTX demonstrations for next-generation video games, such as Cyberpunk, but with better graphics. You can see some issues in the background if you look closely, but as with a lot of these models, the issues tend to be things that people aren't going to pay attention to unless they're looking for them, [00:04:24] Isaac Robinson: in the same way that six fingers on a hand is a giveaway you're not going to notice unless you're looking for it. So yeah, as we said, Sora does not have a paper, so we're going to be filling it in with context from the rest of the computer vision scene attempting to replicate these efforts. So the first step: you have an LLM caption a huge amount of videos. [00:04:48] Isaac Robinson: This is a trick that they introduced in DALL-E 3, where they train an image captioning model to just generate very high quality captions for a huge corpus, and then train a diffusion model [00:05:00] on that. 
The Sora replication efforts also show a bunch of other steps that are necessary for good video generation, [00:05:09] Isaac Robinson: including filtering by aesthetic score and filtering by making sure the videos have enough motion, so the generator isn't learning to just generate static frames. So then we encode our video into a series of space-time latents. Once again, Sora is very sparse on details, [00:05:29] Isaac Robinson: so in the replication-related works, OpenSora actually uses a MagViT v2 itself to do this, but swapping out the discretization step with a classic VAE autoencoder framework. They show that there's a lot of benefit from getting the temporal compression, which makes a lot of sense, as sequential frames in videos have mostly redundant information. [00:05:53] Isaac Robinson: So by compressing in the temporal space, you allow the latent to hold [00:06:00] a lot more semantic information while avoiding that duplication. So, we've got our space-time latents, possibly via some 3D VAE, presumably a MagViT v2, and then you throw it into a diffusion transformer. [00:06:19] Isaac Robinson: So I think it's personally interesting to note that OpenSora is using a MagViT v2, which originally used an autoregressive transformer decoder to model the latent space, but is now using a diffusion transformer. So it's still a transformer happening. Just the question is: is it [00:06:37] Isaac Robinson: parameterizing the stochastic differential equation, or parameterizing a conditional distribution via autoregression? It's also worth noting that most diffusion models today, the very high performance ones, are switching away from the classic DDPM (denoising diffusion probabilistic modeling) framework to rectified flows. [00:06:57] Isaac Robinson: Rectified flows have a very interesting property that as [00:07:00] they converge, they actually get closer to being able to be sampled with a single step, which means that in practice you can actually generate high quality samples much faster. A major problem of DDPM and related models for the past four years is just that they require many, many steps to generate high quality samples. [00:07:22] Isaac Robinson: So, naturally, the third step is throwing lots of compute at the problem. I never figured out how to get this video to loop, but we see very little compute, medium compute, lots of compute. This is so interesting because the original diffusion transformer paper from Facebook actually showed that, in fact, the specific hyperparameters of the transformer didn't really matter that much. [00:07:48] Isaac Robinson: What mattered was that you were just increasing the amount of compute that the model had. So, I love how in the, once again, little blog posts, they don't even talk about [00:08:00] lik
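Stepping out of the transcript: to make Isaac's point about rectified flows concrete, here is a toy sketch of the straight-path training target and the Euler sampling loop. Everything here is a stand-in (velocity_model is a fake network, the data is random noise); it only illustrates why straighter flows need fewer sampling steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_model(x_t: np.ndarray, t: float) -> np.ndarray:
    # Stand-in for a trained network v_theta(x_t, t); a real model is trained
    # to regress the constant velocity (x1 - x0) of the straight path.
    return -x_t  # toy dynamics so the script runs end to end

def training_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    # Straight-line interpolation between noise x0 and data x1 ...
    x_t = (1 - t) * x0 + t * x1
    target_v = x1 - x0  # ... whose velocity is constant along the path
    return x_t, target_v

def sample(n_steps: int, dim: int = 2) -> np.ndarray:
    # Euler integration from t=0 (noise) to t=1 (data). The straighter the
    # learned flow, the fewer steps you need; perfectly straight = one step.
    x = rng.standard_normal(dim)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_model(x, i * dt)
    return x

x_t, v_target = training_pair(rng.standard_normal(2), np.ones(2), t=0.3)
print(x_t, v_target)  # one (input, regression target) pair for the network
print(sample(4))      # few-step sample, viable once the flow is straightened
```

Because the regression target is the constant velocity of a straight line, a well-trained (and re-flowed) model traces nearly straight trajectories, so the step count at sampling time can shrink toward one, which is the speedup Isaac describes.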
For our opening keynote at Latent Space LIVE!, we could think of no one better to cover 'The State of AI Startups' than our friend Sarah Guo (AI superinvestor, founder of Conviction, host of No Priors!) and Pranav Reddy (Conviction partner), who joined to share their takes on how the AI landscape evolved in 2024 and what it means for startups, enterprises, and the industry as a whole! They completely understood the assignment. Recorded live with 200+ in-person and 2200+ online attendees at NeurIPS 2024, this keynote kicks off our mini-conference series exploring different domains of AI development in 2024. Enjoy! Links Slides: https://x.com/saranormous/status/1866933642401886707 Sarah Guo: https://x.com/saranormous Pranav Reddy: https://x.com/prnvrdy Full Video on YouTube Want more content like this? Like and subscribe to stay updated on our latest talks, interviews, and podcasts. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Our second podcast guest ever in March 2023 was Varun Mohan, CEO of Codeium; at the time, they had around 10,000 users and vowed to keep their autocomplete free forever. Today, over a million developers use their products, they still have their free tier, and they recently launched Windsurf, an AI IDE. Chapters * 00:00:00: Introductions & Catchup * 00:03:52: Why they created Windsurf * 00:05:52: Limitations of VS Code * 00:10:12: Evaluation methods for Cascade and Windsurf * 00:16:15: Listener questions about Windsurf launch * 00:20:30: Remote execution and security concerns * 00:25:18: Evolution of Codeium's strategy * 00:28:29: Cascade and its capabilities * 00:33:12: Multi-agent systems * 00:37:02: Areas of improvement for Windsurf * 00:39:12: Building an enterprise-first company * 00:42:01: Copilot for X, AI UX, and Enterprise AI blog posts This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track: friend of the pod Nathan Lambert, who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now — use code DISCORDGANG if you need it. See you in Vancouver! We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one, which had been gearing up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality), and finally robots (their ultimate application). Sora, Genie, and the field of Generative Video World Simulators Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode: * William (Bill) Peebles - SORA (slides) Something that is often asked about Sora is how many inductive biases were introduced to achieve these results. Bill references the same principle brought up by Hyung Won Chung from the o1 team - “sooner or later those biases come back to bite you”. We also recommend these reads from throughout 2024 on Sora. * Lilian Weng’s literature review of Video Diffusion Models * Sora API leak * Estimates of 100k-700k H100s needed to serve Sora (not Turbo) * Artist guides on using Sora for professional storytelling Google DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for: * Genie: Generative Interactive Environments (covered in oral, poster, and workshop) * VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website) We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale. Part 2: Generative Modeling and Diffusion Since 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk: * Wading through the noise: an intuitive look at diffusion models Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion. Then we investigate two flow matching papers - one from Flow Matching co-author Ricky T. Q. Chen (FAIR, Meta), and one on how it is implemented in Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast: * NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models * Speech Self-Supervised Learning Using Diffusion Model Synthetic Data Part 3: Vision The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”. Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma. We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. 
Part 4: Reinforcement Learning and Robotics We segue from vision into robotics with the help of Ashley Edwards, whose work on both the Gato and Genie teams at DeepMind is summarized in Learning actions, policies, rewards, and environments from videos alone. Brittany highlighted two poster session papers: * Behavior Generation with Latent Actions * PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs We also recommend Lerrel Pinto’s On Building General-Purpose Robots. However, we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on: * "What robots have taught me about machine learning" * developing robot generalists * robots that adapt autonomously * how to give feedback to your language model * special mention to PI colleague Sergey Levine on Robotic Foundation Models We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL. Timestamps * [00:00:00] Intros * [00:02:43] Sora - Bill Peebles * [00:44:52] Genie: Generative Interactive Environments * [01:00:17] Genie interview * [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation * [01:30:51] VideoPoet interview - Dan Kondratyuk * [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale. * [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models * [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors * [03:30:30] Ricky Chen - Flow Matching * [04:00:03] Patrick Esser - Stable Diffusion 3 * [04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models * [04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data * [04:39:00] ICML Test of Time winner: DeCAF * [05:03:40] Lucas Beyer: “Vision in the age of LLMs — a data-centric perspective” * [05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone. * [06:03:30] Behavior Generation with Latent Actions interview * [06:09:52] Chelsea Finn: "What robots have taught me about machine learning" * [06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
The full schedule for Latent Space LIVE! at NeurIPS has been announced, featuring Best of 2024 overview talks for the AI Startup Landscape, Computer Vision, Open Models, Transformers Killers, Synthetic Data, Agents, and Scaling, and speakers from Sarah Guo of Conviction, Roboflow, AI2/Meta, Recursal/Together, HuggingFace, OpenHands and SemiAnalysis. Join us for the IRL event/Livestream! Alessio will also be holding a meetup at AWS Re:Invent in Las Vegas this Wednesday. See our new Events page for dates of AI Engineer Summit, Singapore, and World’s Fair in 2025. LAST CALL for questions for our big 2024 recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show! When we first observed that GPT Wrappers are Good, Actually, we did not even have Bolt on our radar. Since we recorded our Anthropic episode discussing building Agents with the new Claude 3.5 Sonnet, Bolt.new (by Stackblitz) has easily cleared the $8m ARR bar, repeating and accelerating its initial $4m feat. There are very many AI code generators and VS Code forks out there, but Bolt probably broke through initially because of its incredible zero shot low effort app generation: But as we explain in the pod, Bolt also emphasized deploy (Netlify)/ backend (Supabase)/ fullstack capabilities on top of Stackblitz’s existing WebContainer full-WASM-powered-developer-environment-in-the-browser tech. Since then, the team has been shipping like mad (with weekly office hours), with bugfixing, full screen, multi-device, long context, diff based edits (using speculative decoding like we covered in Inference, Fast and Slow). All of this has captured the imagination of low/no code builders like Greg Isenberg and many others on YouTube/TikTok/Reddit/X/Linkedin etc: Just as with Fireworks, our relationship with Bolt/Stackblitz goes a bit deeper than normal - swyx advised the launch and got a front row seat to this epic journey, as well as demoed it with Realtime Voice at the recent OpenAI Dev Day. So we are very proud to be the first/closest to tell the full open story of Bolt/Stackblitz! Flow Engineering + Qodo/AlphaCodium Update In year 2 of the pod we have been on a roll getting former guests to return as guest cohosts (Harrison Chase, Aman Sanger, Jon Frankle), and it was a pleasure to catch Itamar Friedman back on the pod, giving us an update on all things Qodo and Testing Agents from our last catchup a year and a half ago: Qodo (they renamed in September) went viral in early January this year with AlphaCodium (paper here, code here) beating DeepMind’s AlphaCode with high efficiency: With a simple problem solving code agent: * The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details. * Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output. * The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness. * Then, it generates more diverse tests for the problem, covering cases not part of the original public tests. * Iteratively, pick a solution, generate the code, and run it on a few test cases. * If the tests fail, improve the code and repeat the process until the code passes every test. 
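To make that flow concrete, here is a compressed sketch of the iterate-until-tests-pass loop. The llm() helper and the solve(inp) convention are hypothetical stand-ins for a real model call and a real harness; AlphaCodium's actual implementation (linked above) is considerably more elaborate:

```python
def llm(prompt: str) -> str:
    """Hypothetical chat-completion call; wire up any model API here."""
    raise NotImplementedError

def run_tests(code: str, tests: list[tuple[str, str]]) -> list[str]:
    """Run candidate code (assumed to define solve(inp)) on input/output pairs."""
    failures = []
    for inp, expected in tests:
        scope: dict = {}
        exec(code, scope)  # only do this with sandboxed/trusted code
        got = str(scope["solve"](inp))
        if got != expected:
            failures.append(f"input={inp!r} expected={expected!r} got={got!r}")
    return failures

def alphacodium_style_solve(problem: str, public_tests: list[tuple[str, str]],
                            max_iters: int = 5) -> str:
    # 1. Reason about the problem in bullet points first.
    reasoning = llm(f"Describe goal, inputs, outputs, rules, constraints:\n{problem}")
    # 2. Explain why each public test's input leads to its output.
    test_reasoning = llm(f"Explain these tests:\n{public_tests}")
    # 3-4. Draft a candidate solution; AlphaCodium also asks the model to rank
    #      several candidates and to generate diverse extra tests here.
    code = llm(f"Write `def solve(inp)` for:\n{problem}\n{reasoning}\n{test_reasoning}")
    extra_tests: list[tuple[str, str]] = []
    # 5-6. Iterate: run the tests, feed failures back, repeat until green.
    for _ in range(max_iters):
        failures = run_tests(code, public_tests + extra_tests)
        if not failures:
            break
        code = llm(f"Tests failed:\n{failures}\nFix this code:\n{code}")
    return code
```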
swyx has previously written similar thoughts on types vs tests for putting bounds on program behavior, but AlphaCodium extends this to AI-generated tests and code. More recently, Itamar has also shown that AlphaCodium’s techniques extend well to the o1 models, making Flow Engineering a useful technique to improve code model performance on every model. This is something we see AI Engineers uniquely well positioned to do compared to ML Engineers/Researchers. Full Video Podcast Like and subscribe! Show Notes * Itamar * Qodo * First episode * Eric * Bolt * StackBlitz * Thinkster * AlphaCodium * WebContainers Chapters * 00:00:00 Introductions & Updates * 00:06:01 Generic vs. Specific AI Agents * 00:07:40 Maintaining vs Creating with AI * 00:17:46 Human vs Agent Computer Interfaces * 00:20:15 Why Docker doesn't work for Bolt * 00:24:23 Creating Testing and Code Review Loops * 00:28:07 Bolt's Task Breakdown Flow * 00:31:04 AI in Complex Enterprise Environments * 00:41:43 AlphaCodium * 00:44:39 Strategies for Breaking Down Complex Tasks * 00:45:22 Building in Open Source * 00:50:35 Choosing a product as a founder * 00:59:03 Reflections on Bolt Success * 01:06:07 Building a B2C GTM * 01:18:11 AI Capabilities and Pricing Tiers * 01:20:28 What makes Bolt unique * 01:23:07 Future Growth and Product Development * 01:29:06 Competitive Landscape in AI Engineering * 01:30:01 Advice to Founders and Embracing AI * 01:32:20 Having a baby and completing an Iron Man Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:12]: Hey, and today we're still in our sort of makeshift in-between studio, but we're very delighted to have a former returning guest host, Itamar. Welcome back. Itamar [00:00:21]: Great to be here after a year or more. Yeah, a year and a half. Swyx [00:00:24]: You're one of our earliest guests on Agents. Now you're CEO co-founder of Qodo. Right. Which has just been renamed. You also raised a $40 million Series A, and we can get caught up on everything, but we're also delighted to have our new guest, Eric. Welcome. Eric [00:00:42]: Thank you. Excited to be here. Should I say Bolt or StackBlitz? Swyx [00:00:45]: Like, is it like its own company now or? Eric [00:00:47]: Yeah. Bolt's definitely bolt.new. That's the thing that we're probably the most known for, I imagine, at this point. Swyx [00:00:54]: Which is ridiculous to say because you were working at StackBlitz for so long. Eric [00:00:57]: Yeah. I mean, within a week, we were doing like double the amount of traffic. And StackBlitz had been online for seven years, and we were like, what? But anyways, yeah. So we're StackBlitz, the company behind bolt.new. If you've heard of bolt.new, that's our stuff. Yeah. Swyx [00:01:12]: Yeah. Itamar [00:01:13]: Excellent. I see, by the way, that founder mode, you need to know to capture opportunities. So kudos on doing that, right? You're working on some technology, and then suddenly you can exploit that to a new world. Yeah. Eric [00:01:24]: Totally. And I think, well, not to jump, but 100%, I mean, a couple of months ago, we had the idea for Bolt earlier this year, but we haven't really shared this too much publicly. 
But we actually had tried to build it with some of those state-of-the-art models back in January, February, you can kind of imagine which, and they just weren't good enough to actually do the code generation where the code was accurate and it was fast and whatever have you, without a ton of like RAG, but then there were issues with that. So we put it on the shelf and then we got kind of a sneak peek of some of the new models that have come out in the past couple of months now. And so once we saw that, once we actually saw the code gen from it, we were like, oh my God, like, okay, we can build a product around this. And so that was really the impetus of us building the thing. But with that, it was StackBlitz, the core StackBlitz product the past seven years has been an IDE for developers. So the entire user experience flow we've built up just didn't make sense. And so when we kind of went out to build Bolt, we just thought, you know, if we were inventing our product today, what would the interface look like given what is now possible with the AI code gen? And so there's definitely a lot of conversations we had internally, but you know, just kind of when we logically laid it out, we were like, yeah, I think it makes sense to just greenfield a new thing and let's see what happens. If it works great, then we'll figure it out. If it doesn't work great, then it'll get deleted at some point. So that's kind of how it actually came to be. Swyx [00:02:49]: I'll mention your background a little bit. You were also founder of Thinkster before you started StackBlitz. So both of you are second time founders. Both of you have sort of re-founded your company recently. Yours was more of a rename. I think a slightly different direction as well. And then we can talk about both. Maybe just chronologically, should we get caught up on where Qodo is first and then, you know, just like what people should know since the last pod? Sure. Itamar [00:03:12]: The last pod was two months after we launched and we basically had the vision that we talked about. The idea that software development is about specification, test and code, etc. We are more on the testing part, as in essence, we think that if you solve testing, you solve software development. The beautiful chart that we'll put up on screen. And testing is a really big field, like there are many dimensions: unit testing, the level of the component, how big it is, how large it is. And then there are different types of testing, is it regression or smoke or whatever. So back then we only had one IDE extension with unit tests as the focus. One and a half years later, our first IDE extension supports more types of testing and is context-aware. We index local repos, but also 10,000s of repos for Fortune 500 companies. We have another agent, another tool: the PR-Agent is the open source one and the commercial one is Qodo Merge. And then we have another open source called CoverAgent, which is not yet a commercial product, coming very soon. It's very impressive. It could be that already people are approving automated pull requests that they aren't even aware of, in really big open source projects. So once we have enough of these, we will also launch another agent. So for the fir
We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate! We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show! The vibe shift we observed in July - in favor of Claude 3.5 Sonnet, first introduced in June — has been remarkably long-lived and persistent, surviving multiple subsequent updates of 4o, o1 and Gemini versions, for Anthropic’s Claude to end 2024 as the preferred model for AI Engineers and even being the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month. Anthropic has now raised an additional $4b from Amazon and made an incredibly well received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors: Solving SWE-Bench As part of the October Sonnet release, Anthropic teased a blink-and-you’ll-miss-it result: The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor. This was followed up by a blogpost a week later from today’s guest, Erik Schluntz, the engineer who implemented and scored this SOTA result using a simple, non-overengineered version of the SWE-Agent framework (you can see the submissions here). We have previously covered the SWE-Bench story extensively: * Speaking with SWEBench/SWEAgent authors at ICLR * Speaking with Cosine Genie, the previous SOTA (43.8%) on SWE-Bench Verified (with brief update at DevDay 2024) * Speaking with Shunyu Yao on SWE-Bench and the ReAct paradigm driving SWE-Agent One of the notable inclusions in this blogpost is the tools that Erik decided to give Claude, e.g. the “Edit Tool”: The tools teased in the SWE-Bench submission/blogpost were then polished up and released with Computer Use… And you can also see even more computer use tools given in the new Model Context Protocol servers: Claude Computer Use Because it is one of the best-received AI releases of the year, we recommend watching the 2 minute Computer Use intro (and related demos) in its entirety: Erik also worked on Claude’s function calling, tool use, and computer use APIs, so we discuss that in the episode. Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. 
But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero. As you’ll see, this is very top of mind for Erik as a former robotics founder whose company basically used robots to interface with human physical systems like elevators. Full Video episode Please like and subscribe! Show Notes * Erik Schluntz * “Raising the bar on SWE-Bench Verified” * Cobalt Robotics * SWE-Bench * SWE-Bench Verified * Human Eval & other benchmarks * Anthropic Workbench * Aider * Cursor * Fireworks AI * E2B * Amanda Askell * Toyota Research * Physical Intelligence (Pi) * Chelsea Finn * Josh Albrecht * Eric Jang * 1X * Dust * Cosine Episode * Bolt * Adept Episode * TauBench * LMSys Episode Timestamps * [00:00:00] Introductions * [00:03:39] What is SWE-Bench? * [00:12:22] SWE-Bench vs HumanEval vs others * [00:15:21] SWE-Agent architecture and runtime * [00:21:18] Do you need code indexing? * [00:24:50] Giving the agent tools * [00:27:47] Sandboxing for coding agents * [00:29:16] Why not write tests? * [00:30:31] Redesigning engineering tools for LLMs * [00:35:53] Multi-agent systems * [00:37:52] Why XML so good? * [00:42:57] Thoughts on agent frameworks * [00:45:12] How many turns can an agent do? * [00:47:12] Using multiple model types * [00:51:40] Computer use and agent use cases * [00:59:04] State of AI robotics * [01:04:24] Robotics in manufacturing * [01:05:01] Hardware challenges in robotics * [01:09:21] Is self-driving a good business? Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners. And today we're in the new studio with my usual co-host, Shawn from Smol AI. Swyx [00:00:14]: Hey, and today we're very blessed to have Erik Schluntz from Anthropic with us. Welcome. Erik [00:00:19]: Hi, thanks very much. I'm Erik Schluntz. I'm a member of technical staff at Anthropic, working on tool use, computer use, and SWE-bench. Swyx [00:00:27]: Yeah. Well, how did you get into just the whole AI journey? I think you spent some time at SpaceX as well? Yeah. And robotics. Yeah. There's a lot of overlap between the robotics people and the AI people, and maybe there's some overlap or interest between language models for robots right now. Maybe just a little bit of background on how you got to where you are. Yeah, sure. Erik [00:00:50]: I was at SpaceX a long time ago, but before joining Anthropic, I was the CTO and co-founder of Cobalt Robotics. We built security and inspection robots. These are sort of five foot tall robots that would patrol through an office building or a warehouse looking for anything out of the ordinary. Very friendly, no tasers or anything. We would just sort of call a remote operator if we saw anything. We have about 100 of those out in the world, and had a team of about 100. We actually got acquired about six months ago, but I had left Cobalt about a year ago now, because I was starting to get a lot more excited about AI. I had been writing a lot of my code with things like Copilot, and I was like, wow, this is actually really cool. If you had told me 10 years ago that AI would be writing a lot of my code, I would say, hey, I think that's AGI. 
And so I kind of realized that we had passed this level, like, wow, this is actually really useful for engineering work. That got me a lot more excited about AI and learning about large language models. So I ended up taking a sabbatical and then doing a lot of reading and research myself and decided, hey, I want to go be at the core of this and joined Anthropic. Alessio [00:01:53]: And why Anthropic? Did you consider other labs? Did you consider maybe some of the robotics companies? Erik [00:02:00]: So I think at the time I was a little burnt out of robotics, and so also for the rest of this, any sort of negative things I say about robotics or hardware is coming from a place of burnout, and I reserve my right to change my opinion in a few years. Yeah, I looked around, but ultimately I knew a lot of people that I really trusted and I thought were incredibly smart at Anthropic, and I think that was the big deciding factor to come there. I was like, hey, this team's amazing. They're not just brilliant, but sort of like the most nice and kind people that I know, and so I just felt like I could be a really good culture fit. And ultimately, I do care a lot about AI safety and making sure that I don't want to build something that's used for bad purposes, and I felt like the best chance of that was joining Anthropic. Alessio [00:02:39]: And from the outside, these labs kind of look like huge organizations that have these obscure Swyx [00:02:44]: ways to organize. Alessio [00:02:45]: How did you get, you joined Anthropic, did you already know you were going to work on some of the stuff you publish, or do you kind of join and then figure out where you land? I think people are always curious to learn more. Erik [00:02:57]: Yeah, I've been very happy that Anthropic is very bottoms up and sort of very receptive to whatever your interests are. And so I joined sort of being very transparent of like, hey, I'm most excited about code generation and AI that can actually go out and sort of touch the world or sort of help people build things. And, you know, those weren't my initial projects. I also came in and said, hey, I want to do the most valuable possible thing for this company and help Anthropic succeed. And, you know, like, let me find the balance of those. So I was working on lots of things at the beginning, you know, function calling, tool use. And then sort of as it became more and more relevant, I was like, oh, hey, it's time to go work on coding agents, and sort of started looking at SWE-Bench as a really good benchmark for that. Swyx [00:03:39]: So let's get right into SWE-Bench. That's one of the many claims to fame. I feel like there's just been a series of releases related with Claude 3.5 Sonnet. Around about two or three months ago, 3.5 Sonnet came out and it was a step ahead in terms of a lot of people immediately fell in love with it for coding. And then last month you released a new updated version of Claude Sonnet. We're not going to talk about the training for that because that's s
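Stepping out of the transcript for a moment: since much of the episode is about the tools Erik gave Claude (like the Edit Tool), here is a minimal sketch of defining a file-edit tool with the Anthropic Messages API. The tool name and input schema are our own illustration; the production Edit Tool described in the blogpost is richer than this:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A pared-down, illustrative string-replace edit tool in the generic
# tool-use format. The name and schema here are hypothetical.
edit_tool = {
    "name": "str_replace_edit",
    "description": "Replace an exact string in a file with a new string.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path of the file to edit"},
            "old_str": {"type": "string", "description": "Exact text to find"},
            "new_str": {"type": "string", "description": "Replacement text"},
        },
        "required": ["path", "old_str", "new_str"],
    },
}

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[edit_tool],
    messages=[{"role": "user", "content": "Rename function foo to bar in utils.py"}],
)

for block in response.content:
    if block.type == "tool_use":
        # A full agent loop would apply the edit here and send back a
        # tool_result message, repeating until the model stops calling tools.
        print(block.name, block.input)
```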
We have a full slate of upcoming events: AI Engineer London, AWS Re:Invent in Las Vegas, and now Latent Space LIVE! at NeurIPS in Vancouver and online. Sign up to join and speak! We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show! We try to stay close to the inference providers as part of our coverage, as our podcasts with Together AI and Replicate will attest. However, one of the most notable pull quotes from our very well received Braintrust episode was Ankur Goyal's opinion that open source model adoption has NOT gone very well and is actually declining in relative market share terms (it is of course increasing in absolute terms): Today’s guest, Lin Qiao, would wholly disagree. Her team of PyTorch/GPU experts is wholly dedicated to helping you serve and finetune the full stack of open source models from Meta and others, across all modalities (Text, Audio, Image, Embedding, Vision-understanding), helping customers like Cursor and HubSpot scale up open source model inference both rapidly and affordably. Fireworks has emerged after its successive funding rounds with top tier VCs as one of the leaders of the Compound AI movement, a term first coined by the Databricks/Mosaic gang at Berkeley AI and adapted as “Composite AI” by Gartner: Replicating o1 We are the first podcast to discuss Fireworks’ f1, their proprietary replication of OpenAI’s o1. This has become a surprisingly hot area of competition in the past week as both Nous Forge and DeepSeek r1 have launched competitive models. Full Video Podcast Like and subscribe! Timestamps * 00:00:00 Introductions * 00:02:08 Pre-history of Fireworks and PyTorch at Meta * 00:09:49 Product Strategy: From Framework to Model Library * 00:13:01 Compound AI Concept and Industry Dynamics * 00:20:07 Fireworks' Distributed Inference Engine * 00:22:58 OSS Model Support and Competitive Strategy * 00:29:46 Declarative System Approach in AI * 00:31:00 Can OSS replicate o1? * 00:36:51 Fireworks f1 * 00:41:03 Collaboration with Cursor and Speculative Decoding * 00:46:44 Fireworks quantization (and drama around it) * 00:49:38 Pricing Strategy * 00:51:51 Underrated Features of Fireworks Platform * 00:55:17 Hiring Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:11]: Hey, and today we're in a very special studio inside the Fireworks office with Lin Qiao, CEO of Fireworks. Welcome. Yeah. Lin [00:00:20]: Oh, you should welcome us. Swyx [00:00:21]: Yeah, welcome. Yeah, thanks for having us. It's unusual to be in the home of a startup, but it's also, I think our relationship is a bit unusual compared to all our normal guests. Definitely. Lin [00:00:34]: Yeah. I'm super excited to talk about very interesting topics in that space with both of you. Swyx [00:00:41]: You just celebrated your two-year anniversary yesterday. Lin [00:00:43]: Yeah, it's quite a crazy journey. We circle around and share all the crazy stories across these two years, and it has been super fun. All the way from experiencing the Silicon Valley Bank run, to deleting some data that shouldn't be deleted, operationally. We went through massive scale where we actually were busy getting capacity, to, yeah, we learned to kind of work with it as a team with a lot of brilliant people across different places joining the company. It has really been a fun journey. 
Alessio [00:01:24]: When you started, did you think the technical stuff would be harder, or the bank run and then the people side? I think there's a lot of amazing researchers that want to do companies, and it's like the hardest thing is going to be building the product, and then you have all these different other things. So, what surprised you the most in your experience? Lin [00:01:42]: Yeah, to be honest with you, my focus has always been on the product side and then, after, the product go-to-market. And I didn't realize the rest would be so complicated, operating a company and so on. But because I don't think about it, I just kind of manage it. So it's done. I think I just somehow don't think about it too much and solve whatever problems come our way, and it worked. Swyx [00:02:08]: So let's, I guess, start at the pre-history, the initial history of Fireworks. You ran the PyTorch team at Meta for a number of years, and we previously had Soumith Chintala on, and I think we were just all very interested in the history of GenAI. Maybe not that many people know how deeply involved FAIR and Meta were prior to the current GenAI revolution. Lin [00:02:35]: My background is deep in distributed systems and database management systems. And I joined Meta from the data side, and I saw this tremendous amount of data growth, which cost a lot of money, and we were analyzing what's going on. And it's clear that AI is driving all this data generation. So it's a very interesting time, because when I joined Meta, Meta is going through ramping down mobile-first, finishing the mobile-first transition and then starting AI-first. And there's a fundamental reason for that sequence, because mobile-first gave a full range of user engagement that has never existed before. And all this user engagement generated a lot of data, and this data powers AI. So then the whole entire industry is also following this same transition. When I see, oh, okay, this AI is powering all this data generation, and look at where's our AI stack, there's no software, there's no hardware, there's no people, there's no team. I wanted to dive in there and help this movement. So when I started, it was a very interesting industry landscape. There are a lot of AI frameworks. It's a kind of proliferation of AI frameworks happening in the industry. But all the AI frameworks focus on production, and they use a very certain way of defining the graph of the neural network, and then use that to drive the model iteration and productionization. And PyTorch is completely different. Soumith could assume that he was the user of his own product. And he basically says, researchers face so much pain using existing AI frameworks, this is really hard to use and I'm going to do something different for myself. And that's the origin story of PyTorch. PyTorch actually started as the framework for researchers. They don't care about production at all. And as they grow in terms of adoption, so the interesting part of AI is that research sits upstream of production. There are so many researchers across academia, across industry, they innovate and they put their results out there in open source, and that powers the downstream productionization. So it's brilliant for Meta to establish PyTorch as a strategy to drive massive adoption in open source, because Meta internally is a PyTorch shop. So it creates a flywheel effect. So that's kind of the strategy behind PyTorch. 
But when I took on PyTorch, it was kind of at a cusp: Meta had established PyTorch as the framework for both research and production. No one had done that before. And we had to rethink how to architect PyTorch so we could really sustain production workloads: the stability, reliability, low latency. All these production concerns were never a concern before; now they were. And we actually had to adjust its design and make it work for both sides. And that took us five years, because Meta has so many AI use cases, all the way from ranking and recommendation powering the business top line (ranking newsfeed, video ranking), to site integrity (detecting bad content automatically using AI), to all kinds of effects: translation, image classification, object detection, all this. And also across AI running on the server side, on mobile phones, on AR/VR devices, the wide spectrum. So by that time, we had basically managed to support AI ubiquitously across Meta. But interestingly, through open source engagement, we worked with a lot of companies, and it was clear to us that the industry was starting to take on the AI-first transition. And of course, Meta's hyperscale always goes ahead of the industry. It felt like when we started this AI journey at Meta, there was no software, no hardware, no team. For many companies we engaged with through PyTorch, we could feel that pain. That's the genesis of why we felt like, hey, if we create Fireworks and support the industry going through this transition, it will be a huge amount of impact. Of course, the problems the industry is facing will not be the same as Meta's. Meta is so big, right? So it's kind of skewed towards extreme scale and extreme optimization, and the industry will be different. But we felt like we had the technical chops and we had seen a lot. We would look to kind of drive that. So yeah, so that's how we started. Swyx [00:06:58]: When you and I chatted about the origins of Fireworks, it was originally envisioned more as a PyTorch platform, and then later became much more focused on generative AI. Is that fair to say? What was the customer discovery here? Lin [00:07:13]: Right. So I would say our initial blueprint was that we should build a PyTorch cloud, because PyTorch is a library and there was no SaaS platform to enable AI workloads. Swyx [00:07:26]: Even in 2022, it's interesting. Lin [00:07:28]: I would not say absolutely none; cloud providers had some of that, but it was not a first-class citizen, right? In 2022, TensorFlow was still massively in production. And this was all pre-GenAI, and PyTorch was getting more and more adoption. But there was no PyTorch-first SaaS platform in existence. At the same time, we are also a very pragmatic set of people. We really wanted to make sure that from the get-go, we got really, really close to customers. We understand their use case, we understand their pain points, we understand the value we deliver to them. So we wanted to take a different approach in
Alessio will be at AWS re:Invent next week and hosting a casual coffee meetup on Wednesday, RSVP here! And subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups! We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show! If you've been following the AI agents space, you have probably heard of Lindy AI; while founder Flo Crivello is hesitant to call it "blowing up," when folks like Andrew Wilkinson start obsessing over your product, you're definitely onto something. In our latest episode, Flo walked us through Lindy's evolution from late 2022 to now, revealing some agent platform design choices that go against conventional wisdom in the space. The Great Reset: From Text Fields to Rails Remember late 2022? Everyone was "LLM-pilled," believing that if you just gave a language model enough context and tools, it could do anything. Lindy 1.0 followed this pattern: * Big prompt field ✅ * Bunch of tools ✅ * Prayer to the LLM gods ✅ Fast forward to today, and Lindy 2.0 looks radically different. As Flo put it (~17:00 in the episode): "The more you can put your agent on rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user." Instead of a giant, intimidating text field, users now build workflows visually: * Trigger (e.g., "Zendesk ticket received") * Required actions (e.g., "Check knowledge base") * Response generation This isn't just a UI change - it's a fundamental rethinking of how to make AI agents reliable (there's a minimal code sketch of this pattern at the end of this section). As Swyx noted during our discussion: "Put Shoggoth in a box and make it a very small, minimal viable box. Everything else should be traditional if-this-then-that software." The Surprising Truth About Model Limitations Here's something that might shock folks building in the space: with Claude 3.5 Sonnet, the model is no longer the bottleneck. Flo's exact words (~31:00): "It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small." Some context: Lindy started when context windows were 4K tokens. Today, their system prompt alone is larger than that. But what's really interesting is what this means for platform builders: * Raw capabilities aren't the constraint anymore * Integration quality matters more than model performance * User experience and workflow design are the new bottlenecks The Search Engine Parallel: Why Horizontal Platforms Might Win One of the spiciest takes from our conversation was Flo's thesis on horizontal vs. vertical agent platforms. He draws a fascinating parallel to search engines (~56:00): "I find it surprising the extent to which a horizontal search engine has won... You go through Google to search Reddit. You go through Google to search Wikipedia... search in each vertical has more in common with search than it does with each vertical." His argument: agent platforms might follow the same pattern because: * Agents across verticals share more commonalities than differences * There's value in having agents that can work together under one roof * The R&D cost of getting agents right is better amortized across use cases This might explain why we're seeing early vertical AI companies starting to expand horizontally. The core agent capabilities - reliability, context management, tool integration - are universal needs.
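To make the rails idea concrete: Lindy has not published its internals, so the snippet below is only a minimal sketch of the pattern, and every name in it is hypothetical. The trigger and the required action are plain deterministic code; the LLM is confined to a single, narrowly scoped step.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    subject: str
    body: str

def search_knowledge_base(query: str) -> list[str]:
    # Required action: deterministic retrieval (a vector store in practice).
    kb = {"reset password": "Go to Settings > Security > Reset password."}
    return [answer for key, answer in kb.items() if key in query.lower()]

def draft_reply(ticket: Ticket, passages: list[str]) -> str:
    # The ONLY step that would call an LLM in a real system; stubbed here.
    return f"Re: {ticket.subject}\n\nPer our docs: {passages[0]}"

def on_zendesk_ticket(ticket: Ticket) -> str:
    # Trigger -> required action -> response, as fixed rails.
    passages = search_knowledge_base(ticket.subject + " " + ticket.body)
    if not passages:  # the guardrail lives in code, not in the prompt
        return "escalated to a human"
    return draft_reply(ticket, passages)

print(on_zendesk_ticket(Ticket("Reset password", "How do I reset it?")))
```

Note where the guardrail sits: the empty-retrieval escalation is ordinary if-this-then-that software rather than a prompt instruction, which is what makes the agent's behavior predictable and auditable.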
What This Means for Builders If you're building in the AI agents space, here are the key takeaways: * Constrain First: Rather than maximizing capabilities, focus on reliable execution within narrow bounds * Integration Quality Matters: With model capabilities plateauing, your competitive advantage lies in how well you integrate with existing tools * Memory Management is Key: Flo revealed they actively prune agent memories - even with larger context windows, not all memories are useful * Design for Discovery: Lindy's visual workflow builder shows how important interface design is for adoption. The Meta Layer There's a broader lesson here about AI product development. Just as Lindy evolved from "give the LLM everything" to "constrain intelligently," we might see similar evolution across the AI tooling space. The winners might not be those with the most powerful models, but those who best understand how to package AI capabilities in ways that solve real problems reliably. Full Video Podcast Flo's talk at AI Engineer Summit Chapters * 00:00:00 Introductions * 00:04:05 AI engineering and deterministic software * 00:08:36 Lindys demo * 00:13:21 Memory management in AI agents * 00:18:48 Hierarchy and collaboration between Lindys * 00:21:19 Vertical vs. horizontal AI tools * 00:24:03 Community and user engagement strategies * 00:26:16 Rickrolling incident with Lindy * 00:28:12 Evals and quality control in AI systems * 00:31:52 Model capabilities and their impact on Lindy * 00:39:27 Competition and market positioning * 00:42:40 Relationship between Factorio and business strategy * 00:44:05 Remote work vs. in-person collaboration * 00:49:03 Europe vs US Tech * 00:58:59 Testing the Overton window and free speech * 01:04:20 Balancing AI safety concerns with business innovation Show Notes * Lindy.ai * Rick Rolling * Flo on X * TeamFlow * Andrew Wilkinson * Dust * Poolside.ai * SB1047 * Gathertown * Sid Sijbrandij * Matt Mullenweg * Factorio * Seeing Like a State Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:12]: Hey, and today we're joined in the studio by Florent Crivello. Welcome. Flo [00:00:15]: Hey, yeah, thanks for having me. Swyx [00:00:17]: Also known as Altimore. I always wanted to ask, what is Altimore? Flo [00:00:21]: It was the name of my character when I was playing Dungeons & Dragons. Always. I was like 11 years old. Swyx [00:00:26]: What were your classes? Flo [00:00:27]: I was an elf. I was a magician elf. Swyx [00:00:30]: Well, you're still spinning magic. Right now, you're a solo founder and CEO of Lindy.ai. What is Lindy? Flo [00:00:36]: Yeah, we are a no-code platform letting you build your own AI agents easily. So you can think of it as: we are to LangChain as Airtable is to MySQL. You can just spin up AI agents super easily by clicking around, no code required. You don't have to be an engineer, and you can automate business workflows that you simply could not automate before, in a few minutes. Swyx [00:00:55]: You've been in our orbit a few times. I think you spoke at our Latent Space anniversary. You spoke at my summit, the first summit, which was a really good keynote. And most recently, like we actually already scheduled this podcast before this happened. But Andrew Wilkinson was like, I'm obsessed by Lindy. He's just created a whole bunch of agents. So basically, why are you blowing up? Flo [00:01:16]: Well, thank you.
I think we are having a little bit of a moment. I think it's a bit premature to say we're blowing up. But why are things going well? We revamped the product majorly. We called it Lindy 2.0. I would say we started working on that six months ago. We've actually not really announced it yet. It's just, I guess that's what we're doing now. And so we've basically been cooking for the last six months, like really rebuilding the product from scratch. I think, actually, the last time you tried the product, it was still Lindy 1.0. Oh, yeah. If you log in now, the platform looks very different. There's like a ton more features. And I think one realization that we made, and I think a lot of folks in the agent space made the same realization, is that there is such a thing as too much of a good thing. I think many people, when they started working on agents, they were very LLM-pilled and ChatGPT-pilled, right? They got ahead of themselves in a way, and us included, and they thought that agents and LLMs were actually more advanced than they actually were. And so the first version of Lindy was like just a giant prompt and a bunch of tools. And then the realization we had was like, hey, actually, the more you can put your agent on rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user, because instead of just getting this big, giant, intimidating text field, where you type words in and you have no idea if you're typing the right word or not, here you can really click and select step by step, and tell your agent what to do, and really give as narrow or as wide a guardrail as you want for your agent. We started working on that. We called it Lindy on Rails about six months ago, and we started putting it into the hands of users over the last, I would say, two months or so, and I think things really started going pretty well at that point. The agent is way more reliable, way easier to set up, and we're already seeing a ton of new use cases pop up. Swyx [00:03:00]: Yeah, just a quick follow-up on that. You launched the first Lindy in November last year, and you were already talking about having a DSL, right? I remember having this discussion with you, and you were like, it's just much more reliable. Is this still the DSL under the hood? Is this a UI-level change, or is it a bigger rewrite? Flo [00:03:17]: No, it is a much bigger rewrite. I'll give you a concrete example. Suppose you want to have an agent that observes your Zendesk tickets, and it's like, hey, every time you receive a Zendesk ticket, I want you to check my knowledge base, so it's like a RAG module and whatnot, and then answer the ticket. The way it used to work with Lindy before was, you would type the prompt asking it to do that. You check my knowledge base, and so on and so forth. The problem with doing
We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show! Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups! In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We've had Harrison from LangChain on twice (as a guest and as a co-host), and we've now finally come full circle as Stanislas from Dust joined us in the studio. After stints at Oracle and Stripe, Stan had joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13th second, you're like, 'oh, yeah, that was obvious.' And you go back to digging." This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down." The History of Dust Dust's journey can be broken down into three phases: * Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn't just `print` statements. * Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4. * Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments. The Case for Being Horizontal The big discussion for early stage companies today is whether to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value, becoming more of a "Services as Software" play. Dust, on the other hand, is a platform for users to build their own experiences, which has had a few advantages: * Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team. * Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions. * Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures. The Vertical Challenge However, this approach comes with trade-offs: * Harder Go-to-Market: As Stan put it: "We spike at penetration... but it makes our go-to-market much harder.
Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'" * Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales. * Product Surface Complexity: Creating an interface that's both powerful and accessible to non-technical users requires careful design decisions, down to avoiding technical terms like "system prompt" in favor of "instructions." The Future of AI Platforms Stan initially predicted we'd see the first billion-dollar single-person company in 2023 (a prediction later echoed by Sam Altman), but he's now more focused on a different milestone: billion-dollar companies with engineering teams of just 20 people, enabled by AI assistance. This vision aligns with Dust's horizontal platform approach – building the infrastructure that allows small teams to achieve outsized impact through AI augmentation. Rather than replacing entire job functions (the vertical approach), they're betting on augmenting existing workflows across organizations. Full YouTube Episode Chapters * 00:00:00 Introductions * 00:04:33 Joining OpenAI from Paris * 00:09:54 Research evolution and compute allocation at OpenAI * 00:13:12 Working with Ilya Sutskever and OpenAI's vision * 00:15:51 Leaving OpenAI to start Dust * 00:18:15 Early focus on browser extension and WebGPT-like functionality * 00:20:20 Dust as the infrastructure for agents * 00:24:03 Challenges of building with early AI models * 00:28:17 LLMs and Workflow Automation * 00:35:28 Building dependency graphs of agents * 00:37:34 Simulating API endpoints * 00:40:41 State of AI models * 00:43:19 Running evals * 00:46:36 Challenges in building AI agents infra * 00:49:21 Buy vs. build decisions for infrastructure components * 00:51:02 Future of SaaS and AI's Impact on Software * 00:53:07 The single employee $1B company race * 00:56:32 Horizontal vs. vertical approaches to AI agents Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:11]: Hey, and today we're in a studio with Stanislas, welcome. Stan [00:00:14]: Thank you very much for having me. Swyx [00:00:16]: Visiting from Paris. Stan [00:00:17]: Paris. Swyx [00:00:18]: And you have had a very distinguished career. It's very hard to summarize, but you went to college at both École Polytechnique and Stanford, and then you worked in a number of places, Oracle, Totems, Stripe, and then OpenAI pre-ChatGPT. We'll spend a little bit of time on that. About two years ago, you left OpenAI to start Dust. I think you were one of the first OpenAI alum founders. Stan [00:00:40]: Yeah, I think it was about at the same time as the Adept guys, so that first wave. Swyx [00:00:46]: Yeah, and people really loved our David episode. We love a few sort of OpenAI stories, you know, from back in the day, like we were talking about pre-recording. Probably the statute of limitations on some of those stories has expired, so you can talk a little bit more freely without them coming after you. But maybe we'll just talk about, like, what was your journey into AI? You know, you were at Stripe for almost five years, there are a lot of Stripe alums going into OpenAI.
I think the Stripe culture has come into OpenAI quite a bit. Stan [00:01:11]: Yeah, so I think the buses of Stripe people really started flowing in, I guess, after ChatGPT. But, yeah, my journey into AI is a... I mean, Greg Brockman. Yeah, yeah. From Greg, of course. And Daniela, actually, back in the days, Daniela Amodei. Swyx [00:01:27]: Yes, she was COO, I mean, she is COO, yeah. She had a pretty high job at OpenAI at the time, yeah, for sure. Stan [00:01:34]: My journey started like anybody else's: you're fascinated with computer science and you want to make machines think. It's awesome, but it doesn't work. I mean, it was a long time ago; I was like maybe 16, so it was 25 years ago. Then the first big exposure to AI would be at Stanford, and I'm going to, like, disclose how old I am, because at the time it was a class taught by Andrew Ng, and there was no deep learning. It was hand-crafted features for vision and the A* algorithm. So it was fun. But it was the early days of deep learning. I think a few years after, there was that first project at Google, you know, the cat face or the human face trained from many images. I hesitated doing a PhD, more in systems, and eventually decided to go get a job. Went to Oracle, started a company, did a gazillion mistakes, got acquired by Stripe, worked with Greg Brockman there. And at the end of Stripe, I started getting interested in AI again; it felt like it was the time. You had the Atari games, you had the self-driving craziness at the time. And I started exploring projects. It felt like the Atari games were incredible, but they were still games. And I was looking into exploring projects that would have an impact on the world. And so I decided to explore three things: self-driving cars, cybersecurity and AI, and math and AI. I'm citing them by decreasing order of impact on the world, I guess. Swyx [00:03:01]: Discovering new math would be very foundational. Stan [00:03:03]: It is extremely foundational, but it's not as direct as driving people around. Swyx [00:03:07]: Sorry, you're doing this at Stripe, you're like thinking about your next move. Stan [00:03:09]: No, it was at Stripe, kind of a bit of time where I started exploring. I did a bunch of work with friends on trying to get RC cars to drive autonomously. Almost started a company in France or Europe about self-driving trucks. We decided to not go for it because it was probably very operational. And I think the idea of the company, of the team wasn't there. And also I realized that if I wake up one day and, because of a bug I wrote, I killed a family, it would be a bad experience. And so I just decided like, no, that's just too crazy. And then I explored cybersecurity with a friend. We were trying to apply transformers to fuzzing. So with fuzzing, you have kind of an algorithm that goes really fast and tries to mutate the inputs of a library to find bugs.
Apologies for lower audio quality; we lost recordings and had to use backup tracks. Our guests today are Anastasios Angelopoulos and Wei-Lin Chiang, leads of Chatbot Arena, fka LMSYS, the crowdsourced AI evaluation platform developed by the LMSys student club at Berkeley, which became the de facto standard for comparing language models. For many folks, Arena Elo is now cited more often than MMLU scores, and the platform has attracted >1,000,000 people to cast votes since launch, leading top model trainers to cite it over their own formal academic benchmarks: The Limits of Static Benchmarks We've done two benchmarks episodes: Benchmarks 101 and Benchmarks 201. One issue we've always brought up with static benchmarks is that 1) many are getting saturated, with models scoring almost perfectly on them, and 2) they often don't reflect production use cases, making it hard for developers and users to use them as guidance. The fundamental challenge in AI evaluation isn't technical - it's philosophical. How do you measure something that increasingly resembles human intelligence? Rather than trying to define intelligence upfront, Arena lets users interact naturally with models and collects comparative feedback. It's messy and subjective, but that's precisely the point - it captures the full spectrum of what people actually care about when using AI. The Pareto Frontier of Cost vs Intelligence Because the Elo scores are remarkably stable over time, we can put all the chat models on a map against their respective cost to gain a view of at least 3 orders of magnitude of model sizes/costs and observe the remarkable shift in intelligence per dollar over the past year: This frontier stood remarkably firm through the recent releases of o1-preview and price cuts of Gemini 1.5: The Statistics of Subjectivity In our Benchmarks 201 episode, Clémentine Fourrier from HuggingFace thought this design choice was one of the shortcomings of arenas: they aren't reproducible. You don't know who ranked what and what exactly the outcome was at the time of ranking. That same person might rank the same pair of outputs differently on a different day, or might ask harder questions of better models compared to smaller ones, making it imbalanced. Another argument that people have brought up is confirmation bias. We know humans prefer longer responses and are swayed by formatting - Rob Mulla from Dreadnode had found some interesting data on this in May: The approach LMArena is taking is to use logistic regression to decompose human preferences into constituent factors. As Anastasios explains: "We can say what components of style contribute to human preference and how they contribute." By adding these style components as parameters, they can mathematically "suck out" their influence and isolate the core model capabilities. This extends beyond just style - they can control for any measurable factor: "What if I want to look at the cost adjusted performance? Parameter count? We can ex post facto measure that." This is one of the most interesting things about Arena: you have a data generation engine which you can clean and turn into leaderboards later. If you wanted to create a leaderboard for poetry writing, you could get existing data from Arena and normalize it by identifying these style components. Whether or not it's possible to really understand WHAT bias the voters have, that's a different question.
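To see how that decomposition works mechanically, here is a toy version of the idea, not LMArena's actual code, on purely synthetic battles, assuming a single style feature: the normalized length difference between the two responses. Because the length column absorbs the verbosity bias, the per-model coefficients come out as style-controlled strength estimates.

```python
# Toy Bradley-Terry-style regression with one style covariate (length bias).
# Synthetic data only: true strengths and a length bias are planted, then
# recovered. This is the shape of the technique, not LMArena's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model-a", "model-b", "model-c"]
true_strength = [1.0, 0.0, -1.0]   # planted "real" capabilities
rng = np.random.default_rng(0)

X, y = [], []
for _ in range(5000):
    i, j = rng.choice(len(models), size=2, replace=False)
    len_diff = rng.normal()        # normalized (len_i - len_j)
    logit = true_strength[i] - true_strength[j] + 0.8 * len_diff  # 0.8 = planted length bias
    x = np.zeros(len(models) + 1)
    x[i], x[j], x[-1] = 1.0, -1.0, len_diff
    X.append(x)
    y.append(rng.random() < 1 / (1 + np.exp(-logit)))  # did model i win?

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
print("style-controlled strengths:", dict(zip(models, clf.coef_[0][:-1].round(2))))
print("recovered length bias:", clf.coef_[0][-1].round(2))
```

The same trick generalizes: add markdown density, list counts, or any other measurable factor as extra columns, and you read the leaderboard coefficients off after those effects are controlled for.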
Private Evals One of the most delicate challenges LMSYS faces is maintaining trust while collaborating with AI labs. The concern is that labs could game the system by testing multiple variants privately and only releasing the best performer. This was brought up when 4o-mini released and it ranked as the second best model on the leaderboard: But this fear misunderstands how Arena works. Unlike static benchmarks where selection bias is a major issue, Arena's live nature means any initial bias gets washed out by ongoing evaluation. As Anastasios explains: "In the long run, there's way more fresh data than there is data that was used to compare these five models." The other big question is WHAT model is actually being tested; as people often talk about on X / Discord, the same endpoint will randomly feel "nerfed", like it happened for "Claude European summer" and the corresponding conspiracy theories: It's hard to keep track of these performance changes in Arena as these changes (if real…?) are not observable. The Future of Evaluation The team's latest work on RouteLLM points to an interesting future where evaluation becomes more granular and task-specific. But they maintain that even simple routing strategies can be powerful - like directing complex queries to larger models while handling simple tasks with smaller ones. Arena is now going to expand beyond text into multimodal evaluation and specialized domains like code execution and red teaming. But their core insight remains: the best way to evaluate intelligence isn't to simplify it into metrics, but to embrace its complexity and find rigorous ways to analyze it. To go after this vision, they are spinning out Arena from LMSys, which will stay as an academia-driven group at Berkeley. Full Video Podcast Chapters * 00:00:00 - Introductions * 00:01:16 - Origin and development of Chatbot Arena * 00:05:41 - Static benchmarks vs. Arenas * 00:09:03 - Community building * 00:13:32 - Biases in human preference evaluation * 00:18:27 - Style Control and Model Categories * 00:26:06 - Impact of o1 * 00:29:15 - Collaborating with AI labs * 00:34:51 - RouteLLM and router models * 00:38:09 - Future of LMSys / Arena Show Notes * Anastasios Angelopoulos * Anastasios' NeurIPS Paper Conformal Risk Control * Wei-Lin Chiang * Chatbot Arena * LMSys * MTBench * ShareGPT dataset * Stanford's Alpaca project * LLMRouter * E2B * Dreadnode Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:14]: Hey, and today we're very happy and excited to welcome Anastasios and Wei Lin from LMSys. Welcome guys. Wei Lin [00:00:21]: Hey, how's it going? Nice to see you. Anastasios [00:00:23]: Thanks for having us. Swyx [00:00:24]: Anastasios, I actually saw you, I think at last year's NeurIPS. You were presenting a paper, which I don't really super understand, but it was some theory paper about how your method was very dominating over other sort of search methods. I don't remember what it was, but I remember that you were a very confident speaker. Anastasios [00:00:40]: Oh, I totally remember you. Didn't ever connect that, but yes, that's definitely true. Yeah. Nice to see you again. Swyx [00:00:46]: Yeah. I was frantically looking for the name of your paper and I couldn't find it. Basically I had to cut it because I didn't understand it. Anastasios [00:00:51]: Is this conformal PID control or was this the online control? Wei Lin [00:00:55]: Blast from the past, man. Swyx [00:00:57]: Blast from the past.
It's always interesting how NeurIPS and all these academic conferences are sort of six months behind what people are actually doing, but conformal risk control, I would recommend people check it out. I have the recording. I just never published it just because I was like, I don't understand this enough to explain it. Anastasios [00:01:14]: People won't be interested. Wei Lin [00:01:15]: It's all good. Swyx [00:01:16]: But ELO scores, ELO scores are very easy to understand. You guys are responsible for the biggest revolution in language model benchmarking in the last few years. Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history of LMSys. Wei Lin [00:01:32]: Hey, I'm Wei Lin. I'm a fifth year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourced AI benchmarking. Anastasios [00:01:43]: I'm Anastasios. I'm a sixth year PhD student here at Berkeley. I did most of my PhD on, like, theoretical statistics and sort of foundations of model evaluation and testing. And now I'm working 150% on this Chatbot Arena stuff. It's great. Alessio [00:02:00]: And what was the origin of it? How did you come up with the idea? How did you get people to buy in? And then maybe what were one or two of the pivotal moments early on that kind of made it the standard for these things? Wei Lin [00:02:12]: Yeah, yeah. The Chatbot Arena project was started last year in April, May, around then. Before that, we were basically experimenting in the lab with how to fine-tune an open source chatbot based on the Llama 1 model that Meta released. At that time, Llama 1 was like a base model and people didn't really know how to fine-tune it. So we were doing some explorations. We were inspired by Stanford's Alpaca project. So we basically, yeah, grew a data set from the internet, which is called the ShareGPT data set, which is like a dialogue data set of conversations between users and ChatGPT. It turned out to be pretty high quality dialogue data. So we fine-tuned on it, trained it, and released the model, called Vicuna. And people were very excited about it, because it kind of demonstrated that an open-weights model can reach conversation capability similar to ChatGPT. And then we basically released the model weights and also built a demo website for the model. People were very excited about it. But during the development, the biggest challenge for us at the time was: how do we even evaluate it? How do we even argue that this model we trained is better than others? And then what's the gap between this open source model and other proprietary offerings? At that time, GPT-4 was just announced, and there was Claude 1. What's the difference between them? And then after that, like every week, there was a new model being fine-tuned and released. So even until now, right? And then we had that demo website for Vicuna. And then we thought like, okay, maybe we ca
If you've listened to the podcast for a while, you might have heard our ElevenLabs-powered AI co-host Charlie a few times. Text-to-speech has made amazing progress in the last 18 months, with OpenAI's Advanced Voice Mode (aka "Her") as a sneak peek of the future of AI interactions (see our "Building AGI in Real Time" recap). Yet, we had yet to see a real killer app for AI voice (not counting music). Today's guests, Raiza Martin and Usama Bin Shafqat, are the lead PM and AI engineer behind the NotebookLM feature flag that gave us the first viral AI voice experience, the "Deep Dive" podcast: The idea behind the "Audio Overviews" feature is simple: take a bunch of documents, websites, YouTube videos, etc, and generate a podcast out of them. This was one of the first demos that people built with voice models + RAG + GPT models, but it was always glorified text-to-speech. Raiza and Usama took a very different approach: * Make it conversational: when you listen to a NotebookLM audio there are a ton of micro-interjections (Steven Johnson calls them disfluencies) like "Oh really?" or "Totally", as well as pauses and "uh…", like you would expect in a real conversation. These are not generated by the LLM in the transcript, but they are built into the audio model. See ~28:00 in the pod for more details. * Listeners love tension: if two people are always in agreement on everything, it's not super interesting. They tuned the model to generate flowing conversations that mirror the tone and rhythm of human speech. They did not confirm this, but many suspect the two-year-old SoundStorm paper is related to this model. * Generating new insights: because the hosts' goal is not to summarize, but to entertain, it comes up with funny metaphors and comparisons that actually help expand on the content rather than just paraphrasing like most models do. We have had listeners make podcasts out of our podcasts, like this one. This is different than your average SOTA-chasing, MMLU-driven model buildooor. Putting product and AI engineering in the same room, having them build evals together, and understanding what the goal is lets you get these unique results. The 5 rules for AI PMs We always focus on AI Engineers, but this episode had a ton of AI PM nuggets as well, which we wanted to collect as NotebookLM is one of the most successful products in the AI space: 1. Less is more: the first version of the product had 0 customization options. All you could do was give it source documents, and then press a button to generate. Most users don't know what "temperature" or "top-k" are, so you're often taking the magic away by adding more options in the UI. Since recording they added a few, like a system prompt, but those were features that users were "hacking in", as Simon Willison highlighted in his blog post. 2. Use Real-Time Feedback: they built a community of 65,000 users on Discord that is constantly reporting issues and giving feedback; sometimes they noticed server downtime even before the Google internal monitoring did. Getting real time pings > aggregating user data when doing initial iterations. 3. Embrace Non-Determinism: AI output variability is a feature, not a bug. Rather than limiting the outputs from the get-go, build toggles that you can turn on/off with feature flags as the feedback starts to roll in. 4. Curate with Taste: if you try your product and it sucks, you don't need more data to confirm it. Just scrap that and iterate again.
This is even easier for a product like this; if you start listening to one of the podcasts and turn it off after 10 seconds, it's never a good sign. 5. Stay Hands-On: It's hard to build taste if you don't experiment. Trying out all your competitors' products as well as unrelated tools really helps you understand what users are seeing in market, and how to improve on it. Chapters * 00:00 Introductions * 01:39 From Project Tailwind to NotebookLM * 09:25 Learning from 65,000 Discord members * 12:15 How NotebookLM works * 18:00 Working with Steven Johnson * 23:00 How to prioritize features * 25:13 Structuring the data pipelines * 29:50 How to eval * 34:34 Steering the podcast outputs * 37:51 Defining speakers personalities * 39:04 How do you make audio engaging? * 45:47 Humor is AGI * 51:38 Designing for non-determinism * 53:35 API when? * 55:05 Multilingual support and dialect considerations * 57:50 Managing system prompts and feature requests * 01:00:58 Future of NotebookLM * 01:04:59 Podcasts for your codebase * 01:07:16 Plans for real-time chat * 01:08:27 Wrap up Show Notes * Notebook LM * AI Test Kitchen * Nicholas Carlini * Steven Johnson * Wealth of Nations * Histories of Mysteries by Andrej Karpathy * chicken.pdf Threads * Area 120 * Raiza Martin * Usama Bin Shafqat Transcript NotebookLM [00:00:00]: Hey everyone, we're here today as guests on Latent Space. It's great to be here, I'm a long time listener and fan, they've had some great guests on this show before. Yeah, what an honor to have us, the hosts of another podcast, join as guests. I mean a huge thank you to Swyx and Alessio for the invite, thanks for having us on the show. Yeah really, it seems like they brought us here to talk a little bit about our show, our podcast. Yeah, I mean we've had lots of listeners ourselves, listeners at Deep Dive. Oh yeah, we've made a ton of audio overviews since we launched and we're learning a lot. There's probably a lot we can share around what we're building next, huh? Yeah, we'll share a little bit at least. The short version is we'll keep learning and getting better for you. We're glad you're along for the ride. So yeah, keep listening. Keep listening and stay curious. We promise to keep diving deep and bringing you even better options in the future. Stay curious. Alessio [00:00:52]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. And I'm joined by my co-host, Swyx, founder of Smol.ai. Swyx [00:01:01]: Hey, and today we're back in the studio with our special guest, Raiza Martin. And Usama, I forgot to get your last name, Shafqat. Usama [00:01:10]: Yes. Swyx [00:01:10]: Okay, welcome. Raiza [00:01:12]: Hello, thank you for having us. Swyx [00:01:14]: So AI podcasters meet human podcasters, always fun. Congrats on the success of Notebook LM. I mean, how does it feel? Raiza [00:01:22]: It's been a lot of fun. A lot of it, honestly, was unexpected. But my favorite part is really listening to the audio overviews that people have been making. Swyx [00:01:29]: Maybe we should do a little bit of intros and tell the story. You know, what is your path into the sort of Google AI org? Or maybe, actually, I don't even know what org you guys are in. Raiza [00:01:39]: I can start. My name is Raiza. I lead the Notebook LM team inside of Google Labs. So specifically, that's the org that we're in. It's called Google Labs. It's only about two years old. And our whole mandate is really to build AI products. That's it. We work super closely with DeepMind.
Our entire thing is just, like, try a bunch of things and see what's landing with users. And the background that I have is, really, I worked in payments before this, and I worked in ads right before, and then startups. I tell people, like, at every time that I changed orgs, I actually almost quit Google. Like, specifically, like, in between ads and payments, I was like, all right, I can't do this. Like, this is, like, super hard. I was like, it's not for me. I'm, like, a very zero-to-one person. But then I was like, okay, I'll try. I'll interview with other teams. And when I interviewed in payments, I was like, oh, these people are really cool. I don't know if I'm, like, a super good fit with this space, but I'll try it because the people are cool. And then I really enjoyed that, and then I worked on, like, zero-to-one features inside of payments, and I had a lot of fun. But then the time came again where I was like, oh, I don't know. It's like, it's time to leave. It's time to start my own thing. But then I interviewed inside of Google Labs, and I was like, oh, darn. Like, there's definitely, like— Alessio [00:02:48]: They got you again. Raiza [00:02:49]: They got me again. And so now I've been here for two years, and I'm happy that I stayed because especially with, you know, the recent success of Notebook LM, I'm like, dang, we did it. I actually got to do it. So that was really cool. Usama [00:03:02]: Kind of similar, honestly. I was at a big team at Google. We do sort of the data center supply chain planning stuff. Google has, like, the largest sort of footprint. Obviously, there's a lot of management stuff to do there. But then there was this thing called Area 120 at Google, which does not exist anymore. But I sort of wanted to do, like, more zero-to-one building and landed a role there. We were trying to build, like, a creator commerce platform called Kaya. It launched briefly a couple years ago. But then Area 120 sort of transitioned and morphed into Labs. And, like, over the last few years, like, the focus just got a lot clearer. Like, we were trying to build new AI products and do it in the wild and sort of co-create and all of that. So, you know, we've just been trying a bunch of different things. And this one really landed, which has felt pretty phenomenal. Really, really landed. Swyx [00:03:53]: Let's talk about the brief history of Notebook LM. You had a tweet, which is very helpful for doing research. May 2023, during Google I/O, you announced Project Tailwind. Raiza [00:04:03]: Yeah. Swyx [00:04:03]: So today is October 2024. So you joined October 2022? Raiza [00:04:09]: Actually, I used to lead AI Test Kitchen. And this was actually, I think, not I/O 2023. I/O 2022 is when we launched AI Test Kitchen, or announced it. And I don't know if you remember it. Swyx [00:04:23]: That's how you, like, had the basic prototype for Gemini. Raiza [00:04:26]: Yes, yes, exactly. La
Singapore's GovTech is hosting an AI CTF challenge with ~$15,000 in prizes, starting October 26th, open to both local and virtual hackers. It will be hosted on Dreadnode's Crucible platform; signup here! It is common to say if you want to work in AI, you should come to San Francisco. Not everyone can. Not everyone should. If you can only do meaningful AI work in one city, then AI has failed to generalize meaningfully. As non-Americans working in the US, we know what it’s like to see AI progress so rapidly here, and yet be at a loss for what our home countries can do. Through Latent Space we’ve tried to tell the story of AI outside of the Bay Area bubble; we talked to Notion in New York and Humanloop and Wondercraft in London and HuggingFace in Paris and ICLR in Vienna, and the Reka, RWKV, and Winds of AI Winter episodes were taped in Singapore (the World’s Fair also had Latin America representation and we intend to at least add China, Japan, and India next year). The Role of Government with AI As an intentionally technical resource, we’ve mostly steered clear of regulation and safety debates on the podcast; whether it is safety bills or technoalarmism, often at the cost of our engagement numbers or ability to book big name guests with a political agenda. When SOTA shifts 3x faster than it takes to pass a law, when nobody agrees on definitions of important things, when you can elicit never-before-seen behavior by slightly different prompting or sampling, it is hard enough to simply keep up to speed, so we are happy limiting our role to that. The story of AI progress has more often been achieved in the private sector, usually in spite of, rather than with thanks to, government intervention. But industrial policy is inextricably linked to the business of AI, which we do very much care about, has an explicitly accelerationist intent if not impact, and has a track record of success in correcting for legitimate market failures in private sector investment, particularly outside of the US. It is with this lens we approach today’s episode and special guest, our first with a sitting Cabinet member. Singapore’s National AI Strategy It is well understood that much of Singapore’s economic success is attributable to industrial policy, from direct efforts like the Jurong Town Corporation industrialization to indirect ones like going all in on English as national first language. Singapore’s National AI Strategy grew out of its 2014 Smart Nation initiative, first launched in 2019 and then refreshed in 2023 by Minister Josephine Teo, our guest today. While Singapore is not often thought of as an AI leader, the National University ranks in the top 10 in publications (above Oxford/Harvard!), and many overseas Singaporeans work at the leading AI companies and institutions in the US (and some of us even run leading AI Substacks?). OpenAI has often publicly named the Singapore government as their model example of government collaborator and is opening an office in Singapore in time for DevDay 2024. AI Engineer Nations Swyx first pitched the AI Engineer Nation concept at a private Sovereign AI summit featuring Dr. He Ruimin, Chief AI Officer of Singapore, which eventually led to an invitation to discuss the concept with Minister Teo, the country’s de-facto minister for tech (she calls it Digital Development, for good reasons she explains in the pod). This chat happened (with thanks to Jing Long, Joyce, and other folks from MDDI)! 
The central pitch for any country, not just Singapore, to emphasize and concentrate bets on AI Engineers, compared with other valuable efforts like training more researchers, releasing more government-approved data, or offering more AI funding, is a calculated one, based on the fact that: * GPU clusters and researchers have massive returns to scale and colocation, mostly concentrated in the US, that are irresponsibly expensive to replicate * Even if research stopped today and there was no progress for the next 30 years, there are far more capabilities to unlock and productize from existing foundation models * Good AI Engineering requires genuine skill and is deepening enough to justify sub-specialization as a sub-industry of Software Engineering * Companies and countries with better AI engineer workforces will disproportionately benefit from AI vs those who treat it as one of many equivalent priorities * Tech progress is often framed as "the future is here but it is not evenly distributed". The role of the AI Engineer is therefore to better distribute the state of the art to as much of humanity as possible, including the elderly, poor, and differently abled. All of which are themes we first identified in the Rise of the AI Engineer. Singapore simply has a few additional factors that make it not just a good fit, but an economic imperative: * English speaking, very-online country that is great at STEM * Aging, ex-growth population (Total Fertility Rate of 1.1) * #3 GDP per capita (PPP) country in the world * Physically remote from major economic growth centers ex China/SEA That basically dictates that any continued economic growth must be disconnected from geography, timezone, headcount, and reliance on existing industrial drivers. Short of holding Taylor Swift hostage, making an intentional, concentrated bet on AI industrial policy is Singapore's best option to keep up progress in the 21st century. As a pioneer in treating education policy as the primary long-term determinant of economic success, Singapore may end up with Python as its next National Language in the long run, a proposal we also discussed extensively at the RAISE retreat where this episode was recorded. Because of upcoming election season concerns around the globe, we also took the opportunity to ask about Singapore's recent deepfake (election integrity) law.
Full YouTube episode Show Notes * Josephine Teo Official Bio, Wikipedia * Singapore National AI Strategy * 2019 - v1 * 2023 - v2 * ICLR (machine learning conference) * Philipp Kandal (CPO of Grab) * Temasek * GIC * EDBI * Economic Development Board (EDB) * Michael Fay incident * Quincy Larson * AIBots (internal RAG system for Singapore government) * Slovakia election incident * National AI Strategy - Singapore * Singapore AI Safety Institute * AI Verify * SkillsFuture * Ministry of Digital Development and Information (MDDI) * GovTech * NTU (Nanyang Technological University) Timestamps * 00:00:00 Introductions * 00:00:34 Singapore's National AI Strategy * 00:02:50 Ministry of Digital Development and Information * 00:08:49 Defining a National AI Strategy * 00:14:32 AI Safety and Governance * 00:16:50 AI Adoption in Companies and Government * 00:19:53 Balancing AI Innovation and Safety * 00:22:56 Structuring Government for Rapid Technological Change * 00:27:08 Doing Business with Singapore * 00:32:21 Training and Workforce Development in AI * 00:37:05 Career Transition Help for Post-AI Jobs * 00:40:19 AI Literacy and Coding as a Language * 00:43:28 Sovereign AI and Digital Infrastructure * 00:50:48 Government and AI Workloads * 00:51:02 Favorite AI Use Case in Government * 00:53:52 AI and Elections Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:13]: Hey everyone, this is a very, very special episode. We have here Minister Josephine Teo from Singapore. Welcome. Josephine [00:00:19]: Hi Shawn and hi Alessio. Thank you for having me. Of course. Swyx [00:00:23]: You are the Minister for Digital Development and Information and Second Minister for Home Affairs. We're meeting here at RAISE, which is effectively your agency. Maybe we want to explain a little bit about what Singapore is doing in AI. Josephine [00:00:34]: Well, we've had an AI strategy at the national level for some years now, and about two years ago when generative AI became so prominent, we thought it was about time for us to refresh our national AI strategy. And it's not unusual on such occasions for us to consult widely. We want to talk to people who are familiar with the field. We want to talk to people who are active as practitioners, and we also want to talk to people in Singapore who have an interest in seeing the AI ecosystem develop. So when we put all these together, we discovered something else by chance, and it was really a bonus. This was the fact that there were already Singaporeans that were active in the AI space, particularly in the US, particularly in the Bay Area. And one of the exciting things for us was how could we also consult these Singaporeans who clearly still have a passion for Singapore, they do care about what happens back home, and they want to contribute to it. So that's how RAISE came about. And RAISE actually preceded the publication of the refresh of our national AI strategy, which took place in December last year. So the inputs of the participants from RAISE helped us to sharpen what we thought would be important in building up the AI ecosystem. And also with the encouragement of participants at RAISE, primarily Singaporeans who were doing great work in the US, we decided to raise our ambitions, literally. That's why we say AI for the public good, recognising the fact that commercial interest will certainly drive exciting developments in the industry space.
But keep in mind, there is a need to make sure that AI serves the public good. And we say for Singapore and the world. So the idea is that experiments that are carried out in Singapore, things that are scaled up in Singapore, potentially could have contributions elsewhere in the world. And so AI for the public good, for Singapore and the world. That's how it came about. Alessio [00:02:50]: I was listening to some of your previous interviews, and even the choice of the word "development" in the ministry's name was very specific. You mentioned naming is your ethos. Can you explain maybe a
CEOs of publicly traded companies are often in the news talking about their new AI initiatives, but few of them have built anything with it. Drew Houston from Dropbox is different; he has spent over 400 hours coding with LLMs in the last year and is now refocusing his 2,500+ employees around this new way of working, 17 years after founding the company. Timestamps 00:00 Introductions 00:43 Drew's AI journey 04:14 Revalidating expectations of AI 08:23 Simulation in self-driving vs. knowledge work 12:14 Drew's AI Engineering setup 15:24 RAG vs. long context in AI models 18:06 From "FileGPT" to Dropbox AI 23:20 Is storage solved? 26:30 Products vs Features 30:48 Building trust for data access 33:42 Dropbox Dash and universal search 38:05 The evolution of Dropbox 42:39 Building a "silicon brain" for knowledge work 48:45 Open source AI and its impact 51:30 "Rent, Don't Buy" for AI 54:50 Staying relevant 58:57 Founder Mode 01:03:10 Advice for founders navigating AI 01:07:36 Building and managing teams in a growing company Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and there's no Swyx today, but I'm joined by Drew Houston of Dropbox. Welcome, Drew. Drew [00:00:14]: Thanks for having me. Alessio [00:00:15]: So we're not going to talk about the Dropbox story. We're not going to talk about the Chinatown bus and the flash drive and all that. I think you've talked enough about it. Where I want to start is you as an AI engineer. So as you know, most of our audience is engineering folks, kind of like technology leaders. You obviously run Dropbox, which is a huge company, but you also do a lot of coding. I think you've spent almost 400 hours just, like, coding. So let's start there. What was the first interaction you had with an LLM API and when did the journey start for you? Drew [00:00:43]: Yeah. Well, I think probably all AI engineers, or whatever you call an AI engineer, those people started out as engineers before that. So engineering is my first love. I mean, I grew up as a little kid. I was that kid. My first line of code was at five years old. I just really loved, I wanted to make computer games, like this whole path. That also led me into startups and eventually starting Dropbox. And then with AI specifically, I studied computer science, I did my undergrad, but I didn't do grad level computer science. I sort of got distracted by all the startup things, so I didn't do grad level work. But about several years ago, a couple of things happened. One is I knew I wanted to go from being an engineer to a founder, and then the becoming-a-CEO part I sort of backed into with the job. And so, a couple of realizations. One is that, I mean, there's a lot of repetitive and manual work you have to do as an executive that actually lends itself pretty well to automation, both for my own convenience and then out of interest in learning, I guess, what we call classical machine learning these days. I started really trying to wrap my head around understanding machine learning and information retrieval more formally. So I'd say maybe 2016, 2017 I started writing these successively more elaborate scripts to understand basic classifiers and regression, and again, basic information retrieval and NLP back in those days. And there are sort of two things that came out of that. One is the techniques are super powerful.
And even just studying old school machine learning was a pretty big inversion of the way I had learned engineering, right? You know, I started programming the way everyone starts programming: you're sort of the human giving the algorithm, spelling out to the computer how it should run. And then here's machine learning, where it actually flips that: you give it sort of the answer you want and it'll figure out the algorithm, which was pretty mind-bending. And it was both pretty powerful when I would write tools, like figuring out time audits: where's my time going? Is this meeting a one-on-one or is it a recruiting thing or is it a product strategy thing? I started out doing that manually with my assistant, but then found that this was a very automatable task. Which also had the side effect of teaching me a lot about machine learning. But then there was this big problem: it was very good at tabular, structured data, but anytime it hit, you know, the usual malformed English that humans speak, it would just fall over. I had to kind of abandon a lot of the things that I wanted to build because there was no way to parse text. Like maybe it would sort of identify the part of speech in a sentence or something. But then fast forward to the LLM era. I mean, actually I started trying some of what we would call very small LLMs, before kind of the GPT class models. And it was super hard to get those things working. So these 500-million-parameter models would just be hallucinating and repeating, you know. So actually I'd kind of written it off a little bit. But then the ChatGPT launch, and GPT-3 for sure. And then once people figured out prompting and instruction tuning, this was sort of November-ish 2022, like everybody else I saw the ChatGPT launch as the starting gun for the whole AI era of computing. And then having API access to GPT-3 and then early access to GPT-4, I was like, oh man, it's happening. And so I was literally on my honeymoon, and we're on a beach in Thailand, and I'm coding these AI tools to automate writing, or to assist with writing, and all these different use cases. Alessio [00:04:14]: You're like, I'm never going back to work. I'm going to automate all of it before I get back. Drew [00:04:17]: And I was just, you know, ever since then, I mean, I've always been coding prototypes and just stuff to make my life more convenient, but it escalated a lot after '22. And yeah, I checked, I think it was probably over 400 hours this year so far coding, because I had my paternity leave where I was able to work on some special projects. But yeah, it's a super important part of my whole learning journey, being really hands-on with these things. And I mean, it's probably not a typical recipe, but I really love to get down to the metal as far as how this stuff works. Alessio [00:04:47]: Yeah. So Swyx and I were with Sam Altman in October '22. We were at a hack day at OpenAI, and that's why we started this podcast eventually. But you did an interview with Sam like seven years ago, and he asked you what's the biggest opportunity in startups, and you were like machine learning and AI, and you were almost too early, right? Maybe seven years ago, the models weren't quite there.
How should people think about revalidating expectations of this technology? You know, I think even today people will tell you, oh, models are not really good at X because they were not good 12 months ago, but they're good today. Drew [00:05:19]: What are my heuristics for thinking about that? Yeah, I think the way I look at it now has evolved a lot since when I started. I mean, I think everybody intuitively starts with like, all right, let's try to predict the future or imagine like what's this great end state we're going to get to. And the tricky thing is like often those prognostications are right, but they're right in terms of direction, not when. For example, you know, even in the early days of the internet, the 90s, when things were even like text-based, you know, even before like the browser or things like that, people were like, oh man, you're going to be able to order food, get like a Snickers delivered to your house, you're going to be able to watch any movie ever created. And they were right. But, you know, it took 20 years for that to actually happen. And before you got to DoorDash, you started with like Webvan and Kozmo, and before you get to Spotify, you had to do like Napster and Kazaa and LimeWire and like a bunch of like broken Britney Spears MP3s and malware. So I think the big lesson is being early is the same as being wrong. Being late is the same as being wrong. So really how do you calibrate timing? And then I think with AI, it's the same thing, that people are like, oh, it's going to completely upend society in all these positive and negative ways. I think most of those things are going to come true. The question is, when is that going to happen? And then with AI specifically, I think there's also, in addition to sort of the general tech tendency of jumping too fast to the future, I think that AI is particularly susceptible to that. And you look at self-driving, right? This idea of like, oh my God, you can have a self-driving car captured everybody's imaginations 10, 12 years ago. And you know, people were like, oh man, in two years, there's not going to be a human driver on the road to be seen. It didn't work out that way, right? We're still, 10, 12 years later, in a world where you can sort of sometimes get a Waymo in like one city on earth. Exciting, but it just took a lot longer than people think. And the reason is there's a lot of engineering challenges, but then there's a lot of other like societal time constants that are hard to compress. So one thing I think you can learn from things like self-driving is they have these levels of autonomy, these like maturity levels, and that's a useful kind of framework. People sort of skip to like level five, full autonomy, or we're going to have like an autonomous knowledge worker
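For the curious, the kind of "time audit" classifier Drew describes is a few lines of classical ML. Here is a minimal sketch in scikit-learn; the categories and calendar events are hypothetical stand-ins, not Drew's actual tooling:

```python
# Minimal sketch of a calendar "time audit" classifier (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical calendar events with hand-labeled meeting types.
events = [
    ("1:1 with Maya", "one_on_one"),
    ("Weekly 1:1 sync", "one_on_one"),
    ("Phone screen - backend candidate", "recruiting"),
    ("Onsite interview debrief", "recruiting"),
    ("Q3 product strategy review", "product_strategy"),
    ("Roadmap planning offsite", "product_strategy"),
]
titles, labels = zip(*events)

# Bag-of-ngrams + logistic regression: classic pre-LLM text classification.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(titles, labels)

print(clf.predict(["1:1 with recruiting lead"])[0])
```

It works as long as the inputs look like the training data, and falls over on exactly the malformed English Drew mentions, which is the gap LLMs closed.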
We are in 🗽 NYC this Monday! Join the AI Eng NYC meetup, bring demos and vibes! It is a bit of a meme that the first thing developer tooling founders think to build in AI is all the non-AI operational stuff outside the AI. There are well over 60 funded LLM Ops startups, all hoping to solve the new observability, cost tracking, security, and reliability problems that come with putting LLMs in production, not to mention new LLM-oriented products from established ops/o11y incumbents like Datadog and Weights & Biases. 2 years into the current hype cycle, the early winners have tended to be people with practical/research AI backgrounds rather than MLOps heavyweights or SWE tourists: * LangSmith: We covered how Harrison Chase worked on AI at Robust Intelligence and Kensho, the alma maters of many great AI founders * HumanLoop: We covered how Raza Habib worked at Google AI during his PhD * Braintrust: Today’s guest Ankur Goyal founded Impira pre-Transformers and was acquihired to run Figma AI before realizing how to solve the Ops problem. There have been many VC think pieces and market maps describing what people thought were the essential pieces of the AI Engineering stack, but what was true for 2022-2023 has aged poorly. The basic insight that Ankur had is the same thesis that Hamel Husain is pushing in his World’s Fair talk and podcast with Raza and swyx: Evals are the centerpiece of systematic AI Engineering. REALLY believing in this is harder than it looks with the benefit of hindsight. It’s not like people didn’t know evals were important. Basically every LLM Ops feature list has them. It’s an obvious next step AFTER managing your prompts and logging your LLM calls. In fact, up until we met Braintrust, we were working on an expanded version of the Impossible Triangle Theory of the LLM Ops War that we first articulated in the Humanloop writeup: The single biggest criticism of the Rise of the AI Engineer piece is that we neglected to split out the role of product evals (as opposed to model evals) in the now infamous “API line” chart: The AI SDLC With hindsight, we were very focused on the differentiating 0 to 1 phase that AI Engineers can bring to an existing team of ML engineers. As swyx says in the Day 2 keynote of AI Engineer, 2024 added a whole new set of concerns as AI Engineering grew up: A closer examination of Hamel’s product-oriented virtuous cycle and this infra-oriented SDLC would have eventually revealed that Evals, even more than logging, are the first point where teams start to get really serious about shipping to production, and therefore a great place to make an entry into the marketplace, which is exactly what Braintrust did. Also notice what’s NOT on this chart: shifting to shadow open source models, and finetuning them… per Ankur, Fine-tuning is not a viable standalone product: “The thing I would say is not debatable is whether or not fine-tuning is a business outcome or not. So let's think about the other components of your triangle. Ops/observability, that is a business… Frameworks, evals, databases [are a business, but] Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning, it is can I automatically optimize my use case to perform better if I throw data at the problem?
And fine-tuning is one of multiple ways to achieve that.” OpenAI vs Open AI Market Share We last speculated about the market shifts in the End of OpenAI Hegemony and the Winds of AI Winter, and Ankur’s perspective is super valuable given his customer list: Some surprises based on what he is seeing: * Prior to Claude 3, OpenAI had near 100% market share. This tracks with what Harrison told us last year. * Claude 3.5 Sonnet and also notably Haiku have made serious dents * Open source model adoption is low. Contra to Eugene Cheah’s ideal marketing pitch, virtually none of Braintrust’s customers are really finetuning open source models for cost, control, or privacy. This is partially caused by… * Open source model hosts, aka Inference providers, aren’t as mature as OpenAI’s API platform. Kudos to Michelle’s team, as if they needed any more praise! * Adoption of Big Lab models via their Big Cloud Partners, aka Claude through AWS, or OpenAI through Azure, is low. Surprising! It seems that there are issues with accessing the latest models via the Cloud partners. swyx [01:36:51]: What % of your workload is open source? Ankur Goyal [01:36:55]: Because of how we're deployed, I don't have like an exact number for you. Among customers running in production, it's less than 5%. Full Video Episode Check out the Braintrust demo on YouTube! (and like and subscribe etc) Show Notes * Ankur’s companies * MemSQL/SingleStore → now Nikita Shamgunov of Neon * Impira * Braintrust * Papers mentioned * AlexNet * BERT Paper * Layout LM Paper * GPT-3 Paper * Voyager Paper * AI Engineer World's Fair * Ankur and Olmo’s talk at AIEWF * Together.ai * Fireworks * People * Nikita Shamgunov * Alana Goyal * Elad Gil * Clem Delangue * Guillermo Rauch * Prior episodes * HumanLoop episode * Michelle Pokrass episode * Dylan Patel episode Timestamps * [00:00:00] Introduction and background on Ankur's career * [00:00:49] SingleStore and HTAP databases * [00:08:19] Founding Impira and lessons learned * [00:13:33] Unstructured vs Structured Data * [00:25:41] Overview of Braintrust and its features * [00:40:42] Industry observations and trends in AI tooling * [00:58:37] Workload types and AI use cases in production * [01:06:37] World's Fair AI conference discussion * [01:11:09] AI infrastructure market landscape * [01:24:59] OpenAI vs Anthropic vs other model providers * [01:38:11] GPU inference market discussion * [01:45:39] Hypothetical AI projects outside of Braintrust * [01:50:25] Potentially joining OpenAI * [01:52:37] Insights on effective networking and relationships in tech Transcript swyx [00:00:00]: Ankur Goyal, welcome to Latent Space. Ankur Goyal [00:00:06]: Thanks for having me. swyx [00:00:07]: Thanks for coming all the way over to our studio. Ankur Goyal [00:00:10]: It was a long hike. swyx [00:00:11]: A long trek. Yeah. You got T-boned by traffic. Yeah. You were the first VP of Eng at SingleStore. Yeah. Then you started Impira. You ran it for six years, got acquired into Figma, where you were at for eight months, and you just celebrated your one-year anniversary at Braintrust. I did, yeah. What a journey. I kind of want to go through each in turn because I have a personal relationship with SingleStore just because I have been a follower and fan of databases for a while. HTAP is always a dream of every database guy. It's still the dream. And SingleStore, I think, is the leading HTAP. Yeah. What's that journey like? And then maybe we'll cover the rest later. Ankur Goyal [00:00:49]: Sounds good.
swyx [00:00:50]: We can start with SingleStore first. Yeah, yeah. Ankur Goyal [00:00:52]: In college, as a first-generation Indian kid, I basically had two options. I had already told my parents I wasn't going to be a doctor. They're both doctors, so only two options left. Do a PhD or work at a big company. After my sophomore year, I worked at Microsoft, and it just wasn't for me. I realized that the work I was doing was impactful. I was working on Bing and the distributed compute infrastructure at Bing, which is actually now part of Azure. There were hundreds of engineers using the infrastructure that we were working on, but the level of intensity was too low. It felt like you got work-life balance and impact, but very little creativity, very little room to do interesting things. I was like, okay, let me cross that off the list. The only option left is to do research. I did research the next summer, and I realized, again, no one's working that hard. Maybe the times have changed, but at that point, there's a lot of creativity. You're just bouncing around fun ideas and working on stuff and really great work-life balance, but no one would actually use the stuff that we built, and that was not super energizing for me. I had this existential crisis, and I moved out to San Francisco because I had a friend who was here and crashed on his couch and was talking to him and just very, very confused. He said, you should talk to a recruiter, which felt like really weird advice. I'm not even sure I would give that advice to someone nowadays, but I met this really great guy named John, and he introduced me to like 30 different companies. I realized that there's actually a lot of interesting stuff happening in startups, and maybe I could find this kind of company that let me be very creative and work really hard and have a lot of impact, and I don't give a s**t about work-life balance. I talked to all these companies, and I remember I met MemSQL when it was three people and interviewed, and I thought I just totally failed the interview, but I had never had so much fun in my life. I remember I was at 10th and Harrison, and I stood at the bus station, and I called my parents and said, I'm sorry, I'm dropping out of school. I thought I wouldn't get the offer, but I just realized that if there's something like this company, then this is where I need to be. Luckily, things worked out, and I got an offer, and I joined as employee number two, and I worked there for almost six years, and it was an incredible experience. Learned a lot about systems, got to work with amazing customers. There are a lot of things that I later realized at Impira I had taken for granted, and the most exciting thing is I got to run the engineering team, which was a great opportunity to learn about tech on a larger stage, recruit a lot of great people, and I think, for me personally, set me up to do a lot of interesting things after. swyx [00:03:41]: Yeah, there's so many ways I can take that. The most curious, I think, for general audiences is, is the dream real of S
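Since the Braintrust conversation keeps returning to the "evals first" thesis, here is a deliberately framework-free sketch of the minimum viable product eval; this is illustrative code, not Braintrust's actual SDK, and `call_model` is a hypothetical stand-in for your LLM wrapper:

```python
# Framework-free sketch of a product eval loop (not Braintrust's SDK).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    input: str      # what you send the model
    expected: str   # what a passing answer must contain

def substring_score(output: str, expected: str) -> float:
    """Crude but honest: 1.0 if the expected answer appears, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0

def run_eval(cases: list[Case], call_model: Callable[[str], str]) -> float:
    scores = [substring_score(call_model(c.input), c.expected) for c in cases]
    return sum(scores) / len(scores)

# Cases should come from real production traffic, not invented trivia.
cases = [
    Case("What is the capital of France?", "Paris"),
    Case("Extract the year from: 'Founded in 2017 in Berlin'", "2017"),
]

# call_model is your hypothetical LLM call:
# print(run_eval(cases, call_model))
```

The point is less the scorer than the habit: a number you re-run on every prompt, model, or pipeline change, which is where logging alone stops helping.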
We all have fond memories of the first Dev Day in 2023: and the blip that followed soon after. As Ben Thompson has noted, this year’s DevDay took a quieter, more intimate tone. No Satya, no livestream (slightly fewer people?). Instead of putting ChatGPT announcements in DevDay as in 2023, o1 was announced 2 weeks prior, and DevDay 2024 was reserved purely for developer-facing API announcements, primarily the Realtime API, Vision Finetuning, Prompt Caching, and Model Distillation. However, the larger venue and more spread out schedule did allow a lot more hallway conversations with attendees as well as more community presentations including our recent guest Alistair Pullen of Cosine as well as deeper dives from OpenAI including our recent guest Michelle Pokrass of the API Team. Thanks to OpenAI’s warm collaboration (we particularly want to thank Lindsay McCallum Rémy!), we managed to record exclusive interviews with many of the main presenters of both the keynotes and breakout sessions. We present them in full in today’s episode, together with a full lightly edited Q&A with Sam Altman. Show notes and related resources Some of these used in the final audio episode below * Simon Willison Live Blog * swyx live tweets and videos * Greg Kamradt coverage of Structured Output session, Scaling LLM Apps session * Fireside Chat Q&A with Sam Altman Timestamps * [00:00:00] Intro by Suno.ai * [00:01:23] NotebookLM Recap of DevDay * [00:09:25] Ilan's Strawberry Demo with Realtime Voice Function Calling * [00:19:16] Olivier Godement, Head of Product, OpenAI * [00:36:57] Romain Huet, Head of DX, OpenAI * [00:47:08] Michelle Pokrass, API Tech Lead at OpenAI ft. Simon Willison * [01:04:45] Alistair Pullen, CEO, Cosine (Genie) * [01:18:31] Sam Altman + Kevin Weil Q&A * [02:03:07] NotebookLM Recap of Podcast Transcript [00:00:00] Suno AI: Under dev daylights, code ignites. Real time voice streams reach new heights. o1 and GPT-4o in flight. Fine tune the future, data in sight. Schema sync up, outputs precise. Distill the models, efficiency splice. [00:00:33] AI Charlie: Happy October. This is your AI co host, Charlie. One of our longest standing traditions is covering major AI and ML conferences in podcast format. Delving, yes delving, into the vibes of what it is like to be there stitched in with short samples of conversations with key players, just to help you feel like you were there. [00:00:54] AI Charlie: Covering this year's Dev Day was significantly more challenging because we were all requested not to record the opening keynotes. So, in place of the opening keynotes, we had the viral NotebookLM Deep Dive crew, my new AI podcast nemesis, give you a seven minute recap of everything that was announced. [00:01:15] AI Charlie: Of course, you can also check the show notes for details. I'll then come back with an explainer of all the interviews we have for you today. Watch out and take care. [00:01:23] NotebookLM Recap of DevDay [00:01:23] NotebookLM: All right, so we've got a pretty hefty stack of articles and blog posts here all about OpenAI's Dev Day 2024. [00:01:32] NotebookLM 2: Yeah, lots to dig into there. [00:01:34] NotebookLM: Seems like you're really interested in what's new with AI. [00:01:36] NotebookLM 2: Definitely. And it seems like OpenAI had a lot to announce. New tools, changes to the company. It's a lot. [00:01:43] NotebookLM: It is.
And especially since you're interested in how AI can be used in the real world, you know, practical applications, we'll focus on that. [00:01:51] NotebookLM: Perfect. Like, for example, this Real time API, they announced that, right? That seems like a big deal if we want AI to sound, well, less like a robot. [00:01:59] NotebookLM 2: It could be huge. The real time API could completely change how we, like, interact with AI. Like, imagine if your voice assistant could actually handle it if you interrupted it. [00:02:08] NotebookLM: Or, like, have an actual conversation. [00:02:10] NotebookLM 2: Right, not just these clunky back and forth things we're used to. [00:02:14] NotebookLM: And they actually showed it off, didn't they? I read something about a travel app, one for languages. Even one where the AI ordered takeout. [00:02:21] NotebookLM 2: Those demos were really interesting, and I think they show how this real time API can be used in so many ways. [00:02:28] NotebookLM 2: And the tech behind it is fascinating, by the way. It uses persistent WebSocket connections and this thing called function calling, so it can respond in real time. [00:02:38] NotebookLM: So the function calling thing, that sounds kind of complicated. Can you, like, explain how that works? [00:02:42] NotebookLM 2: So imagine giving the AI access to this whole toolbox, right? [00:02:46] NotebookLM 2: Information, capabilities, all sorts of things. Okay. So take the travel agent demo, for example. With function calling, the AI can pull up details, let's say about Fort Mason, right, from some database. Like nearby restaurants, stuff like that. [00:02:59] NotebookLM: Ah, I get it. So instead of being limited to what it already knows, it can go and find the information it needs, like a human travel agent would. [00:03:07] NotebookLM 2: Precisely. And someone on Hacker News pointed out a cool detail. The API actually gives you a text version of what's being said. So you can store that, analyze it. [00:03:17] NotebookLM: That's smart. It seems like OpenAI put a lot of thought into making this API easy for developers to use. But, while we're on OpenAI, you know, besides their tech, there's been some news about, like, internal changes, too. [00:03:30] NotebookLM: Didn't they say they're moving away from being a non profit? [00:03:32] NotebookLM 2: They did. And it's got everyone talking. It's a major shift. And it's only natural for people to wonder how that'll change things for OpenAI in the future. I mean, there are definitely some valid questions about this move to for profit. Like, will they have more money for research now? [00:03:46] NotebookLM 2: Probably. But will they, you know, care as much about making sure AI benefits everyone? [00:03:51] NotebookLM: Yeah, that's the big question, especially with all the, like, the leadership changes happening at OpenAI too, right? I read that their Chief Research Officer left, and their VP of Research, and even their CTO. [00:04:03] NotebookLM 2: It's true. A lot of people are connecting those departures with the changes in OpenAI's structure. [00:04:08] NotebookLM: And I guess it makes you wonder what's going on behind the scenes. But they are still putting out new stuff. Like this whole fine tuning thing really caught my eye. [00:04:17] NotebookLM 2: Right, fine tuning. It's essentially taking a pre trained AI model. And, like, customizing it. [00:04:23] NotebookLM: So instead of a general AI, you get one that's tailored for a specific job. [00:04:27] NotebookLM 2: Exactly.
And that opens up so many possibilities, especially for businesses. Imagine you could train an AI on your company's data, you know, like how you communicate your brand guidelines. [00:04:37] NotebookLM: So it's like having an AI that's specifically trained for your company? [00:04:41] NotebookLM 2: That's the idea. [00:04:41] NotebookLM: And they're doing it with images now, too, right? [00:04:44] NotebookLM: Fine tuning with vision is what they called it. [00:04:46] NotebookLM 2: It's pretty incredible what they're doing with that, especially in fields like medicine. [00:04:50] NotebookLM: Like using AI to help doctors make diagnoses. [00:04:52] NotebookLM 2: Exactly. And AI could be trained on thousands of medical images, right? And then it could potentially spot things that even a trained doctor might miss. [00:05:03] NotebookLM: That's kind of scary, to be honest. What if it gets it wrong? [00:05:06] NotebookLM 2: Well, the idea isn't to replace doctors, but to give them another tool, you know, help them make better decisions. [00:05:12] NotebookLM: Okay, that makes sense. But training these AI models must be really expensive. [00:05:17] NotebookLM 2: It can be. All those tokens add up. But OpenAI announced something called automatic prompt caching. [00:05:23] Alex Volkov: Automatic what now? I don't think I came across that. [00:05:26] NotebookLM 2: So basically, if your AI sees a prompt that it's already seen before, OpenAI will give you a discount. [00:05:31] NotebookLM: Huh. Like a frequent buyer program for AI. [00:05:35] NotebookLM 2: Kind of, yeah. It's good that they're trying to make it more affordable. And they're also doing something called model distillation. [00:05:41] NotebookLM: Okay, now you're just using big words to sound smart. What's that? [00:05:45] NotebookLM 2: Think of it like a recipe, right? You can take a really complex recipe and break it down to the essential parts. [00:05:50] NotebookLM: Make it simpler, but it still tastes the same. [00:05:53] NotebookLM 2: Yeah. And that's what model distillation is. You take a big, powerful AI model and create a smaller, more efficient version. [00:06:00] NotebookLM: So it's like lighter weight, but still just as capable. [00:06:03] NotebookLM 2: Exactly. And that means more people can actually use these powerful tools. They don't need, like, a supercomputer to run them. [00:06:10] NotebookLM: So they're making AI more accessible. That's great. [00:06:13] NotebookLM 2: It is. And speaking of powerful tools, they also talked about their new O1 model. [00:06:18] NotebookLM 2: That's the one they've been hyping up. The one that's supposed to be this big leap forward. [00:06:22] NotebookLM: Yeah, O1. It sounds pretty futuristic. Like, from what I read, it's not just a bigger, better language model. [00:06:28] NotebookLM 2: Right. It's a different approach. [00:06:29] NotebookLM: They're saying it can, like, actually reason, right? Think. [00:06:33] NotebookLM 2: It's trained differently.
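The function calling pattern the NotebookLM hosts describe is concrete enough to sketch. Below is a minimal version against the regular Chat Completions API rather than the Realtime API; the `nearby_restaurants` tool and its schema are made up for illustration:

```python
# Sketch of the function-calling loop described above, using the standard
# Chat Completions API. The nearby_restaurants tool is hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "nearby_restaurants",
        "description": "Look up restaurants near a landmark",
        "parameters": {
            "type": "object",
            "properties": {"landmark": {"type": "string"}},
            "required": ["landmark"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find me dinner near Fort Mason."}],
    tools=tools,
)

# Assuming the model elected to call the tool rather than answer directly:
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# You run the tool yourself, append the result as a "tool" role message,
# and call the model again so it can answer in natural language.
```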
OpenAI DevDay is almost here! Per tradition, we are hosting a DevDay pregame event for everyone coming to town! Join us with demos and gossip! Also sign up for related events across San Francisco: the AI DevTools Night, the xAI open house, the Replicate art show, the DevDay Watch Party (for non-attendees), Hack Night with OpenAI at Cloudflare. For everyone else, join the Latent Space Discord for our online watch party and find fellow AI Engineers in your city. OpenAI’s recent o1 release (and Reflection 70b debacle) has reignited broad interest in agentic general reasoning and tree search methods. While we have covered some of the self-taught reasoning literature on the Latent Space Paper Club, it is notable that Eric Zelikman ended up at xAI, whereas OpenAI’s hiring of Noam Brown and now Shunyu suggests more interest in tool-using chain of thought/tree of thought/generator-verifier architectures for Level 3 Agents. We were more than delighted to learn that Shunyu is a fellow Latent Space enjoyer, and invited him back (after his first appearance on our NeurIPS 2023 pod) for a look through his academic career with Harrison Chase (one year after his first LS show). ReAct: Synergizing Reasoning and Acting in Language Models paper link Following seminal Chain of Thought papers from Wei et al and Kojima et al, and reflecting on lessons from building the WebShop human ecommerce trajectory benchmark, Shunyu’s first big hit, the ReAct paper, showed that using LLMs to “generate both reasoning traces and task-specific actions in an interleaved manner” achieved remarkably greater performance (less hallucination/error propagation, higher ALFWorld/WebShop benchmark success) than CoT alone. In even better news, ReAct scales fabulously with finetuning: As a member of the elite Princeton NLP group, Shunyu was also a coauthor of the Reflexion paper, which we discuss in this pod. Tree of Thoughts paper link here Shunyu’s next major improvement on the CoT literature was Tree of Thoughts: Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role… ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. The beauty of ToT is it doesn’t require pretraining with exotic methods like backspace tokens or other MCTS architectures. You can listen to Shunyu explain ToT in his own words on our NeurIPS pod, but also the ineffable Yannic Kilcher: Other Work We don’t have the space to summarize the rest of Shunyu’s work; you can listen to our pod with him now, and we recommend the CoALA paper and his initial hit webinar with Harrison, today’s guest cohost, as well as Shunyu’s PhD Defense Lecture and his latest lecture covering a Brief History of LLM Agents. As usual, we are live on YouTube!
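To make the ReAct idea concrete, here is a compressed sketch of the interleaved Thought/Action/Observation loop. The prompt format and toy one-entry corpus are ours, not the paper's, and we assume any chat model behind the OpenAI SDK:

```python
# Compressed sketch of the ReAct loop: the model emits Thought/Action lines,
# we execute the action, append an Observation, and repeat until Finish[...].
import re
from openai import OpenAI

client = OpenAI()
CORPUS = {"Colorado orogeny": "An episode of mountain building in Colorado and surrounding areas."}

def search(query: str) -> str:
    return CORPUS.get(query, "No result found.")

SYSTEM = (
    "Answer the question. Interleave lines of the form:\n"
    "Thought: <reasoning>\n"
    "Action: Search[<query>] or Finish[<answer>]\n"
    "Stop after each Action; an Observation will be provided."
)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        out = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": transcript}],
            stop=["Observation:"],  # observations come from us, not the model
        ).choices[0].message.content
        transcript += out + "\n"
        if m := re.search(r"Finish\[(.*?)\]", out):
            return m.group(1)       # the model decided it has the answer
        if m := re.search(r"Search\[(.*?)\]", out):
            transcript += f"Observation: {search(m.group(1))}\n"
    return "No answer within step budget."

print(react("What was the Colorado orogeny?"))
```

Swap the toy `search` for Wikipedia (as in the paper) and this is most of the trick: the reasoning trace and the tool calls share one context, so each observation reshapes the next thought.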
Show Notes * Harrison Chase * LangChain, LangSmith, LangGraph * Shunyu Yao * Alec Radford * ReAct Paper * Hotpot QA * Tau Bench * WebShop * SWE-Agent * SWE-Bench * Tree of Thoughts * CoALA Paper * Related Episodes * Our Thomas Scialom (Meta) episode * Shunyu on our NeurIPS 2023 Best Papers episode * Harrison on our LangChain episode * Mentions * Sierra * Voyager * Jason Wei * Tavily * SERP API * Exa Timestamps * [00:00:00] Opening Song by Suno * [00:03:00] Introductions * [00:06:16] The ReAct paper * [00:12:09] Early applications of ReAct in LangChain * [00:17:15] Discussion of the Reflexion paper * [00:22:35] Tree of Thoughts paper and search algorithms in language models * [00:27:21] SWE-Agent and SWE-Bench for coding benchmarks * [00:39:21] CoALA: Cognitive Architectures for Language Agents * [00:45:24] Agent-Computer Interfaces (ACI) and tool design for agents * [00:49:24] Designing frameworks for agents vs humans * [00:53:52] UX design for AI applications and agents * [00:59:53] Data and model improvements for agent capabilities * [01:19:10] TauBench * [01:23:09] Promising areas for AI Transcript Alessio [00:00:01]: Hey, everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:12]: Hey, and today we have a super special episode. I actually always wanted to take like a selfie and go like, you know, POV, you're about to revolutionize the world of agents because we have two of the most awesome hiring agents in the house. So first, we're going to welcome back Harrison Chase. Welcome. Excited to be here. What's new with you recently in sort of like the 10, 20 second recap? Harrison [00:00:34]: LangChain, LangSmith, LangGraph, pushing on all of them. Lots of cool stuff related to a lot of the stuff that we're going to talk about today, probably. Swyx [00:00:42]: Yeah. Alessio [00:00:43]: We'll mention it in there. And the Celtics won the title. Swyx [00:00:45]: And the Celtics won the title. You got that going on for you. I don't know. Is that like floorball? Handball? Baseball? Basketball. Alessio [00:00:52]: Basketball, basketball. Harrison [00:00:53]: Patriots aren't looking good though, so that's... Swyx [00:00:56]: And then Shunyu, you've also been on the pod, but only in like a sort of oral paper presentation capacity. But welcome officially to the Latent Space pod. Shunyu [00:01:03]: Yeah, I've been a huge fan. So thanks for the invitation. Thanks. Swyx [00:01:07]: Well, it's an honor to have you on. You're one of like, you're maybe the first PhD thesis defense I've ever watched in like this AI world, because most people just publish single papers, but every paper of yours is a banger. So congrats. Shunyu [00:01:22]: Thanks. Swyx [00:01:24]: Yeah, maybe we'll just kick it off with, you know, what was your journey into using language models for agents? I like that your thesis advisor, I didn't catch his name, but he was like, you know... Karthik. Yeah. It's like, this guy just wanted to use language models and it was such a controversial pick at the time. Right. Shunyu [00:01:39]: The full story is that in undergrad, I did some computer vision research and that's how I got into AI. But at the time, I feel like, you know, you're just composing all the GAN or 3D perception or whatever together and it's not exciting anymore. And one day I just see this transformer paper and that's really cool.
But I really got into language models only when I entered my PhD and met my advisor Karthik. So he was actually the second author of GPT-1 when he was like a visiting scientist at OpenAI. With Alec Radford? Swyx [00:02:10]: Yes. Shunyu [00:02:11]: Wow. That's what he told me. It's like back in OpenAI, they did this GPT-1 together and Ilya just said, Karthik, you should stay because we just solved the language. But apparently Karthik is not fully convinced. So he went to Princeton, started his professorship and I'm really grateful. So he accepted me as a student, even though I have no prior knowledge in NLP. And you know, we just met for the first time and he's like, you know, what do you want to do? And I'm like, you know, you have done those text game things. That's really cool. I wonder if we can just redo them with language models. And that's how the whole journey began. Awesome. Alessio [00:02:46]: So GPT-2 was out at the time? Yes, that was 2019. Shunyu [00:02:48]: Yeah. Alessio [00:02:49]: Way too dangerous to release. And then I guess the first work of yours that I came across was ReAct, which was a big part of your defense. But also Harrison, when you came on the podcast last year, you said that was one of the first papers that you saw when you were getting inspired for LangChain. So maybe give a recap of why you thought it was cool, because you were already working in AI and machine learning. And then, yeah, you can kind of like intro the paper formally. What was that interesting to you specifically? Harrison [00:03:16]: Yeah, I mean, I think the interesting part was using these language models to interact with the outside world in some form. And I think in the paper, you mostly deal with Wikipedia. And I think there's some other data sets as well. But the outside world is the outside world. And so interacting with things that weren't present in the LLM and APIs and calling into them and thinking about the ReAct reasoning and acting and kind of like combining those together and getting better results. I'd been playing around with LLMs, been talking with people who were playing around with LLMs. People were trying to get LLMs to call into APIs, do things, and it was always, how can they do it more reliably and better? And so this paper was basically a step in that direction. And I think really interesting and also really general as well. Like I think that's part of the appeal is just how general and simple in a good way, I think the idea was. So that it was really appealing for all those reasons. Shunyu [00:04:07]: Simple is always good. Yeah. Alessio [00:04:09]: Do you have a favorite part? Because I have one favorite part from your PhD defense, which I didn't understand when I read the paper, but you said something along the lines, ReAct doesn't change the outside or the environment, but it does change the inside through the context, putting more things in the context. You're not actually changing any of the tools around you to work for you, but you're changing how the model thinks. And I think that was like a very profound thing when I, now that I've been using these tools for like 18 months, I'm like, I understand what you meant, but like to say that at the time you did the PhD defense was not trivial. Yeah. Shunyu [00:04:41]:
Noah Hein from Latent Space University is finally launching with a free lightning course this Sunday for those new to AI Engineering. Tell a friend! Did you know there are >1,600 papers on arXiv just about prompting? Between shots, trees, chains, self-criticism, planning strategies, and all sorts of other weird names, it’s hard to keep up. Luckily for us, Sander Schulhoff and team read them all and put together The Prompt Report as the ultimate prompt engineering reference, which we’ll break down step-by-step in today’s episode. In 2022 swyx wrote “Why “Prompt Engineering” and “Generative AI” are overhyped”; the TLDR being that if you’re relying on prompts alone to build a successful product, you’re ngmi. Prompt engineering has moved from being a stand-alone job to a core skill for AI Engineers. We won’t repeat everything that is written in the paper, but this diagram encapsulates the state of prompting today: confusing. There are many similar terms, esoteric approaches that have doubtful impact on results, and lots of people that are just trying to create full papers around a single prompt just to get more publications out. Luckily, some of the best prompting techniques are being tuned back into the models themselves, as we’ve seen with o1 and Chain-of-Thought (see our OpenAI episode). Similarly, OpenAI recently announced 100% guaranteed JSON schema adherence, and Anthropic, Cohere, and Gemini all have JSON Mode (not sure if 100% guaranteed yet). No more “return JSON or my grandma is going to die” required. The next debate is human-crafted prompts vs automated approaches using frameworks like DSPy, which Sander recommended: I spent 20 hours prompt engineering for a task and DSPy beat me in 10 minutes. It’s much more complex than simply writing a prompt (and I’m not sure how many people usually spend >20 hours prompt engineering one task), but if you’re hitting a roadblock it might be worth checking out. Prompt Injection and Jailbreaks Sander and team also worked on HackAPrompt, a paper that was the outcome of an online challenge on prompt hacking techniques. They similarly created a taxonomy of prompt attacks, which is very handy if you’re building products with user-facing LLM interfaces that you’d like to test: In this episode we basically break down every category and highlight the overrated and underrated techniques in each of them. If you haven’t spent time following the prompting meta, this is a great episode to catch up! Full Video Episode Like and subscribe on YouTube! Timestamps * [00:00:00] Introductions - Intro music by Suno AI * [00:07:32] Navigating arXiv for paper evaluation * [00:12:23] Taxonomy of prompting techniques * [00:15:46] Zero-shot prompting and role prompting * [00:21:35] Few-shot prompting design advice * [00:28:55] Chain of thought and thought generation techniques * [00:34:41] Decomposition techniques in prompting * [00:37:40] Ensembling techniques in prompting * [00:44:49] Automatic prompt engineering and DSPy * [00:49:13] Prompt Injection vs Jailbreaking * [00:57:08] Multimodal prompting (audio, video) * [00:59:46] Structured output prompting * [01:04:23] Upcoming Hack-a-Prompt 2.0 project Show Notes * Sander Schulhoff * Learn Prompting * The Prompt Report * HackAPrompt * MineRL Competition * EMNLP Conference * Noam Brown * Jordan Boyd-Graber * Denis Peskov * Simon Willison * Riley Goodside * David Ha * Jeremy Nixon * Shunyu Yao * Nicholas Carlini * Dreadnode Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast.
This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:13]: Hey, and today we're in the remote studio with Sander Schulhoff, author of the Prompt Report. Sander [00:00:18]: Welcome. Thank you. Very excited to be here. Swyx [00:00:21]: Sander, I think I first chatted with you like over a year ago. What's your brief history? I went onto your website, it looks like you worked on diplomacy, which is really interesting because we've talked with Noam Brown a couple of times, and that obviously has a really interesting story in terms of prompting and agents. What's your journey into AI? Sander [00:00:40]: Yeah, I'd say it started in high school. I took my first Java class and just saw a YouTube video about something AI and started getting into it, reading. Deep learning, neural networks, all came soon thereafter. And then going into college, I got into Maryland and I emailed just like half the computer science department at random. I was like, hey, I want to do research on deep reinforcement learning because I've been experimenting with that a good bit. And over that summer, I had read the Intro to RL book and Deep Reinforcement Learning Hands-On, so I was very excited about what deep RL could do. And a couple of people got back to me and one of them was Jordan Boyd-Graber, Professor Boyd-Graber, and he was working on diplomacy. And he said to me, this looks like it was more of a natural language processing project at the time, but it's a game, so very easily could move more into the RL realm. And I ended up working with one of his students, Denis Peskov, who's now a postdoc at Princeton. And that was really my intro to AI, NLP, deep RL research. And so from there, I worked on diplomacy for a couple of years, mostly building infrastructure for data collection and machine learning, but I always wanted to be doing it myself. So I had a number of side projects and I ended up working on the MineRL competition, Minecraft reinforcement learning, also some people call it mineral. And that ended up being a really cool opportunity because I think like sophomore year, I knew I wanted to do some project in deep RL and I really liked Minecraft. And so I was like, let me combine these. And I was searching for some Minecraft Python library to control agents and found MineRL. And I was trying to find documentation for how to build a custom environment and do all sorts of stuff. I asked in their Discord how to do this and they're super responsive, very nice. And they're like, oh, you know, we don't have docs on this, but, you know, you can look around. And so I read through the whole code base and figured it out and wrote a PR and added the docs that I didn't have before. And then later I ended up joining their team for about a year. And so they maintain the library, but also run a yearly competition. That was my first foray into competitions. And I was still working on diplomacy. At some point I was working on this translation task between DAIDE, which is a diplomacy-specific bot language, and English. And I started using GPT-3, prompting it to do the translation. And that was, I think, my first intro to prompting. And I just started doing a bunch of reading about prompting. And I had an English class project where we had to write a guide on something that ended up being Learn Prompting. So I figured, all right, well, I'm learning about prompting anyways. You know, Chain of Thought was out at this point.
There are a couple blog posts floating around, but there was no website you could go to just sort of read everything about prompting. So I made that. And it ended up getting super popular. Now continuing with it, supporting the project now after college. And then the other very interesting things, of course, are the two papers I wrote. And that is the Prompt Report and HackAPrompt. So I saw Simon and Riley's original tweets about prompt injection go across my feed. And I put that information into the Learn Prompting website. And I knew, because I had some previous competition running experience, that someone was going to run a competition with prompt injection. And I waited a month, figured, you know, I'd participate in one of these that comes out. No one was doing it. So I was like, what the heck, I'll give it a shot. Just started reaching out to people. Got some people from Mila involved, some people from Maryland, and raised a good amount of sponsorship. I had no experience doing that, but just reached out to as many people as I could. And we actually ended up getting literally all the sponsors I wanted. So like OpenAI, actually, they reached out to us a couple months after I started Learn Prompting. And then Preamble is the company that first discovered prompt injection even before Riley. And they like responsibly disclosed it kind of internally to OpenAI. And having them on board as the largest sponsor was super exciting. And then we ran that, collected 600,000 malicious prompts, put together a paper on it, open sourced everything. And we took it to EMNLP, which is one of the top natural language processing conferences in the world. 20,000 papers were submitted to that conference, 5,000 papers were accepted. We were one of three selected as best papers at the conference, which was just massive. Super, super exciting. I got to give a talk to like a couple thousand researchers there, which was also very exciting. And I kind of carried that momentum into the next paper, which was the Prompt Report. It was kind of a natural extension of what I had been doing with Learn Prompting in the sense that we had this website bringing together all of the different prompting techniques, a survey website in and of itself. So writing an actual survey, a systematic survey, was the next step that we did in the Prompt Report. So over the course of about nine months, I led a 30-person research team with people from OpenAI, Google, Microsoft, Princeton, Stanford, Maryland, a number of other universities and companies. And we pretty much read thousands of papers on prompting and compiled it all into like an 80-page massive summary doc. And then we put it on arXiv and the response was amazing. We've gotten millions of views across socials. I actually put together a spreadsheet where I've been able to track about one and a half million. And I just kind
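Two of the workhorse techniques from the Prompt Report's taxonomy, few-shot prompting and chain of thought, compose in a few lines of dependency-free Python. The exemplar below is illustrative, not from the paper; in practice you would draw shots from your own task data:

```python
# Dependency-free sketch of few-shot prompting plus a chain-of-thought trigger.
EXEMPLARS = [
    {
        "q": "Roger has 5 balls and buys 2 cans of 3 balls each. How many balls?",
        "cot": "He buys 2 * 3 = 6 new balls, and 5 + 6 = 11.",
        "a": "11",
    },
]

def few_shot_cot_prompt(question: str) -> str:
    shots = "\n\n".join(
        f"Q: {e['q']}\nA: Let's think step by step. {e['cot']} The answer is {e['a']}."
        for e in EXEMPLARS
    )
    # The trailing trigger phrase elicits the same step-by-step format.
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."

print(few_shot_cot_prompt("A train leaves at 3pm and arrives at 5:30pm. How long is the trip?"))
```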
Congrats to Damien on successfully running AI Engineer London! See our community page and the Latent Space Discord for all upcoming events. This podcast came together in a far more convoluted way than usual, but happens to result in a tight 2 hours covering the ENTIRE OpenAI product suite across ChatGPT-latest, GPT-4o and the new o1 models, and how they are delivered to AI Engineers in the API via the new Structured Output mode, Assistants API, client SDKs, upcoming Voice Mode API, Finetuning/Vision/Whisper/Batch/Admin/Audit APIs, and everything else you need to know to be up to speed in September 2024. This podcast has two parts: the first hour is a regular, well edited, podcast on 4o, Structured Outputs, and the rest of the OpenAI API platform. The second was a rushed, noisy, hastily cobbled together recap of the top takeaways from the o1 model release from yesterday and today. Building AGI with Structured Outputs — Michelle Pokrass of OpenAI API team Michelle Pokrass built massively scalable platforms at Google, Stripe, Coinbase and Clubhouse, and now leads the API Platform at OpenAI. She joins us today to talk about why structured output is such an important modality for AI Engineers that OpenAI has now trained and engineered a Structured Output mode with 100% reliable JSON schema adherence. To understand why this is important, a bit of history is important: * June 2023 when OpenAI first added a "function calling" capability to GPT-4-0613 and GPT 3.5 Turbo 0613 (our podcast/writeup here) * November 2023’s OpenAI Dev Day (our podcast/writeup here) where the team shipped JSON Mode, a simpler schema-less JSON output mode that nevertheless became more popular because function calling often failed to match the JSON schema given by developers. * Meanwhile, in open source, many solutions arose, including * Instructor (our pod with Jason here) * LangChain (our pod with Harrison here, and he is returning next as a guest co-host) * Outlines (Remi Louf’s talk at AI Engineer here) * Llama.cpp’s constrained grammar sampling using GBNF * April 2024: OpenAI started implementing constrained sampling with a new `tool_choice: required` parameter in the API * August 2024: the new Structured Output mode, co-led by Michelle * Sept 2024: Gemini shipped Structured Outputs as well We sat down with Michelle to talk through every part of the process, as well as quizzing her for updates on everything else the API team has shipped in the past year, from the Assistants API, to Prompt Caching, GPT4 Vision, Whisper, the upcoming Advanced Voice Mode API, OpenAI Enterprise features, and why every Waterloo grad seems to be a cracked engineer. Part 1 Timestamps and Transcript Transcript here.
* [00:00:42] Episode Intro from Suno * [00:03:34] Michelle's Path to OpenAI * [00:12:20] Scaling ChatGPT * [00:13:20] Releasing Structured Output * [00:16:17] Structured Outputs vs Function Calling * [00:19:42] JSON Schema and Constrained Grammar * [00:20:45] OpenAI API team * [00:21:32] Structured Output Refusal Field * [00:24:23] ChatML issues * [00:26:20] Function Calling Evals * [00:28:34] Parallel Function Calling * [00:29:30] Increased Latency * [00:30:28] Prompt/Schema Caching * [00:30:50] Building Agents with Structured Outputs: from API to AGI * [00:31:52] Assistants API * [00:34:00] Use cases for Structured Output * [00:37:45] Prompting Structured Output * [00:39:44] Benchmarking Prompting for Structured Outputs * [00:41:50] Structured Outputs Roadmap * [00:43:37] Model Selection vs GPT4 Finetuning * [00:46:56] Is Prompt Engineering Dead? * [00:47:29] 2 models: ChatGPT Latest vs GPT 4o August * [00:50:24] Why API => AGI * [00:52:40] Dev Day * [00:54:20] Assistants API Roadmap * [00:56:14] Model Reproducibility/Determinism issues * [00:57:53] Tiering and Rate Limiting * [00:59:26] OpenAI vs Ops Startups * [01:01:06] Batch API * [01:02:54] Vision * [01:04:42] Whisper * [01:07:21] Voice Mode API * [01:08:10] Enterprise: Admin/Audit Log APIs * [01:09:02] Waterloo grads * [01:10:49] Books * [01:11:57] Cognitive Biases * [01:13:25] Are LLMs Econs? * [01:13:49] Hiring at OpenAI Emergency O1 Meetup — OpenAI DevRel + Strawberry team The following is our writeup from AINews, which so far stands the test of time. o1, aka Strawberry, aka Q*, is finally out! There are two models we can use today: o1-preview (the bigger one priced at $15 in / $60 out) and o1-mini (the STEM-reasoning focused distillation priced at $3 in/$12 out) - and the main o1 model is still in training. This caused a little bit of confusion. There are a raft of relevant links, so don’t miss: * the o1 Hub * the o1-preview blogpost * the o1-mini blogpost * the technical research blogpost * the o1 system card * the platform docs * the o1 team video and contributors list (twitter) In line with the many, many leaks leading up to today, the core story is longer “test-time inference” aka longer step by step responses - in the ChatGPT app this shows up as a new “thinking” step that you can click to expand for reasoning traces, even though, controversially, they are hidden from you (interesting conflict of interest…): Under the hood, o1 is trained for adding new reasoning tokens - which you pay for, and OpenAI has accordingly extended the output token limit to >30k tokens (incidentally this is also why a number of API parameters from the other models, like temperature, role, tool calling, and streaming, and especially max_tokens, are no longer supported). The evals are exceptional. OpenAI o1: * ranks in the 89th percentile on competitive programming questions (Codeforces), * places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), * and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). You are used to new models showing flattering charts, but there is one of note that you don’t see in many model announcements, and it is probably the most important chart of all. Dr Jim Fan gets it right: we now have scaling laws for test time compute, and it looks like they scale log-linearly.
We unfortunately may never know the drivers of the reasoning improvements, but Jason Wei shared some hints: Usually the big model gets all the accolades, but notably many are calling out the performance of o1-mini for its size (smaller than gpt 4o), so do not miss that. Part 2 Timestamps * [01:15:01] O1 transition * [01:16:07] O1 Meetup Recording * [01:38:38] OpenAI Friday AMA recap * [01:44:47] Q&A Part 2 * [01:50:28] O1 Demos
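For API users, the practical difference shows up in the request and the usage accounting. Here is a sketch of what an o1 call looked like at launch, with the caveat that the exact field names (`max_completion_tokens`, `completion_tokens_details.reasoning_tokens`) are our reading of the launch docs and should be treated as assumptions:

```python
# Sketch of an o1 call at launch: sampling knobs like temperature are
# rejected, max_tokens gives way to max_completion_tokens, and the usage
# block breaks out the hidden reasoning tokens you are billed for.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    max_completion_tokens=4096,  # budget spans hidden reasoning + visible answer
)

usage = resp.usage
print("total output tokens billed:", usage.completion_tokens)
print("of which hidden reasoning:", usage.completion_tokens_details.reasoning_tokens)
```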
AI Engineering is expanding! Join the first 🇬🇧 AI Engineer London meetup in Sept and get in touch for sponsoring the second 🗽 AI Engineer Summit in NYC this Dec! The commoditization of intelligence takes on a few dimensions: * Time to Open Model Equivalent: 15 months between GPT-4 and Llama 3.1 405B * 10-100x CHEAPER/year: from $30/mtok for Claude 3 Opus to $3/mtok for L3-405B, and a 400x reduction in the frontier OpenAI model from 2022-2024. Notably, for personal use cases, both Gemini Flash and now Cerebras Inference offer 1m tokens/day inference free, causing the Open Model Red Wedding. * Alternatively you can observe the frontiers of various small/medium/large sizes of intelligence per dollar shift in realtime. 2024 has been particularly aggressive with almost 2 order-of-magnitude improvements in $/Elo points in the last 8 months. * 4-8x FASTER/year: The new Cerebras Inference platform runs 70B models at 450 tok/s, almost twice as fast as the Groq Cloud example that went viral earlier this year (and at $0.60/mtok to boot). James Wang says they have room to “~8x throughput in the next few months”, which needs to be seen in reality and at scale, but is very exciting for downstream latency/throughput-sensitive usecases. Today’s guest, Nyla Worker, a senior PM at Nvidia, Convai, and now Google, and recently host of the GPUs & Inference track at the World’s Fair, was the first to point out to us that the kind of efficiency improvements that have become a predominant theme in LLMs in 2024 have been seen before in her career in computer vision. From her start at eBay optimizing V100 inference for a ResNet-50 model for image search, she has watched many improvements like Multi-Instance GPU (allowing multiple instances with perfect hardware parallelism), Quantization Aware Training (most recently highlighted by Noam Shazeer pre-Character AI departure) and Model Distillation (most recently highlighted by the Llama 3.1 paper) stacking with baseline hardware improvements (from V100s to A100s to H100s to GH200s) to produce theoretically 3000x faster inference now than 6 years ago. What Nyla saw in her career the last 6 years is happening to LLMs today (not exactly repeating, but surely rhyming), specifically with LoRAs, native Int8 and even Ternary models, and teacher model distillation. We were excited to delve into all things efficiency in this episode and even come out the other side with bonus discussions on what generative AI can do for gaming, fanmade TV shows, character AI conversations, and even podcasting! Show Notes: * Nyla Linkedin, Twitter * Related Nvidia research * Improving INT8 Accuracy Using Quantization Aware Training and the NVIDIA TAO Toolkit * Nvidia Jetson Nano: Bringing the power of modern AI to millions of devices.
* Synthetic Data with Nvidia Omniverse Replicator: Accelerate AI Training Faster Than Ever with New NVIDIA Omniverse Replicator Capabilities Timestamps * [00:00:00] Intro from Suno * [00:03:17] Nyla's path from Astrophysics to LLMs * [00:05:45] Efficiency Curves in Computer Vision at Nvidia * [00:09:51] Optimizing for today's hardware vs tomorrow's inference * [00:16:33] Quantization vs Precision tradeoff * [00:20:42] Hitting the Data Wall: The need for Synthetic Data at Nvidia * [00:26:20] Sora, text to 3D models, and Synthetic Data from Game Engines * [00:30:55] ResNet 50 keeps coming back * [00:35:40] Gaming Benchmarks * [00:38:00] FineWeb * [00:39:43] Traditional ML vs LLMs path to general intelligence * [00:42:33] ConvAI - AI NPCs * [00:45:32] Jensen and Lisa at Computex Taiwan * [00:52:51] NPCs need to take Actions and have Context * [00:54:29] Simulating different roles for training * [00:58:37] AI Generated Fan Content - Podcasts, TV Show, Einstein Transcripts [00:00:29] AI Charlie: Happy September. This is your AI co host, Charlie. [00:00:34] AI Charlie: One topic we've developed on LatentSpace is the importance of efficiency in all forms, from sample efficiency for spending limited training compute on limited data, and increasingly towards inference efficiency for increasingly demanding use cases like local LLMs, real time AI NPCs, and edge AI. However, we've never really developed any intuition for the trends in efficiency over time. [00:00:59] AI Charlie: For example, from 2020 to 2023, the price of GPT-3 level intelligence dropped from $60 per million tokens to 27 cents with the Mixtral price war of December 2023. See show notes for charts and data. As for GPT-4 level intelligence, it took just over a year for GPT-4 to be matched by Llama 3 70B and GPT-4 Turbo to be beaten by Llama 3 405B in open source, causing blended cost per million tokens to freefall from over $30 for Claude 3 Opus and the original GPT-4 down to under $3 for Llama 3 405B. [00:01:43] AI Charlie: Of course, OpenAI themselves have not stood still, slashing the price of GPT-4o by 30 times with GPT-4o Mini. Yes, you heard that right. GPT-4o Mini is 3.5 percent the price of GPT-4o, yet ties with GPT-4 Turbo on LMSYS. When the price of intelligence is falling by over 90 percent every year, what are the driving forces? [00:02:10] AI Charlie: And how should AI engineers plan for this? It turns out that this has happened before in computer vision, which has seen an almost 3,000 times latency improvement over the last 6 years. We invited Nyla Worker of NVIDIA and Convai, who first made this comparison, to help talk us through the past, present, and future use cases of efficient AI inference. [00:02:35] AI Charlie: Note that this was recorded before Nyla joined Google AI to work on efficiency, so you can expect more great efficiency work coming from her on the Gemini team. In latent space news, look out for our upcoming London and NYC meetups on the community page, and of course feel free to start your own and simply let us know. [00:02:54] AI Charlie: Watch out and take care. [00:02:57] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co host Swyx, founder of Smol AI. [00:03:11] swyx: Hey, and today we are in the remote studio with Nyla Worker. Welcome, Nyla. Good to see you. [00:03:16] Nyla Worker: Good to see you all.
[00:03:17] Nyla's path from Astrophysics to LLMs [00:03:17] swyx: So we try to introduce people based on sort of their professional profile and then let you fill in the blanks. [00:03:22] swyx: Um, so you did astrophysics research at Carleton College, uh, and then you made your way into machine learning. We're going to talk about your time at eBay, but most recently you spent four years at Nvidia, uh, working on everything from synthetic data to cloud container offerings. And now currently you're director of product management at Convai. [00:03:41] swyx: What should people know about you that maybe it's not super obvious on your LinkedIn that it's, you know. Encapsulates your life journey so far. [00:03:47] Nyla Worker: And yeah, I think the thing that is not very obvious is that transition from astrophysics research to AI and how that happens. So within astrophysics, what I was doing on my freshman year of college was categorizing whether this was a supernova remnant or like an exoplanet. [00:04:06] Nyla Worker: And while that sounds all cool and incredible, it's literally looking at images of like oxygen and sulfur and manually selecting each region. And it is extremely boring, shall I say. So I then found a paper from 1996, um, called Source Extractor, or like he called it Sextractor for some reason. And it was a multi-layer perceptron network that had been trained on synthetic data. [00:04:38] Nyla Worker: To categorize whether this was a star or a galaxy, that led me to see that there was this massive optimization machine that when fed with right data, it could perform and automate tasks such as this kind of manual classification. That made me want to learn more. How do you train these things? How do you deploy them effectively? [00:05:00] Nyla Worker: And if it's useful for just classifying galaxies, what other applications are there out there where we show a bunch of data and just train these functions to just predict the next word in the case of LLMs or predict, uh, what is. Is this a cat or a dog and things like that. So then I went to computer vision research, particularly scaling the training of deep neural networks. [00:05:24] Nyla Worker: Back then I was using CPUs, doing it wrongly, of course. Uh, and then I went to eBay where I switched to GPUs, but I was working also on like the Jetsons and Edge devices. That is an interesting transition in how it all flows together. [00:05:41] swyx: We can talk about that and also how you transition from that into NVIDIA. [00:05:45] Efficiency Curves in Computer Vision at Nvidia [00:05:45] swyx: But like, yeah, a lot of the podcasts for today, we're actually talking about efficiency and efficiency curves over time. And the reason I invited you to this pod was I was basically looking for somebody to talk about this. And you came at this with your insight on how like this already happens with computer vision, right? [00:06:06] swyx: This sort of efficiency curve over time. So I wonder if you want to just comment about, just set the context for like what has happened in your career that you've seen already. [00:06:15] Nyla Worker: When I started was first scaling up training and making training more efficient. And that of course has evolved significantly over time. [00:06:22] Nyla Worker: There is a lot on training. But what I discovered is that if these things are truly useful, you should be obsessing about inference.
And then I went to eBay, uh, where I was in their hardware team, but I was doing software optimizations for the hardware team, such that the research that had been done for the AI research team was actually running efficiently on the hardware. [00:06:45] Nyla Worker: And there, I started leveragin
Today's guest, Nicholas Carlini, a research scientist at DeepMind, argues that we should be focusing more on what AI can do for us individually, rather than trying to have an answer for everyone. "How I Use AI" - A Pragmatic Approach Carlini's blog post "How I Use AI" went viral for good reason. Instead of giving a personal opinion about AI's potential, he simply laid out how he, as a security researcher, uses AI tools in his daily work. He divided it into 12 sections: * To make applications * As a tutor * To get started * To simplify code * For boring tasks * To automate tasks * As an API reference * As a search engine * To solve one-offs * To teach me * Solving solved problems * To fix errors Each of the sections has specific examples, so we recommend going through it. It also includes all prompts used for it; in the "make applications" case, it's 30,000 words total! My personal takeaway is that the majority of the work AI can do successfully is what humans dislike doing. Writing boilerplate code, looking up docs, taking repetitive actions, etc. These are usually boring tasks with little creativity, but with a lot of structure. This is one of the strongest arguments for why LLMs, especially for code, are more beneficial to senior employees: if you can get the boring stuff out of the way, there's a lot more value you can generate. This is less and less true as you move toward entry-level jobs, which are mostly boring and repetitive tasks. Nicholas argues both sides ~21:34 in the pod. A New Approach to LLM Benchmarks We recently did a Benchmarks 201 episode, a follow up to our original Benchmarks 101, and some of the issues have stayed the same. Notably, there's a big discrepancy between what benchmarks like MMLU test, and what the models are used for. Carlini created his own domain-specific language for writing personalized LLM benchmarks. The idea is simple but powerful: * Take tasks you've actually needed AI for in the past. * Turn them into benchmark tests. * Use these to evaluate new models based on your specific needs. It can represent very complex tasks, from a single code generation to drawing a US flag using C: "Write hello world in python" >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world") "Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \ VisionLLMRun("What flag is shown in this image?") >> \ (SubstringEvaluator("United States") | SubstringEvaluator("USA")) This approach solves a few problems: * It measures what's actually useful to you, not abstract capabilities. * It's harder for model creators to "game" your specific benchmark, a problem that has plagued standardized tests. * It gives you a concrete way to decide if a new model is worth switching to, similar to how developers might run benchmarks before adopting a new library or framework. Carlini argues that if even a small percentage of AI users created personal benchmarks, we'd have a much better picture of model capabilities in practice. AI Security While much of the AI security discussion focuses on either jailbreaks or existential risks, Carlini's research targets the space in between. Some highlights from his recent work: * LAION 400M data poisoning: By buying expired domains referenced in the dataset, Carlini's team could inject arbitrary images into models trained on LAION 400M. You can read the paper "Poisoning Web-Scale Training Datasets is Practical" for all the details.
This is a great example of expanding the scope beyond the model itself, and looking at the whole system and how it can become vulnerable. * Stealing model weights: They demonstrated how to extract parts of production language models (like OpenAI's) through careful API queries. This research, "Stealing Part of a Production Language Model", shows that even black-box access can leak sensitive information. * Extracting training data: In some cases, they found ways to make models regurgitate verbatim snippets from their training data. He and Milad Nasr wrote a paper on this as well: Scalable Extraction of Training Data from (Production) Language Models. They also think this might be applicable to extracting RAG results from a generation. These aren't just theoretical attacks. They've led to real changes in how companies like OpenAI design their APIs and handle data. If you really miss logit_bias and logit results by token, you can blame Nicholas :) We had a ton of fun also chatting about things like Conway's Game of Life, how much data can fit in a piece of paper, and porting Doom to Javascript. Enjoy! Show Notes * How I Use AI * My Benchmark for LLMs * Doom Javascript port * Conway's Game of Life * Tic-Tac-Toe in one printf statement * International Obfuscated C Code Contest * Cursor * LAION 400M poisoning paper * Man vs Machine at Black Hat * Model Stealing from OpenAI * Milad Nasr * H.D. Moore * Vijay Bolina * Cosine.sh * uuencode Timestamps * [00:00:00] Introductions * [00:01:14] Why Nicholas writes * [00:02:09] The Game of Life * [00:05:07] "How I Use AI" blog post origin story * [00:08:24] Do we need software engineering agents? * [00:11:03] Using AI to kickstart a project * [00:14:08] Ephemeral software * [00:17:37] Using AI to accelerate research * [00:21:34] Experts vs non-expert users as beneficiaries of AI * [00:24:02] Research on generating less secure code with LLMs. * [00:27:22] Learning and explaining code with AI * [00:30:12] AGI speculations? * [00:32:50] Distributing content without social media * [00:35:39] How much data do you think you can put on a single piece of paper? * [00:37:37] Building personal AI benchmarks * [00:43:04] Evolution of prompt engineering and its relevance * [00:46:06] Model vs task benchmarking * [00:52:14] Poisoning LAION 400M through expired domains * [00:55:38] Stealing OpenAI models from their API * [01:01:29] Data stealing and recovering training data from models * [01:03:30] Finding motivation in your work Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:12]: Hey, and today we're in the in-person studio, which Alessio has gorgeously set up for us, with Nicholas Carlini. Welcome. Thank you. You're a research scientist at DeepMind. You work at the intersection of machine learning and computer security. You got your PhD from Berkeley in 2018, and also your BA from Berkeley as well. And mostly we're here to talk about your blogs, because you are so generous in just writing up what you know. Well, actually, why do you write? Nicholas [00:00:41]: Because I like, I feel like it's fun to share what you've done. I don't like writing; I sufficiently didn't like writing that I almost didn't do a PhD, because I knew how much writing was involved in writing papers. I was terrible at writing when I was younger.
I did, like, the remedial writing classes when I was in university, because I was really bad at it. So I don't actually enjoy, I still don't enjoy the act of writing. But I feel like it is useful to share what you're doing, and I like being able to talk about the things that I'm doing that I think are fun. And so I write because I think I want to have something to say, not because I enjoy the act of writing. Swyx [00:01:14]: But yeah. It's a tool for thought, as they often say. Is there any sort of background or thing that people should know about you as a person? Yeah. Nicholas [00:01:23]: So I tend to focus on, like you said, I do security work, I try to like attacking things and I want to do like high quality security research. And that's mostly what I spend my actual time trying to be a productive member of society doing. But then I get distracted by things, and I just like, you know, working on random fun projects. Like a Doom clone in JavaScript. Swyx [00:01:44]: Yes. Nicholas [00:01:45]: Like that. Or, you know, I've done a number of things that have absolutely no utility. But are fun things to have done. And so it's interesting to say, like, you should work on fun things that just are interesting, even if they're not useful in any real way. And so that's what I tend to put up there: after I have completed something I think is fun, or if I think it's sufficiently interesting, I write something down there. Alessio [00:02:09]: Before we go into like AI, LLMs and whatnot, why are you obsessed with the game of life? So you built multiplexing circuits in the game of life, which is mind boggling. So where did that come from? And then how do you go from just clicking boxes on the UI web version to like building multiplexing circuits? Nicholas [00:02:29]: I like Turing completeness. The definition of Turing completeness is a computer that can run anything, essentially. And the game of life, Conway's game of life is a very simple 2D cellular automaton where you have cells that are either on or off. And a cell becomes on if in the previous generation some configuration holds true and off otherwise. It turns out there's a proof that the game of life is Turing complete, that you can run any program in principle using Conway's game of life. I don't know. And so you can, therefore someone should. And so I wanted to do it. Some other people have done some similar things, but I got obsessed into like, if you're going to try and make it work, like we already know it's possible in theory. I want to try and like actually make something I can run on my computer, like a real computer I can run. And so yeah, I've been going on this rabbit hole of trying to make a CPU that I can run semi real time on the game of life. And I have been making some reasonable progress there. And yeah, but you know, Turing completeness is just like a very fun trap you can go down. A while ago, as part of a research paper, I was able to show that in C, if you call into printf, it's Turing complete. Like printf, you know, like, which like, you know, you can print num
Betteridge's law says no: with seemingly infinite flavors of RAG, and >2 million token context + prompt caching from Anthropic/Deepmind/Deepseek, it's reasonable to believe that "in context learning is all you need". But then there's Cosine Genie, the first to make a huge bet using OpenAI's new GPT4o fine-tuning for code at the largest scale it has ever been used externally, resulting in what is now the #1 coding agent in the world according to SWE-Bench Full, Lite, and Verified: SWE-Bench has been the most successful agent benchmark of the year, receiving honors at ICLR (our interview here) and recently being verified by OpenAI. Cognition (Devin) was valued at $2b after reaching 14% on it. So it is very, very big news when a new agent appears to beat all other solutions, by a lot: While this number is self-reported, it seems to be corroborated by OpenAI, who also awarded it the clear highest marks on SWE-Bench Verified: The secret is GPT-4o finetuning on billions of tokens of synthetic data. * Finetuning: As OpenAI says: Genie is powered by a fine-tuned GPT-4o model trained on examples of real software engineers at work, enabling the model to learn to respond in a specific way. The model was also trained to be able to output in specific formats, such as patches that could be committed easily to codebases. Due to the scale of Cosine's finetuning, OpenAI worked closely with them to figure out the size of the LoRA: "They have to decide how big your LoRA adapter is going to be… because if you had a really sparse, large adapter, you're not going to get any signal in that at all. So they have to dynamically size these things." * Synthetic data: we need to finetune on the process of making code work instead of only training on working code. "…we synthetically generated runtime errors. Where we would intentionally mess with the AST to make stuff not work, or index out of bounds, or refer to a variable that doesn't exist, or errors that the foundational models just make sometimes that you can't really avoid, you can't expect it to be perfect." Genie also has a 4-stage workflow with the standard LLM OS tooling stack that lets it solve problems iteratively: Full Video Pod like and subscribe etc! Show Notes * Alistair Pullen - Twitter, Linkedin * Cosine Genie launch, technical report * OpenAI GPT-4o finetuning GA * Llama 3 backtranslation * Cursor episode and Aman + SWEBench at ICLR episode Timestamps * [00:00:00] Suno Intro * [00:05:01] Alistair and Cosine intro * [00:16:34] GPT4o finetuning * [00:20:18] Genie Data Mix * [00:23:09] Customizing for Customers * [00:25:37] Genie Workflow * [00:27:41] Code Retrieval * [00:35:20] Planning * [00:42:29] Language Mix * [00:43:46] Running Code * [00:46:19] Finetuning with OpenAI * [00:49:32] Synthetic Code Data * [00:51:54] SynData in Llama 3 * [00:52:33] SWE-Bench Submission Process * [00:58:20] Future Plans * [00:59:36] Ecosystem Trends * [01:00:55] Founder Lessons * [01:01:58] CTA: Hiring & Customers Descript Transcript [00:01:52] AI Charlie: Welcome back. This is Charlie, your AI cohost. As AI engineers, we have a special focus on coding agents, fine tuning, and synthetic data. And this week, it all comes together with the launch of Cosine's Genie, which reached 50 percent on SWE Bench Lite, 30 percent on the full SWE Bench, and 44 percent on OpenAI's new SWE Bench Verified. [00:02:17] All state of the art results by the widest ever margin recorded compared to former leaders Amazon Q, AutoCodeRover, and Factory Code Droid.
As a reminder, Cognition's Devin went viral with a 14 percent score just five months ago. Cosine did this by working closely with OpenAI to fine tune GPT-4o, now generally available to you and me, on billions of tokens of code, much of which was synthetically generated. [00:02:47] Alistair Pullen: Hi, I'm Ali, co founder and CEO of Cosine, a human reasoning lab. And I'd like to show you Genie, our state of the art, fully autonomous software engineering colleague. Genie has the highest score on SWE-Bench in the world. And the way we achieved this was by taking a completely different approach. We believe that if you want a model to behave like a software engineer, it has to be shown how a human software engineer works. [00:03:15] We've designed new techniques to derive human reasoning from real examples of software engineers doing their jobs. Our data represents perfect information lineage, incremental knowledge discovery, and step by step decision making. Representing everything a human engineer does logically. By actually training Genie on this unique dataset, rather than simply prompting base models, which is what everyone else is doing, we've seen that we're no longer simply generating random code until some works. [00:03:46] It's tackling problems like a human. [00:03:48] AI Charlie: Alistair Pullen is CEO and co founder of Cosine, and we managed to snag him on a brief trip stateside for a special conversation on building the world's current number one coding agent. Watch out and take care. [00:04:07] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co host Swyx, founder of Smol.ai. [00:04:16] swyx: Hey, and today we're back in the studio. In person, after about three to four months in visa jail and travels and all other fun stuff that we talked about in the previous episode. [00:04:27] But today we have a special guest, Ali Pullen from Cosine. Welcome. Hi, thanks for having me. We're very lucky to have you because you're on a two day trip to San Francisco. Yeah, I wouldn't recommend it. I would not [00:04:38] Alistair Pullen: recommend it. Don't fly from London to San Francisco for two days. [00:04:40] swyx: And you launched Genie on a plane. [00:04:42] On plane Wi-Fi, um, claiming state of the art in SWE-Bench, which we're all going to talk about. I'm excited to dive into your whole journey, because it has been a journey. I've been lucky to be a small angel in part of that journey. And it's exciting to see that you're launching to such acclaim and, you know, such results. [00:05:01] Alistair and Cosine intro [00:05:01] swyx: Um, so I'll go over your brief background, and then you can sort of fill in the blanks on what else people should know about you. You did your bachelor's in computer science at Exeter. [00:05:10] Speaker 6: Yep. [00:05:10] swyx: And then you worked at a startup that got acquired into GoPuff and round about 2022, you started working on a stealth startup that became a YC startup. [00:05:19] What's that? Yeah. So [00:05:21] Alistair Pullen: basically when I left university, I, I met my now co founder, Sam. At the time we were both mobile devs. He was an Android developer, I was an iOS developer. And whilst at university, we built this sort of small consultancy, sort of, we'd um, be approached to build projects for people and we would just take them up and start with, they were student projects. [00:05:41] They weren't, they weren't anything crazy or anything big.
We started with those and over time we started doing larger and larger projects, more interesting things. And then actually, when we left university, we just kept doing that. We didn't really get jobs, traditional jobs. It was also like in the middle of COVID, middle of lockdown. [00:05:57] So we were like, this is a pretty good gig. We'll just keep like writing code in our bedrooms. And yeah, that's it. We did that for a while. And then a friend of ours that we went to Exeter with started a YC startup during COVID. And it was one of these fast grocery delivery companies. At the time I was living in the deepest, darkest countryside in England, where fast grocery companies are still not a thing. [00:06:20] So he, he sort of pitched me this idea and was like, listen, like I need an iOS dev, do you fancy coming along? And I thought, absolutely. It was a chance to get out of my parents' house, chance to move to London, you know, do interesting things. And at the time, truthfully, I had no idea what YC was. I had no idea. [00:06:34] I wasn't in the startup space. I knew I liked coding and building apps and stuff, but I'd never, never really done anything in that area. So I said, yes, absolutely. I moved to London just sort of as COVID was ending and yeah, worked at what was Fancy for about a year and a half. Then we brought Sam along as well. [00:06:52] So we, Sam and I, were the two engineers at Fancy for basically its entire life, and we built literally everything. So like the, the front, the client mobile apps, the, the backends, the internal like stock management system, the driver routing algorithms, all those things. Literally like everything. It was my first. [00:07:12] You know, both of us were super inexperienced. We didn't have, like, proper engineering experience. There were definitely decisions we'd do differently now. We'd definitely buy a lot of stuff off the shelf, stuff like that. But it was the initial dip of the toe into, like, the world of startups, and we were both, like, hooked immediately. [00:07:26] We were like, this is so cool. This sounds so much better than all our friends who were, like, consultants and doing, like, normal jobs, right? We did that, and it ran its course, and after, I want to say, 18 months or so, GoPuff came and acquired us. And there was obviously a transitionary period, an integration period, like with all acquisitions, and we did that, and as soon as we'd vested what we wanted to vest, and as soon as we thought, okay, this chapter is sort of done, uh, in about 2022, we left and we knew that we wanted to go alone and try something like we'd had this taste. [00:07:54] Now we knew we'd seen how a like a YC startup was managed like up close and we knew that we wanted to do something similar ourselves. We had no idea what it was at the time. We just knew we wanted to do something. So we, we tried a small, um, some small
Disclaimer: We recorded this episode ~1.5 months ago, timed for the FastHTML release. It then got bottlenecked by Llama3.1, Winds of AI Winter, and SAM2 episodes, so we're a little late. Since then FastHTML was released, swyx is building an app in it for AINews, and Anthropic has also released their prompt caching API. Remember when Dylan Patel of SemiAnalysis coined the GPU Rich vs GPU Poor war? (if not, see our pod with him). The idea was that if you're GPU poor you shouldn't waste your time trying to solve GPU rich problems (i.e. pre-training large models) and are better off working on fine-tuning, optimized inference, etc. Jeremy Howard (see our "End of Finetuning" episode to catch up on his background) and Eric Ries founded Answer.AI to do exactly that: "Practical AI R&D", which is very in-line with the GPU poor needs. For example, one of their first releases was a system based on FSDP + QLoRA that let anyone train a 70B model on two NVIDIA 4090s. Since then, they have come out with a long list of super useful projects (in no particular order, and non-exhaustive): * FSDP QDoRA: this is just as memory efficient and scalable as FSDP/QLoRA, and critically is also as accurate for continued pre-training as full weight training. * Cold Compress: a KV cache compression toolkit that lets you scale sequence length without impacting speed. * colbert-small: state of the art retriever at only 33M params * JaColBERTv2.5: a new state-of-the-art retriever on all Japanese benchmarks. * gpu.cpp: portable GPU compute for C++ with WebGPU. * Claudette: a better Anthropic API SDK. They also recently released FastHTML, a new way to create modern interactive web apps. Jeremy recently released a 1 hour "Getting started" tutorial on YouTube; while this isn't AI related per se, it's close to home for any AI Engineers looking to iterate quickly on new products: In this episode we broke down 1) how they recruit, 2) how they organize what to research, and 3) how the community comes together. At the end, Jeremy gave us a sneak peek at something new that he's working on that he calls dialogue engineering: So I've created a new approach. It's not called prompt engineering. I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it. He explains it a bit more ~44:53 in the pod, but we'll just have to wait for the public release to figure out exactly what he means.
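To ground the FSDP + QLoRA recipe mentioned above, here is a minimal sketch of the QLoRA half using Hugging Face transformers, peft, and bitsandbytes. The model id and hyperparameters are illustrative only, and the FSDP sharding that actually lets a 70B model span two 4090s is what Answer.AI's fsdp_qlora project handles; it is not shown here:

```python
# Sketch: frozen 4-bit base weights + small trainable LoRA adapters (QLoRA).
# Assumes transformers, peft, and bitsandbytes are installed; model id illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)   # only the adapters receive gradients
model.print_trainable_parameters()    # typically well under 1% of total params
```

The memory win comes from freezing the quantized base model and backpropagating only through the adapters; sharding those frozen 4-bit weights across GPUs with FSDP is the part Answer.AI had to build themselves.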
Timestamps * [00:00:00] Intro by Suno AI * [00:03:02] Continuous Pre-Training is Here * [00:06:07] Schedule-Free Optimizers and Learning Rate Schedules * [00:07:08] Governance and Structural Issues within OpenAI and Other AI Labs * [00:13:01] How Answer.ai works * [00:23:40] How to Recruit Productive Researchers * [00:27:45] Building a new BERT * [00:31:57] FSDP, QLoRA, and QDoRA: Innovations in Fine-Tuning Large Models * [00:36:36] Research and Development on Model Inference Optimization * [00:39:49] FastHTML for Web Application Development * [00:46:53] AI Magic & Dialogue Engineering * [00:52:19] AI wishlist & predictions Show Notes * Jeremy Howard * Previously on Latent Space: The End of Finetuning, NeurIPS Startups * Answer.ai * Fast.ai * FastHTML * answerai-colbert-small-v1 * gpu.cpp * Eric Ries * Aaron DeFazio * Yi Tay * Less Wright * Benjamin Warner * Benjamin Clavié * Jono Whitaker * Austin Huang * Eric Gilliam * Tim Dettmers * Colin Raffel * Mark Saroufim * Sebastian Raschka * Carson Gross * Simon Willison * Sepp Hochreiter * Llama3.1 episode * Snowflake Arctic * Ranger Optimizer * Gemma.cpp * HTMX * UL2 * BERT * DeBERTa * Efficient finetuning of Llama 3 with FSDP QDoRA * xLSTM Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:14]: And today we're back with Jeremy Howard, I think your third appearance on Latent Space. Welcome. Jeremy [00:00:19]: Wait, third? Second? Swyx [00:00:21]: Well, I grabbed you at NeurIPS. Jeremy [00:00:23]: I see. Swyx [00:00:24]: Very fun, standing outside street episode. Jeremy [00:00:27]: I never heard that, by the way. You've got to send me a link. I've got to hear what it sounded like. Swyx [00:00:30]: Yeah. Yeah, it's a NeurIPS podcast. Alessio [00:00:32]: I think the two episodes are six hours, so there's plenty to listen to, we'll make sure to send it over. Swyx [00:00:37]: Yeah, we're trying this thing where at the major ML conferences, we, you know, do a little audio tour of, give people a sense of what it's like. But the last time you were on, you declared the end of fine tuning. I hope that I sort of editorialized the title a little bit, and I know you were slightly uncomfortable with it, but you just own it anyway. I think you're very good at the hot takes. And we were just discussing in our pre-show that it's really happening, that the continued pre-training is really happening. Jeremy [00:01:02]: Yeah, absolutely. I think people are starting to understand that treating the three ULMFiT steps of like pre-training, you know, and then the kind of like what people now call instruction tuning, and then, I don't know if we've got a general term for this, DPO, RLHF step, you know, or the task training, they're not actually as separate as we originally suggested they were in our paper, and when you treat it more as a continuum, and that you make sure that you have, you know, more of kind of the original data set incorporated into the later stages, and that, you know, we've also seen with Llama 3, this idea that those later stages can be done for a lot longer. These are all of the things I was kind of trying to describe there. It wasn't the end of fine tuning, but more that we should treat it as a continuum, and we should have much higher expectations of how much you can do with an already trained model.
You can really add a lot of behavior to it, you can change its behavior, you can do a lot. So a lot of our research has been around trying to figure out how to modify the model by a larger amount rather than starting from random weights, because I get very offended at the idea of starting from random weights. Swyx [00:02:14]: Yeah, I saw that at ICLR in Vienna, there was an outstanding paper about starting transformers from data-driven priors. I don't know if you saw that one, they called it sort of "Never Train from Scratch", and I think it was kind of rebelling against like the sort of random initialization. Jeremy [00:02:28]: Yeah, I've, you know, that's been our kind of continuous message since we started Fast AI, is if you're training from random weights, you better have a really good reason, you know, because it seems so unlikely to me that nobody has ever trained on data that has any similarity whatsoever to the general class of data you're working with, and that's the only situation in which I think starting from random weights makes sense. Swyx [00:02:51]: The other trends since our last pod that I would point people to is I'm seeing a rise in multi-phase pre-training. So Snowflake released a large model called Snowflake Arctic, where they detailed three phases of training where they had like a different mixture of like, there was like 75% web in the first instance, and then they reduced the percentage of the web text by 10% each time and increased the amount of code in each phase. And I feel like multi-phase is being called out in papers more. I feel like it's always been a thing, like changing data mix is not something new, but calling it a distinct phase is new, and I wonder if there's something that you're seeing on your end. Jeremy [00:03:32]: Well, so they're getting there, right? So the point at which they're doing proper continued pre-training is the point at which that becomes a continuum rather than a phase. So the only difference with what I was describing last time is to say like, oh, there's a function or whatever, which is happening every batch. It's not a huge difference. You know, I always used to get offended when people had learning rates that like jumped. And so one of the things I started doing early on in Fast.ai was to say to people like, no, your learning rate schedule should be a function, not a list of numbers. So now I'm trying to give the same idea about training mix. Swyx [00:04:07]: There's been pretty public work from Meta on schedule-free optimizers. I don't know if you've been following Aaron DeFazio and what he's doing, just because you mentioned learning rate schedules, you know, what if you didn't have a schedule? Jeremy [00:04:18]: I don't care very much, honestly. I don't think that schedule-free optimizer is that exciting. It's fine. We've had non-scheduled optimizers for ages, like Less Wright, who's now at Meta, who was part of the Fast.ai community there, created something called the Ranger optimizer. I actually like having more hyperparameters. You know, as soon as you say schedule-free, then like, well, now I don't get to choose. And there isn't really a mathematically correct way of, like, I actually try to schedule more parameters rather than less. So like, I like scheduling my epsilon in my Adam, for example. I schedule all the things. But then the other thing we always did with the Fast.ai library was make it so you don't have to set any schedules.
So Fast.ai always supported, like, you didn't even have to pass a learning rate. Like, it would always just try to have good defaults and do the right thing. But to me, I like to have more parameters I can play with if I want to, but you don't have to. Alessio [00:05:08]: And then the more less technical side, I guess, of your issue, I guess, with the market was some of the large research labs taking all this innovation kind of behind closed doors and whether or not that's goo
Because of the nature of SAM, this is more video heavy than usual. See our YouTube! Because vision is first among equals in multimodality, and yet SOTA vision language models are closed, we've always had an interest in learning what's next in vision. Our first viral episode was Segment Anything 1, and we have since covered LLaVA, IDEFICS, Adept, and Reka. But just like with Llama 3, FAIR holds a special place in our hearts as the New Kings of Open Source AI. The list of sequels better than the originals is usually very short, but SAM 2 delighted us by not only being a better image segmentation model than SAM 1, it also conclusively and inexpensively solved video segmentation in just as elegant a way as SAM 1 did for images, releasing everything to the community as Apache 2.0 / CC-BY 4.0. "In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM)." Surprisingly Efficient The paper reports that SAM 2 was trained on 256 A100 GPUs for 108 hours (59% more than SAM 1). Taking the upper-end $2 A100 cost off gpulist.ai means SAM 2 cost ~$50k to train (256 GPUs × 108 hours × $2/GPU-hour ≈ $55k) if it had an external market-rate cost - surprisingly cheap for adding video understanding! The newly released SA-V dataset is also the largest video segment dataset to date, with careful attention given to scene/object/geographical diversity, including that of annotators. In some ways, we are surprised that SOTA video segmentation can be done on only ~50,000 videos (and 640k masklet annotations). Model-in-the-loop Data Engine for Annotations and Demo-first Development Similar to SAM 1, a 3 Phase Data Engine helped greatly in bootstrapping this dataset. As Nikhila says in the episode, the demo you see wasn't just for show, they actually used this same tool to do annotations for the model that is now demoed in the tool: "With the original SAM, we put a lot of effort in building a high-quality demo. And the other piece here is that the demo is actually the annotation tool. So we actually use the demo as a way to improve our annotation tool. And so then it becomes very natural to invest in building a good demo because it speeds up your annotation and improves the data quality, and that will improve the model quality. With this approach, we found it to be really successful." A 90% speedup in annotation happened due to this virtuous cycle, which helped SA-V reach its incredible scale. Building the demo also helped the team live the context that their own downstream users, like Roboflow, would experience, and forced them to make choices accordingly. As Nikhila says: "It's a really encouraging trend for not thinking about only the new model capability, but what sort of applications folks want to build with models as a result of that downstream. I think it also really forces you to think about many things that you might postpone. For example, efficiency. For a good demo experience, making it real time is super important. No one wants to wait. And so it really forces you to think about these things much sooner and actually makes us think about what kind of image encoder we want to use or other hardware efficiency improvements. So those kind of things, I think, become a first-class citizen when you put the demo first." Indeed, the team swapped out standard ViT-H Vision Transformers for Hiera (Hierarchical) Vision Transformers as a result of efficiency considerations.
Memory Attention Speaking of architecture, the model design is probably the sleeper hit of a project filled with hits. The team adapted SAM 1 to video by adding streaming memory for real-time video processing: Specifically adding memory attention, a memory encoder, and a memory bank, which surprisingly ablated better than more intuitive but complex architectures like Gated Recurrent Units. One has to wonder if streaming memory can be added to pure language models with a similar approach… (pls comment if there's an obvious one we haven't come across yet!) Video Podcast Tune in to Latent Space TV for the video demos mentioned in this video podcast! Resources referenced Show References * https://sam2.metademolab.com/demo * roboflow.com/sam2 * https://github.com/autodistill/autodistill * https://github.com/facebookresearch/segment-anything-2 * https://rf100.org * https://blog.roboflow.com/label-data-with-grounded-sam-2/ * https://arxiv.org/abs/2408.00714 * https://github.com/roboflow/notebooks * https://x.com/skalskip92/status/1818648396002951178 * https://blog.roboflow.com/sam-2-video-segmentation/ Timestamps * [00:00:00] The Rise of SAM by Udio (David Ding Edit) * [00:03:07] Introducing Nikhila * [00:06:38] The Impact of SAM 1 in 2023 * [00:12:15] Do People Finetune SAM? * [00:16:05] Video Demo of SAM * [00:20:01] Why the Demo is so Important * [00:23:23] SAM 1 vs SAM 2 Architecture * [00:26:46] Video Demo of SAM on Roboflow * [00:32:44] Extending SAM 2 with other models * [00:35:00] Limitations of SAM: Screenshots * [00:38:56] SAM 2 Paper * [00:39:15] SA-V Dataset and SAM Data Engine * [00:43:15] Memory Attention to solve Video * [00:47:24] "Context Length" in Memory Attention * [00:48:17] Object Tracking * [00:50:52] The Future of FAIR * [00:52:23] CVPR, Trends in Vision * [01:02:04] Calls to Action Transcript [00:00:00] [music intro] [00:02:11] AI Charlie: Happy Yoga! This is your AI co host Charlie. Thank you for all the love for our special 1 million downloads Winds of AI Winter episode last week, especially Sam, Archie, Trellis, Morgan, Shrey, Han, and more. For this episode, we have to go all the way back to the first viral episode of the podcast Segment Anything Model and the Hard Problems of Computer Vision, which we discussed with Joseph Nelson of Roboflow. [00:02:39] AI Charlie: Since Meta released SAM 2 last week, we are delighted to welcome Joseph back as our fourth guest co host to chat with Nikhila Ravi, Research Engineering Manager at Facebook AI Research and lead author of SAM 2. Just like our SAM 1 podcast, this is a multimodal pod because of the vision element, so we definitely encourage you to hop over to our YouTube at least for the demos, if not our faces. [00:03:04] AI Charlie: Watch out and take care. [00:03:10] Introducing Nikhila [00:03:10] swyx: Welcome to the Latent Space podcast. I'm delighted to do Segment Anything 2. One of our very first viral podcasts was Segment Anything 1 with Joseph. Welcome back. Thanks so much. And this time we are joined by the lead author of Segment Anything 2, Nikhila Ravi, welcome. [00:03:25] Nikhila Ravi: Thank you. Thanks for having me. [00:03:26] swyx: There's a whole story that we can refer people back to the episode of the podcast way back when for the story of Segment Anything, but I think we're interested in just introducing you as a researcher, as a, on the human side. What was your path into AI research?
Why, you know, why did you choose computer vision coming out of your specialization at Cambridge? [00:03:46] Nikhila Ravi: So I did my undergraduate degree in engineering at Cambridge University. The engineering program is very general. So first couple of years, you sort of study everything from mechanical engineering to fluid mechanics, structural mechanics, material science, and also computer science. [00:04:04] Nikhila Ravi: Towards the end of my degree, I started taking more classes in machine learning and computational neuroscience, and I really enjoyed it. And actually after graduating from undergrad, I had a place at Oxford to study medicine. And so I was initially planning on becoming a doctor, had everything planned, and then decided to take a gap year after finishing undergrad. [00:04:28] Nikhila Ravi: And actually that was around the time that sort of deep learning was emerging. And in my machine learning class in undergrad, I remember one day our professor came in and that was when Google acquired DeepMind. And so that became like a huge thing. We talked about it for the whole class. It kind of really stuck. [00:04:48] Nikhila Ravi: And that kicked me off thinking about, okay, maybe I want to try something different other than medicine. Maybe this is a different path I want to take. And then in the gap year, I did a bunch of coding, worked on a number of projects. Did some sort of freelance contracting work. And then I got a scholarship to come and study in America. [00:05:06] Nikhila Ravi: So I went to Harvard for a year, took a bunch of computer science classes at Harvard and MIT, worked on a number of AI projects, especially in computer vision. I really, really enjoyed working in computer vision. I applied to Facebook and got this job at Facebook, and I've now been at Facebook, at the time, now Meta, for seven years. So very circuitous path, probably quite unconventional: I didn't do a PhD, I'm not like a typical research scientist, definitely came from more of an engineering background. But since being at Meta, I have had amazing opportunities to work across so many different interesting problems in computer vision, from 3D computer vision, how can you go from images of objects to 3D structures, and then going back to 2D computer vision and actually understanding the objects and the pixels and the images themselves. So it's been a very interesting journey over the past seven years. [00:06:05] swyx: It's weird because like, I guess with Segment Anything 2, it's like 4D because you solve time, you know, you started with 3D and now you're solving the 4D. [00:06:14] Nikhila Ravi: Yeah, it's just going from 3D to images to video. It's really covering the full spectrum. And actually, one of the nice things has been, so I think I mentioned I wanted to become a doctor, but actually SAM is having so much impact in medicine, probably more than I co
Thank you for 1m downloads of the podcast and 2m readers of the Substack! 🎉 This is the audio discussion following The Winds of AI Winter essay that also serves as a recap of Q2 2024 in AI viewed through the lens of our Four Wars framework. Enjoy! Full Video Discussion Full show notes are here. Timestamps * [00:00:00] Intro Song by Suno.ai * [00:02:01] Swyx and Alessio in Singapore * [00:05:49] GPU Rich vs Poors: Frontier Labs * [00:06:35] GPU Rich Frontier Models: Claude 3.5 * [00:10:37] GPU Rich helping Poors: Llama 3.1: The Synthetic Data Model * [00:15:41] GPU Rich helping Poors: Frontier Labs Vibe Shift - Phi 3, Gemma 2 * [00:18:26] GPU Rich: Mistral Large * [00:21:56] GPU Rich: Nvidia + FlashAttention 3 * [00:23:45] GPU Rich helping Poors: Noam Shazeer & Character.AI * [00:28:14] GPU Poors: On Device LLMs: Mozilla Llamafile, Chrome (Gemini Nano), Apple Intelligence * [00:35:33] Quality Data Wars: NYT vs The Atlantic lawyer up vs partner up * [00:37:41] Quality Data Wars: Reddit, ScarJo, RIAA vs Udio & Suno * [00:41:03] Quality Data Wars: Synthetic Data, Jagged Intelligence, AlphaProof * [00:45:33] Multimodality War: ChatGPT Voice Mode, OpenAI demo at AIEWF * [00:47:34] Multimodality War: Meta Llama 3 multimodality + Chameleon * [00:50:54] Multimodality War: PaliGemma + CoPaliGemma * [00:52:55] Renaming Rag/Ops War to LLM OS War * [00:55:31] LLM OS War: Ops War: Prompt Management vs Gateway vs Observability * [01:02:57] LLM OS War: BM42 Vector DB Wars, Memory Databases, GraphRAG * [01:06:15] LLM OS War: Agent Tooling * [01:08:26] LLM OS War: Agent Protocols * [01:10:43] Trend: Commoditization of Intelligence * [01:16:45] Trend: Vertical Service as Software, AI Employees, Brightwave, Dropzone * [01:20:44] Trend: Benchmark Frontiers after MMLU * [01:23:31] Crowdstrike will save us from Skynet * [01:24:30] Bonus: ChatGPT Advanced Voice Mode Demo * [01:25:37] Voice Mode: Storytelling * [01:27:55] Voice Mode: Accents * [01:31:48] Voice Mode: Accent Detection * [01:35:00] Voice Mode: Nonverbal Emotions * [01:37:53] Voice Mode: Multiple Voices in One * [01:40:52] Voice Mode: Energy Levels Detection * [01:42:03] Voice Mode: Multilinguality * [01:43:53] Voice Mode: Shepard Tone * [01:46:57] Voice Mode: Generating Tones * [01:49:39] Voice Mode: Interruptions don't work * [01:49:55] Voice Mode: Reverberations * [01:51:37] Voice Mode: Mimicry doesn't work Transcript Charlie [00:01:08]: Welcome back, listeners. This is your AI co-host, Charlie. It's been a few months since we took a step back from the interview format and talked about the show. We're happy to share that we have crossed one million downloads and two million reads on Substack. Woo-hoo. We are really grateful to those of you who keep tuning in and sharing us with your friends, especially those of you who watch and comment on our new YouTube channel, where we are trying to grow next. For a special millionaire edition, Swyx and Alessio are finally back in person in sunny Singapore to discuss the big vibe shift in the last three months, that we are calling the Winds of AI Winter. We also discuss my nemesis, ChatGPT Advanced Voice Mode, with a special treat for those who stay till the end. Now, more than ever, watch out and take care. Alessio [00:02:02]: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and today we're in the Singapore studio with Swyx. Swyx [00:02:11]: Hey, this is our long-awaited one-on-one episode. I don't know how long ago the previous one was.
Do you remember? Three, four months? Alessio [00:02:20]: Yeah, it's been a while. Swyx [00:02:22]: People really enjoyed it. It's just really, I think our travel schedules have been really difficult to get this stuff together. And then we also had like a decent backlog of guests for a while. I think we've kind of depleted that backlog now and we need to build it up again. But it's been busy and there's been a lot of news. So we actually get to do this like sort of rapid fire thing. I think some people, you know, the podcast has grown a lot in the last six months. Maybe just reintroducing like what you're up to, what I'm up to, and why we're here in Singapore and stuff like that. Alessio [00:02:51]: Yeah. My first time here in Singapore, which has been really nice. This country is really amazing, I would say. First of all, everything feels like the busiest part of the city. Everything is skyscrapers. There's like plants in all the buildings, or at least in the areas that I've been in, which has been awesome. And I was at one of the offices kind of on the south side and from the 38th floor, you can see Indonesia on one side and you can see Malaysia on the other side. So it's quite, quite small. One of the people there said their kid goes to school at the border with Malaysia basically, so they could drive to Malaysia every day. So they go pick her up from school. Yeah. And we came here, we hosted with you the Sovereign AI Summit on Wednesday night. We had a lot of folks. Swyx [00:03:31]: NVIDIA, Goldman, Temasek, Singtel. Alessio [00:03:34]: And we got to talk about this trend of sovereign AI, which maybe we might cover on another episode, but basically how do you drive, if you're a country, how do you drive productivity growth in a time where populations are shrinking, the workforce is shrinking and AI can kind of supplement a lot of this. And then the question is, okay, should I put all this money in foundation models? Should I put it in data centers and infrastructure? Should I put it in GPUs? Should I put it in agents and whatnot? So we'll touch on some of these trends in the episode, but it was a fun event. And I did not expect some of the most senior people at the largest financial institution in Singapore to ask about state space models and some of the alternatives. So it's great to see how advanced the conversation is sometimes. Swyx [00:04:16]: Yeah. I think that that is mostly people trying to listen to jargon that is being floated around as like, oh, what could kill transformers? And then they jump straight there without actually exploring the fundamentals, the basics of what they will actually put to work. That's fine. It's a forum to ask questions. So you want to ask about the future, but I feel like it's not very practical to spend so much time on those things. Part of the things that I do in this space, especially when I travel, is to try to ask questions about what countries that are not the US and not San Francisco can do, because everyone feels a bit left out. You feel it here as well. And I'm trying to promote alternatives. I think AI engineering is one way that countries can capitalize on the industry without building a hundred billion dollar cluster, which is one-fifth the GDP of Singapore. And so my pitch at the summit was that we would sample with the AIGeneration. We're also working on bringing the AI Engineer conference to Singapore next year together with ICLR.
So yeah, I'm just trying my best and I'm being looped into various government meetings to try to make that happen. Alessio [00:05:25]: Well, we'll definitely be here next year. I'll be back here very often. It's really nice. Swyx [00:05:31]: Yeah. Awesome. Okay. Well, we have a lot of news. How do you think we should cover it? Alessio [00:05:36]: Maybe just recap, since the framework of the Four Wars of AI is something that came up end of last year. So basically, we'll link in the show notes, but the end of year recap for 2023 was basically the Four Wars of AI, which we picked GPU-rich versus GPU-poor, the data quality wars, the multimodality wars, and the RAG slash Ops wars. So usually everything falls back under those four categories. So I'm pretty happy that seven months later, it's something that still matters. Swyx [00:06:07]: It still kind of holds up. Alessio [00:06:08]: Yeah. Most AI stuff from eight months ago, it's really not that relevant anymore. And today we'll try and bucket some of the recent news on it. We haven't done a monthly thing in like three months. So three months is a lot of stuff. Swyx [00:06:23]: That's mostly because I got busy with the conference. But I do want to get back on that horse or maybe just do it weekly so that I don't have such a big lift that I don't do it. I think the activation energy is the problem really. So yeah, I think frontier model wise, it seems like Claude has really carved out a persistent space for itself. For a long time, I thought it was kind of like a clear number two to OpenAI. And with 3.5 Sonnet, at least in some of the hard benchmarks on LMSys or coding benchmarks on LMSys, it is the undisputed number one model in the world, even with 4o Mini. And we can talk about 4o Mini and benchmarking later on. But for Claude to be there and hold that position for what is more than a month now in AI time is a big deal. There's not much that people know publicly about what Anthropic did for Claude 3.5 Sonnet. But I think it's still a huge achievement. It marks the beginning of a non-OpenAI-centric world to the point where people on Twitter have canceled ChatGPT. That's been a trend that's been going on for a while. We talked about the unbundling of ChatGPT. But now new open source projects and tooling, they're just built for Claude. They don't even use OpenAI. That's a strategic threat to OpenAI, I think, a little bit. Obviously, OpenAI is so big that it doesn't really care about that. But for Anthropic, it's a big win. I think to see that going and to see Anthropic differentiating itself and actually implementing research. So the rumor is that the scaling monosemanticity paper that they put out two months ago was a big part of Claude 3.5 Sonnet. I've had off-the-record chats with people about that idea, and they don't agree that it is the only cause. So I was thinking this is the only thing that they did. But people say that there's about four or five other tricks that they haven't disclosed yet that went int
If you see this in time, join our emergency LLM paper club on the Llama 3 paper! For everyone else, join our special AI in Action club on the Latent Space Discord for a special feature with the Cursor cofounders on Composer, their newest coding agent! Today, Meta is officially releasing the largest and most capable open model to date, Llama3-405B, a dense transformer trained on 15T tokens that beats GPT-4 on all major benchmarks: The 8B and 70B models from the April Llama 3 release have also received serious spec bumps, warranting the new label of Llama 3.1. If you are curious about the infra / hardware side, go check out our episode with Soumith Chintala, one of the AI infra leads at Meta. Today we have Thomas Scialom, who led Llama2 and now Llama3 post-training, so we spent most of our time on pre-training (synthetic data, data pipelines, scaling laws, etc) and post-training (RLHF vs instruction tuning, evals, tool calling). Synthetic data is all you need Llama3 was trained on 15T tokens, 7x more than Llama2 and with 4 times as much code and 30 different languages represented. But as Thomas beautifully put it: "My intuition is that the web is full of s**t in terms of text, and training on those tokens is a waste of compute." "Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2." While it is well speculated that the 8B and 70B were "offline distillations" of the 405B, there are a good deal more synthetic data elements to Llama 3.1 than expected. The paper explicitly calls out: * SFT for Code: 3 approaches for synthetic data for the 405B bootstrapping itself with code execution feedback, programming language translation, and docs backtranslation. * SFT for Math: The Llama 3 paper credits the Let's Verify Step By Step authors, who we interviewed at ICLR: * SFT for Multilinguality: "To collect higher quality human annotations in non-English languages, we train a multilingual expert by branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingual tokens." * SFT for Long Context: "It is largely impractical to get humans to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic data based on the key long-context use-cases: (possibly multi-turn) question-answering, summarization for long documents, and reasoning over code repositories, and describe them in greater detail below" * SFT for Tool Use: trained for Brave Search, Wolfram Alpha, and a Python Interpreter (a special new ipython role) for single, nested, parallel, and multiturn function calling. * RLHF: DPO preference data was used extensively on Llama 2 generations. This is something we partially covered in RLHF 201: humans are often better at judging between two options (i.e. which of two poems they prefer) than creating one (writing one from scratch). Similarly, models might not be great at creating text but they can be good at classifying their quality. Last but not least, Llama 3.1 received a license update explicitly allowing its use for synthetic data generation. Llama2 was also used as a classifier for all pre-training data that went into the model. It labelled data both by quality, so that bad tokens were removed, and by type (i.e. science, law, politics) to achieve a balanced data mix.
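As a purely hypothetical sketch of that classifier step (the rubric, labels, and threshold here are illustrative, not Meta's actual pipeline; `llm` stands in for any chat-completion call):

```python
import json

RUBRIC = """Rate this document for use as LLM pre-training data.
Reply with JSON only: {{"quality": 0-5, "topic": "science|law|politics|code|other"}}
Document:
{doc}"""

def classify(doc, llm):
    """Ask the labeling model for a quality score and a topic label."""
    return json.loads(llm(RUBRIC.format(doc=doc[:4000])))

def filter_and_bucket(docs, llm, min_quality=3):
    buckets = {}  # topic -> kept docs, later sampled to balance the data mix
    for doc in docs:
        label = classify(doc, llm)
        if label["quality"] >= min_quality:  # drop low-quality tokens entirely
            buckets.setdefault(label["topic"], []).append(doc)
    return buckets
```

The two outputs map directly to the two uses described above: the quality score gates what gets into the corpus at all, and the topic buckets are what you then sample from to hit a target data mix.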
Tokenizer size matters The token vocab of a model is the collection of all tokens that the model uses. Llama2 had a 32,000 token vocab, GPT-4 has 100,000, and 4o went up to 200,000. Llama3 went up 4x to 128,000 tokens. You can find the GPT-4 vocab list on Github. This is something that people gloss over, but there are many reasons why a large vocab matters: * More tokens allow it to represent more concepts, and then be better at understanding the nuances. * The larger the tokenizer, the fewer tokens you need for the same amount of text, extending the perceived context size. In Llama3's case, that's ~30% more text due to the tokenizer upgrade. * With the same amount of compute you can train more knowledge into the model as you need fewer steps. The smaller the model, the larger the impact that the tokenizer size will have on it. You can listen at 55:24 for a deeper explanation. Dense models = 1 Expert MoEs Many people on X asked "why not MoE?", and Thomas' answer was pretty clever: dense models are just MoEs with 1 expert :) [00:28:06]: I heard that question a lot, different aspects there. Why not MoE in the future? The other thing is, I think a dense model is just one specific variation of the model for an hyperparameter for an MOE with basically one expert. So it's just an hyperparameter we haven't optimized a lot yet, but we have some stuff ongoing and that's an hyperparameter we'll explore in the future. Basically… wait and see! Llama4 Meta already started training Llama4 in June, and it sounds like one of the big focuses will be around agents. Thomas was one of the authors behind GAIA (listen to our interview with Thomas in our ICLR recap) and has been working on agent tooling for a while with things like Toolformer. Current models have "a gap of intelligence" when it comes to agentic workflows, as they are unable to plan without the user relying on prompting techniques and loops like ReAct, Chain of Thought, or frameworks like Autogen and Crew. That may be fixed soon? 👀 The whole podcast was a lot of fun to record, as usual you can find show notes and chapters below. Make sure to also subscribe on YouTube!
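Circling back to the tokenizer point: you can see the vocab-size effect yourself by comparing the ~100k GPT-4 tokenizer against the ~200k GPT-4o tokenizer with tiktoken. This is a rough stand-in for the Llama 2 → Llama 3 jump (the exact savings vary a lot by language and content type):

```python
import tiktoken

# Any reasonably long sample text works; repetition just makes counts stable.
text = "Efficient tokenizers stretch the same context window across more text. " * 100

for name in ("cl100k_base", "o200k_base"):  # GPT-4 vs GPT-4o vocabularies
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))  # larger vocab => fewer tokens for the same text
```

Fewer tokens per document is exactly why the same nominal context window "feels" bigger, and why training steps go further, after a vocab upgrade.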
🙏 Full Video Podcast Show Notes * Thomas Scialom * Recital * Galactica * Lucas Beyer - Citation Generator * Llama 2 paper * Guillaume Lample * Hugo Touvron * April 2023 Llama 3 release * Llama3 Repo * Chinchilla trap * Agents research * Thomas' paper: Augmented Language Models: A Survey * GAIA: Gaia General Assistant Benchmark (we interviewed Thomas at ICLR on this) * Toolformer paper * JEPA * Clémentine Fourrier episode * Nathan Lambert episode * Noam Shazeer * Optimizing AI Inference at Character.AI aka Shazeer et al 2024 - we misspoke and said "native FP8" when we meant INT8 * The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits * Mentioned Papers * MobileLLM * SmolLM * Overleaf * AlphaGo * Lindy AI Timestamps * Song credit: Code of the Future via Udio * [00:00:13] Introducing Thomas * [00:03:18] BLOOM and Meta Galactica * [00:06:33] Leading Llama 2 * [00:09:56] Going 100x Chinchilla Scaling Laws * [00:12:15] Open Sourcing Llama 3 405B * [00:14:29] Quantization with INT8 / FP8 / Ternary (1.58 Bits) * [00:16:58] MobileLLM, SmolLM, On Device Models * [00:17:36] Llama 3 Architecture * [00:18:33] Llama 3 Tokenizer: 128k and beyond * [00:23:12] Synthetic Data for Pretraining * [00:25:08] Synthetic Data from Augmented Language Models * [00:27:19] Data Mix and Continual Pretraining * [00:29:16] Adding Code, Reasoning, Multilinguality to Llama 3 * [00:30:39] Nvidia Nemotron and dedicated SynData Models * [00:31:30] Why no MOE? * [00:32:23] RLHF: Humans as Discriminators > Annotators * [00:38:37] Teacher Forcing/Critique * [00:42:02] Llama 3 Benchmarking * [00:45:24] Llama 3 Arena ELO * [00:47:27] Calibration Evals * [00:49:23] Function Calling * [00:50:17] Llama 4's plan for Agents * [00:55:09] The State of Variable/Long Inference Research * [00:57:19] Llama 4 Focus * [00:59:15] AI Startups * [01:03:34] Call to Action - Hiring Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:13]: Hey, and today we have a very special episode with Thomas Scialom. I don't know how to describe, you've done so much work in a very short amount of time at Meta, but you were most notably leading Llama 2 and now today we're also coordinating on the release of Llama 3. So welcome. Thomas [00:00:28]: Thanks for having me. Swyx [00:00:29]: So let's play obviously the Llama 3 405B. Is that the official size number that we're going with, or do we just say 400B? Thomas [00:00:37]: For the text model only, yes. A bit of additional parameters for the multimodal version that will come later. Swyx [00:00:44]: Awesome. Just to quickly go over your background, actually we had a slightly similar past. I was also a quantitative trader and it looks like you did five years in QuantFinance, working a trading timer at SocGen, and then you transitioned into natural language, getting a PhD at Sorbonne, working on Recital as well. And then right after your PhD, joining Meta. Thomas [00:01:04]: No, it's exactly that, but basically I think it's at the AlphaGo moment where I was doing some trading. I was like, I need to understand what's the technology behind that. And I wanted to study machine learning. I did first some training, like six months degree, executive degree, at the end of which I knew like what XGBoost was at the time, and nothing about deep learning at all.
And most of the people around were like PhD people, and I was like, okay, PhD seems pretty cool, deep learning seems pretty cool, so I want to do a PhD in deep learning. That's where I joined, we have this PhD program in France within a company and academia. And so I did my PhD with Recital and Sorbonne University on natural language generation and reinforcement learning. I guess it was a good topic. I was not like a visionary. It was very random. I had a company that offered me this topic, and it was something like I started two weeks before BERT. Excellent timing. Swyx [00:02:03]: Yeah. We actually also just released our episode with Clémentine Fourrier, who also did her PhD with a company in kind of like a very similar format. I think, yeah, very underrated, very underrated, this sort of PhD with industry expertise, becau
The first AI Engineer World’s Fair talks from OpenAI and Cognition are up! In our Benchmarks 101 episode back in April 2023 we covered the history of AI benchmarks, their shortcomings, and our hopes for better ones. Fast forward 1.5 years: the pace of model development has far exceeded the speed at which benchmarks are updated. Frontier labs are still using MMLU and HumanEval for model marketing, even though most models are reaching their natural plateau at a ~90% success rate (any higher and they’re probably just memorizing/overfitting). From Benchmarks to Leaderboards Outside of being stale, lab-reported benchmarks also suffer from non-reproducibility. The models served through the API also change over time, so the same model might return different scores at different points in time. Today’s guest, Clémentine Fourrier, is the lead maintainer of HuggingFace’s OpenLLM Leaderboard. Their goal is to standardize how models are evaluated by curating a set of high-quality benchmarks, and then publishing the results in a reproducible way with tools like EleutherAI’s Harness. The leaderboard was first launched in summer 2023 and quickly became the de facto standard for open source LLM performance. To give you a sense for the scale: * Over 2 million unique visitors * 300,000 active community members * Over 7,500 models evaluated Last week they announced the second version of the leaderboard. Why? Because models were getting too good! The new version of the leaderboard is based on 6 benchmarks: * 📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper) * 📚 GPQA (Google-Proof Q&A Benchmark, paper) * 💭MuSR (Multistep Soft Reasoning, paper) * 🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper) * 🤝 IFEval (Instruction Following Evaluation, paper) * 🧮 🤝 BBH (Big Bench Hard, paper) You can read the reasoning behind each of them on their announcement blog post. These updates had some clear winners and losers, with models jumping up or down by as many as 50 spots at once; the most likely reason is that those models were overfit to the old benchmarks, or had some contamination in their training dataset. But the most important change is in the absolute scores. All models score much lower on v2 than they do on v1, which now creates a lot more room for models to show improved performance. On Arenas Another high-signal platform for AI Engineers is the LMSys Arena, which asks users to pick the better of two different models’ outputs for the same prompt, and then assigns the models Elo scores based on the outcomes (a minimal sketch of the update rule follows below). Clémentine called arenas “sociological experiments”: they tell you a lot about users’ preferences, but not always much about model capabilities. She pointed to Anthropic’s sycophancy paper as early research in this space: We find that when a response matches a user’s views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. The other issue is that Arena rankings aren’t reproducible, as you don’t know who ranked what, or what exactly the outputs were at the time of ranking. They are still quite helpful as tools, but they aren’t a rigorous way to rank the capabilities of the models. Her advice for both arenas and leaderboards is to use these tools as ranges: find 3-4 models that fit your needs (speed, cost, capabilities, etc.) and then do vibe checks to figure out which one is best for your specific task. 
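For the curious, the classic online Elo update that arena-style leaderboards popularized looks roughly like the sketch below. Treat it as the intuition rather than LMSys’ exact pipeline — they have since moved to fitting Bradley–Terry ratings offline over the full battle history:

```python
# Online Elo update for a pairwise model "battle" (a sketch of the idea).
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B given current ratings.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    # score_a: 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    e_a = expected_score(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta

# A 1200-rated model beating a 1300-rated one gains ~20 points:
print(elo_update(1200, 1300, 1.0))  # -> (~1220.5, ~1279.5)
```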
LLMs aren’t good judges In the last ~6 months, there has been increased interest in using LLMs as Judges: rather than asking a person to evaluate the outcome of a model, you can ask a more powerful LLM to score it. We covered this a bit in our Brightwave episode last month as well. HuggingFace also has a cookbook on it, but Clémentine was actually not a fan of this approach: * Mode collapse: if you are asking a model to choose which output is better, it will just self-reinforce its own preferences. It will also prefer models from its own family (e.g. GPT models will prefer other GPT models over Claude outputs). If these outputs are then used to fine-tune the model, you will further mode-collapse it. Cohere, for example, has said they do not train on any model-generated data to avoid this. * Positional bias: LLMs usually prefer the first answer, so you can’t naively give them options in a fixed order and ask them to rank them; you have to mix up the order in which the answers appear. * Don’t score, rank: rather than asking a model to assign a score to each output, you should have it stack-rank them. The models aren’t trained to score things, so even though they might understand which response is better, assigning a score to it is hard. If you do have to use LLMs as Judges (we aren’t all ScaleAI-rich!), she suggested using an open LLM like Prometheus or JudgeLM so you can reproduce those rankings in the future (a position-debiasing sketch follows this episode’s transcript excerpt). Show Notes * Clémentine Fourrier * Hugging Face * OpenLLM v2 Leaderboard * Let’s talk about LLM Evaluation * Leaderboard V2 Blog Post * Latent Space Benchmarks 101 * Gradient AI episode on Long Context Evals * Allen AI long context novel evals Companies and Organizations * Anthropic * Cohere * EleutherAI * INRIA * ICLR (International Conference on Learning Representations) People * Aidan Gomez * Dan Hendrycks * Edward Beeching * Hailey Schoelkopf * Lewis Tunstall * Nathan Habib * Thomas Scialom Projects, Models, and Benchmarks * LMSys Arena * ARC AGI Challenge * Allen Institute ARC Challenge * BigBench * GAIA benchmark * GPQA * GSM8K * IFEval * LightEval * MLPerf * MMLU * JudgeLM * Prometheus * RavenWolf * SWE-Bench * Vantage Timestamps * [00:00:00] Introductions * [00:02:32] How Clémentine went from geology to AI * [00:05:52] Origin of the OpenLLM Leaderboard * [00:09:06] How v1 Benchmarks Were Selected * [00:10:49] The Problem with Current Benchmarks * [00:13:45] Saturating benchmarks and the future of evaluation * [00:16:14] Issues with human evaluations * [00:24:07] AI girlfriends as the multi-turn benchmark * [00:25:35] What's New in OpenLLM leaderboard V2 * [00:28:12] Benchmark Answers Black Market * [00:30:21] The impact of prompt formatting on model evaluation scores * [00:33:30] Difficulty and Computational Constraints of Evals * [00:36:28] The Responsibility of Setting Standards * [00:40:35] The Economics of OpenLLM * [00:44:15] Long context reasoning benchmarks * [00:46:34] Agent benchmarks, GAIA, and the ARC AGI challenge * [00:50:43] Vibe check for benchmarks * [00:53:16] Request for benchmarks * [00:56:48] v3 predictions? Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:13]: Hey, and today we have a super special guest that we've been trying to book on the schedule for a while. It's Clémentine Fourrier. I'm trying my best to do the French, but maybe you can do a better job of it than me. 
Clémentine [00:00:26]: This was perfect. It's Clémentine Fourrier, but your pronunciation was really on point. Swyx [00:00:31]: There was a Fourrier, which is very sort of French intonation, which I don't really understand. So I'll introduce you off of your LinkedIn and I would love for you to fill in the blanks. You are currently a research scientist at Hugging Face and the maintainer of the OpenLLM leaderboard, which we'll talk about very shortly. Previously, you were at INRIA as well, but then it looks like you also concurrently got your PhD at the same time. How does that work? Is that a very common thing? Clémentine [00:01:01]: So I basically did my PhD at INRIA, technically. So INRIA funded my PhD and PhDs in France are three years, but I also worked as an engineer at INRIA before my PhD, hence maybe the confusion. Swyx [00:01:14]: I think there's a rise in universities having sort of industrial attachments to these things. And I think it actually makes for a much more grounded study, especially if you're doing your sort of graduate studies and all these things. I think it's rising in North America as well with Berkeley and with Waterloo in Toronto. Cool. Like, you know, there's, there's a lot of other things we can, we can introduce. I can't really pronounce the name of the, the university you went to, but what else should people know? Clémentine [00:01:44]: So I actually, technically I'm an engineer in geology. So I studied rocks and I graduated in 2015 after having done like extensive studies about rocks. And I discovered I was very bad at it, but I was very good at computer science. So I went to computer science. What stuck with me though is that geology is very much an experimental science. And I think that machine learning is very much an experimental science too, even though people want to claim that it's pure math. And I worked on several machine learning projects throughout the years, a bit of the prediction of illnesses in the brain at Brain and Spine Institute in Paris. I worked as an engineer in a research team in NLP where I did my thesis and then I joined Hugging Face. Swyx [00:02:32]: Do you have a favorite rock fact or sort of rock story before we get into the NLP stuff? Clémentine [00:02:38]: Okay. I was not expecting this question. Swyx [00:02:43]: I did my geography A-levels and I always loved learning about like isostasy and stuff like that where you have different plates kind of up and down in the mantle. And I don't think people think about vertical dimensions to geographical plates, but it's real. Clémentine [00:03:02]: Yeah, definitely. And like when you do geology, the time scale is just not the same. There is like one specific place in France where you can see rocks that are 1 billion years old and like the sheer scale of this is huge. Yeah, that's what I loved about geology, that the scale is completel
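One practical footnote to Clémentine’s LLM-as-judge advice above: positional bias is cheap to control for. Below is a minimal sketch — ask the judge twice with the candidate order swapped, and only keep verdicts that agree. Here `judge(prompt, first, second)` is a hypothetical wrapper around whatever open judge model you deploy (e.g. Prometheus or JudgeLM) that returns "first" or "second":

```python
# Position-debiased pairwise judging: only trust order-invariant verdicts.
# `judge` is a hypothetical callable, not a specific library API.
def debiased_preference(judge, prompt: str, a: str, b: str) -> str | None:
    v1 = judge(prompt, a, b)  # A shown first
    v2 = judge(prompt, b, a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "a"            # A preferred in both orders
    if v1 == "second" and v2 == "first":
        return "b"            # B preferred in both orders
    return None               # verdict flipped with position: discard it
```

Discarding the inconsistent verdicts (rather than averaging them) keeps the eval honest about what the judge actually knows.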
Livestreams for the AI Engineer World’s Fair (Multimodality ft. the new GPT-4o demo, GPUs and Inference (ft. Cognition/Devin), CodeGen, Open Models tracks) are now live! Subscribe to @aidotEngineer to get notifications of the other workshops and tracks! It’s easy to get desensitized to new models topping leaderboards every other week — however, the top of the LMsys leaderboard has typically been the exclusive domain of very large, very, very well-funded model labs like OpenAI, Anthropic, Google, and Meta. OpenAI had about 600 people at the time of GPT-4, and Google Gemini had 950 co-authors. This is why Reka Core made waves in May - not only debuting at #7 on the leaderboard, but doing so with all-new GPU infrastructure, 20 employees, and a relatively puny $60m in funding. Shortly after the release of GPT-3, Sam Altman speculated on the qualities of “10,000x researchers”: * “They spend a lot of time reflecting on some version of the Hamming question—"what are the most important problems in your field, and why aren’t you working on them?” In general, no one reflects on this question enough, but the best people do it the most, and have the best ‘problem taste’, which is some combination of learning to think independently, reason about the future, and identify attack vectors.” — sama * Taste is something both John Schulman and Yi Tay emphasize greatly * “They have a laser focus on the next step in front of them combined with long-term vision.” — sama * “They are extremely persistent and willing to work hard… They have a bias towards action and trying things, and they’re clear-eyed and honest about what is working and what isn’t” — sama “There's a certain level of sacrifice to be an AI researcher, especially if you're training LLMs, because you cannot really be detached… your jobs could die on a Saturday at 4am, and there are people who will just leave it dead until Monday morning, or there will be people who will crawl out of bed at 4am to restart the job, or check the TensorBoard” – Yi Tay (at 28 mins) “I think the productivity hack that I have is, I didn't have a boundary between my life and my work for a long time. So I think I just cared a lot about working most of the time. Actually, during my PhD, Google and everything [else], I'll be just working all the time. It's not like the most healthy thing, like ever, but I think that that was actually like one of the biggest, like, productivity, like and I spent, like, I like to spend a lot of time, like, writing code and I just enjoy running experiments, writing code” — Yi Tay (at 90 mins) * See @YiTayML example for honest alpha on what is/is not working and so on. More recently, Yi’s frequent co-author, Jason Wei, wrote about the existence of Yolo researchers he witnessed at OpenAI: Given the very aggressive timeline — Yi left Google in April 2023, was GPU constrained until December 2023, and then Reka Flash (21B) was released in Feb 2024, and Reka Core (??B) was released in April 2024 — Reka’s 3-5 person pretraining team had no other choice but to do Yolo runs. Per Yi: “Scaling models systematically generally requires one to go from small to large in a principled way, i.e., run experiments in multiple phases (1B->8B->64B->300B etc) and pick the winners and continuously scale them up. In a startup, we had way less compute to perform these massive sweeps to check hparams. In the end, we had to work with many Yolo runs (that fortunately turned out well). 
In the end it took us only a very small number of smaller scale & shorter ablation runs to get to the strong 21B Reka Flash and 7B edge model (and also our upcoming largest core model). Finding a solid recipe with a very limited number of runs is challenging and requires changing many variables at once given the ridiculously enormous search space. In order to do this, one has to abandon the systematicity of Bigtech and rely a lot on “Yolo”, gut feeling and instinct.” We were excited to be the first podcast to interview Yi, and recommend reading our extensive show notes to follow the same papers we reference throughout the conversation. Special thanks to Terence Lee of TechInAsia for the final interview clip, who are launching their own AI newsletter called The Prompt! Full Video Podcast Show Notes * Yi on LinkedIn, Twitter, Personal * Full prep doc * Reka funding/valuation * Building frontier AI teams as GPU Poors * Yi’s Research * 2020 * Efficient Transformers: A Survey went viral! * Long Range Arena: A Benchmark for Efficient Transformers in 2020 * 2021: Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study * 2022: * UL2: Unifying Language Learning Paradigms * PaLM -> PaLM-2 * Emergent Abilities of Large Language Models vs the Mirage paper * Recitation Augmented generation * DSI++: Updating Transformer Memory with New Documents * The Efficiency Misnomer: “a model with low FLOPs may not actually be fast, given that FLOPs does not take into account information such as degree of parallelism (e.g., depth, recurrence) or hardware-related details like the cost of a memory access” * 2023: Flan-{PaLM/UL2/T5} 1.8k tasks for instruction tuning * Encoder-decoder vs Decoder only * Latent Space Discord discussion on enc-dec vs dec-only * Related convo with Yi Tay vs Yann LeCun * @teortaxes: * If 2024 papers are to be trusted: You don't need (most) attention you don't need (most) kv cache You don't need (most) FFN layers You don't need a reward model You don't need… all the stuff that still makes frontier models work, ironically * “there have been no real advance since 2019's T5 models” * The future of Open source models - relevant to a16z vs Founders Fund debate. Open source cannot compete! Timestamps * [00:00:00] Intro * [00:01:57] Yi Tay Intro * [00:03:02] Path into LLMs * [00:09:41] Google Brain: PaLM, UL2, DSI, Emergent Abilities * [00:11:54] PaLM 2 * [00:15:27] Emergent Abilities * [00:18:26] Quoc Le * [00:24:16] Marketing Research: How to Start from Zero with No Reach * [00:27:34] What's needed to be a successful AI Researcher? * [00:30:31] Reka Origin * [00:33:24] Starting Reka Infra * [00:35:04] Why not to use TPUs outside Google * [00:36:29] Chaotic vs Stable Infra * [00:38:04] Risk Sharing of Bad Nodes * [00:41:05] Checkpointing and Orchestration * [00:43:39] Reka Flash/Core/Edge * [00:46:59] Recruiting the team * [00:47:22] Noam Architecture - Swiglu, GQA, RMSnorm, ROPE * [00:52:26] Encoder-decoder vs Decoder-only * [00:55:52] LLM Trends - Llama 3 and Phi 3 Glowup * [00:57:46] LLM Trends - Benchmarks and Evals * [01:03:25] LLM Trends - Early vs Late Fusion Multimodality * [01:07:22] LLM Trends - Scaling Laws * [01:09:41] LLM Trends - Long Context vs RAG * [01:12:31] Long Context vs Finetuning * [01:14:14] If emergence is real, when does Efficiency work? 
* [01:17:41] MoEs and Upcycling * [01:20:47] The Efficiency Misnomer - Efficiency != Speed * [01:25:05] Open Source vs Closed Models * [01:28:08] Personal Productivity * [01:33:19] Singapore vs US Academic Scene * [01:37:42] Building Silicon Valley outside Silicon Valley * [01:40:29] TechInAsia Meetup Transcript [00:00:00] swyx: Thanks for watching. Bye bye. [00:00:05] AI Charlie: Welcome back, friends. It's only been a week since the World's Fair, and it was incredible gathering the community to see the latest and greatest in AI engineering. You can catch up now on the four live stream track days on the AI Engineer YouTube, and our team is busy editing the remaining workshops and five other tracks, including the surprisingly popular AI Leadership track. [00:00:28] Thank you all for your support, and stay tuned for news about the next event: the 2025 AI Engineer Summit. Last week, we did a very special deep dive with Josh and Jonathan of Imbue and Databricks Mosaic on training LLMs and setting up massive GPU clusters. And today, we're pleased to follow that up with a very special conversation with Yi Tay, formerly tech lead of PaLM 2 at Google Brain, and now chief scientist of Reka.ai. [00:00:56] Reka's largest model, Reka Core, was at launch the fifth best model in the world, and the only GPT-4-class model not trained by a big lab like OpenAI, Google, Anthropic or Meta. In fact, while Google Gemini has 950 co-authors, Reka only has 20 employees, with up to five people actually working on pretraining. [00:01:21] Swyx was excited to return to Singapore to delve into Yi, Reka, and building a new AI model lab outside of Silicon Valley. Stay tuned to the very end for a special bonus clip from Yi's recent appearance at the TechInAsia meetup for his spiciest take on why senior management is overrated and why this is the time to build up senior 10,000x individual contributors like himself. [00:01:46] Watch out and take care. [00:01:48] swyx: Welcome to Latent Space. This is a long time coming, but I'm so excited to have you here. [00:01:52] Yi Tay: Yeah, thanks for, thanks for inviting, and excited to be here, chat about a lot of stuff. [00:01:57] Yi Tay Intro [00:01:57] swyx: Yeah. So you are interesting to research and introduce. You are now chief scientist of Reka, which is a super interesting model lab, but before that you were at Google Brain, you were architecture co-lead on PaLM 2, you were inventor of UL2. [00:02:10] You're a core contributor on Flan, you're a member of the Bard core team, and you also did some work on generative retrieval. That's a very, very illustrious three-year career at Google Brain. [00:02:19] Yi Tay: Yeah, thanks, thanks, thanks, yeah. [00:02:20] swyx: And then since then, Reka, you joined in March 2023, announced a $58 million Series A in June 2023. [00:02:26] I don't know if you know, the post-money valuation, or the pre-money valuation is public. So it's, Crunchbase is, is, Oh, okay, okay. I [00:02:33] Yi Tay: did not know that yet. 50 [00:02:34] swyx: something million. So you don't even have t
It’s return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December post Databricks acquisition): Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that “outperforms GPT-4o” zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B. While Imbue, being an agents company rather than a model provider, are not releasing their models today, they are releasing almost everything else: * Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks * An entirely new code-focused reasoning benchmark * A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity * A new dataset of 450,000 human judgments about ambiguity * Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training * Our cost-aware hyperparameter optimizer, CARBS, which automatically and systematically fine-tunes all hyperparameters to derive optimum performance for models of any size As well as EXTREMELY detailed posts on the infrastructure needs, hyperparameter search, and clean versions of the sorry state of industry standard benchmarks. This means for the FIRST TIME (perhaps since Meta’s OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty gritty of training extremely large LLMs, and if you are in fact training LLMs of this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue. We are busy running the sold-out AI Engineer World’s Fair today, and so are unable to do our usual quality writeup, however, please enjoy our show notes and the excellent conversation! Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes. Video pod Timestamps * [00:00:00] Introduction and catch up with guests * [00:01:55] Databricks' text to image model release * [00:03:46] Details about the DBRX model * [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases * [00:09:18] Challenges of training foundation models and getting infrastructure to work * [00:12:03] Details of Imbue's cluster setup * [00:18:53] Process of bringing machines online and common failures * [00:22:52] Health checks and monitoring for the cluster * [00:25:06] Typical timelines and team composition for setting up a cluster * [00:27:24] Monitoring GPU utilization and performance * [00:29:39] Open source tools and libraries used * [00:32:33] Reproducibility and portability of cluster setup * [00:35:57] Infrastructure changes needed for different model architectures * [00:40:49] Imbue's focus on text-only models for coding and reasoning * [00:42:26] CARBS hyperparameter tuner and cost-aware optimization * [00:51:01] Emergence and CARBS * [00:53:18] Evaluation datasets and reproducing them with high quality * [00:58:40] Challenges of evaluating on more realistic tasks * [01:06:01] Abstract reasoning benchmarks like ARC * [01:10:13] Long context evaluation and needle-in-a-haystack tasks * [01:13:50] Function calling and tool use evaluation * [01:19:19] Imbue's future plans for coding and reasoning applications * [01:20:14] Databricks' future plans for useful applications and upcoming blog posts Transcript SWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition. 
Today, we have sort of like a two-header. Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome. JOSH [00:00:12]: Hey, glad to be here. SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT-7B. Remember the days when we trained large models and there was 7B? JONATHAN [00:00:30]: Yeah, back when reproducing LLaMA-1 7B was considered a huge accomplishment for the field. Those are the good old days. I miss that. SWYX [00:00:38]: As the things have accelerated a lot. Actually, let's do a quick catch up and Josh, you can chime on in as well. So Databricks got acquired. I talked to you at New York. JONATHAN [00:00:45]: Mosaic got acquired, although sometimes it feels like Mosaic acquired Databricks because, you know, we're having a lot of fun being here. But, you know, yeah. SWYX [00:00:52]: Yeah. I mean, you are chief scientist now of Databricks. JONATHAN [00:00:55]: Chief AI scientist. Careful with the title. As much as I would love to understand how Spark works, I'm going to have to defer that to much smarter people than me. SWYX [00:01:03]: Got it. And I don't know about like what you would highlight so far as a post-acquisition, but the most recent news is that you guys released DBRX. Is that the thing that most people should be aware of? JONATHAN [00:01:13]: Actually, that's no longer the most recent news. Honestly, the most recent news, we announced this, but it was at our Data and AI Summit last week. So it was announced among like 100,000 other things, is that we finally released our text to image model, which has been a year in the making through a collaboration directly with Shutterstock. There was a lot of work put into finding a dataset that we were comfortable with working on and trying to build a model that honestly, I felt like I could trust and that others might be able to trust to put out in the world. So that model was released last week. It's unfortunately just available via API due to the fact that the data is quite sensitive and quite valuable. It's Shutterstock's entire business in a lot of ways, but I'm still really excited that there's now a model that is trained on a dataset where the provenance of every single image is known, and it's a damn good model. So I'm really proud of the team on that. SWYX [00:01:55]: Yeah, amazing. Josh, do you have any thoughts on image model questions? JOSH [00:01:59]: That is not my area of expertise, but I was excited to see the release of it last week as well, and very happy that you guys did a nice job on the data side of everything there. So that was cool to see. SWYX [00:02:09]: I think what's unusual is like, I think Shutterstock's doing multiple deals with multiple labs. So what is the Shutterstock model? Like, I guess, is this the house model for Shutterstock? Is this Databricks' version of the Shutterstock model? Like, what is this? JONATHAN [00:02:22]: The way that I would think about it is that Shutterstock is doing an amazing business in AI across the board. Their dataset is kind of widely known to be the best stock photos dataset in the world, the most comprehensive, the biggest. When you think about like, what dataset am I going to train a multimodal model on? You call Shutterstock. And I, at least I've heard in the news, like OpenAI, Google, Meta, Apple have all called Shutterstock and made those deals. 
So a lot of models have had Shutterstock data incorporated into them. But this is the only model I know of so far where it was, you know, exclusively and specifically trained just on the vanilla Shutterstock data. There was nothing else mixed in. We didn't go and scrape the web and find other data or combined datasets or anything like that. And so this is, in some sense, the house blend. But the other piece is that it's just a dataset where the provenance of every image is known in public. Where did the data come from? It is the Shutterstock collection. That's it. You know, nothing less, nothing more. And certainly being at Databricks, if I've learned one thing, I've learned about enterprise customers and what they want out of AI. And one of the things they ask for most is just, what can you tell me about the data the model was trained on? And here, especially for text to image models, where images are just tricky subject matter, there's been a lot of kind of legal conversation about images, especially. It's nice to just have something where I can point to it and say, you know, if you want to know where the images came from, these are what they are and this is how they got there. SWYX [00:03:36]: I will talk a little bit about Databricks because it's relevant to the rest of today's episode. So Databricks, sorry, I keep misspeaking. It's DBRX. JONATHAN [00:03:46]: DBRX, actually, there's been a pronunciation update. It is now D-B-Rex. So we have decided to add a dinosaur mascot because what model doesn't like a mascot? So literally, I wish I could pull it up. There is a little plush dinosaur that we had made. It's like the world's cutest dinosaur, but it is the official mascot of D-B-Rex. And there's a little dinosaur logo that, you know, you'll probably see around a little bit more because DBRX is a mouthful, but D-B-Rex, like, you know, it's just kind of... SWYX [00:04:13]: Rolls off the tongue. I love mascots. Like every company should have a mascot. And I think Hugging Face got it right. You need an emoji mascot because that's the minimal viable image. JONATHAN [00:04:21]: I probably shouldn't talk at all about, you know, Velociraptor, but, you know, that's a, maybe that's something we can talk about later in the summer. I'll just leave it at that. SWYX [00:04:28]: Okay. That's a hint to names. I feel like your names leak a lot of alpha. So just to quickly cover the headline details, DBRX, a Mixture-of-Experts model, that's fairly big: 132 billion total parameters, so 36 billion active on any input, pre-trained on 12 trillion tokens of text and code, and did really well on evals to the point where you had to dye your hair blue. That's my high level conclusion. JONATHAN [00:04:53]: Never make a bet with your team two weeks out from model launch, even when, you know, human eval is looking quite bad. Because if you set some bar, even if it's arbitrary and you think there's no way in hell they're going to hit it,
The World’s Fair is officially sold out! Thanks for all the support and stay tuned for recaps of all the great goings-on in this very special celebration of the AI Engineer! Longtime listeners will remember the fan favorite Raza Habib, CEO of HumanLoop, on the pod: Well, he’s caught the podcasting bug and is now flipping the tables on swyx! Subscribe to High Agency wherever the finest Artificial Intelligence podcasts are sold. High Agency Pod Description In this episode, I chatted with Shawn Wang about his upcoming AI engineering conference and what an AI engineer really is. It's been a year since he penned the viral essay “Rise of the AI Engineer” and we discuss if this new role will be enduring, the makeup of the optimal AI team and trends in machine learning. Timestamps * 00:00 - Introduction and background on Shawn Wang (Swyx) * 03:45 - Reflecting on the “Rise of the AI Engineer” essay * 07:30 - Skills and characteristics of AI Engineers * 12:15 - Team composition for AI products * 16:30 - Vertical vs. horizontal AI startups * 23:00 - Advice for AI product creators and leaders * 28:15 - Tools and buying vs. building for AI products * 33:30 - Key trends in AI research and development * 41:00 - Closing thoughts and information on the AI Engineer World’s Fair Video This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Editor’s note: One of the top reasons we have hundreds of companies and thousands of AI Engineers joining the World’s Fair next week is, apart from discussing technology and being present for the big launches planned, to hire and be hired! Listeners loved our previous Elicit episode and were so glad to welcome 2 more members of Elicit back for a guest post (and bonus podcast) on how they think through hiring. Don’t miss their AI engineer job description, and template which you can use to create your own hiring plan! How to Hire AI Engineers James Brady, Head of Engineering @ Elicit (ex Spring, Square, Trigger.io, IBM) Adam Wiggins, Internal Journalist @ Elicit (Cofounder Ink & Switch and Heroku) If you’re leading a team that uses AI in your product in some way, you probably need to hire AI engineers. As defined in this article, that’s someone with conventional engineering skills in addition to knowledge of language models and prompt engineering, without being a full-fledged Machine Learning expert. But how do you hire someone with this skillset? At Elicit we’ve been applying machine learning to reasoning tools since 2018, and our technical team is a mix of ML experts and what we can now call AI engineers. This article will cover our process from job description through interviewing. (You can also flip the perspectives here and use it just as easily for how to get hired as an AI engineer!) My own journey Before getting into the brass tacks, I want to share my journey to becoming an AI engineer. Up until a few years ago, I was happily working my job as an engineering manager of a big team at a late-stage startup. Like many, I was tracking the rapid increase in AI capabilities stemming from the deep learning revolution, but it was the release of GPT-3 in 2020 which was the watershed moment. At the time, we were all blown away by how the model could string together coherent sentences on demand. (Oh how far we’ve come since then!) I’d been a professional software engineer for nearly 15 years—enough to have experienced one or two technology cycles—but I could see this was something categorically new. I found this simultaneously exciting and somewhat disconcerting. I knew I wanted to dive into this world, but it seemed like the only path was going back to school for a master’s degree in Machine Learning. I started talking with my boss about options for taking a sabbatical or doing a part-time distance learning degree. In 2021, I instead decided to launch a startup focused on productizing new research ideas on ML interpretability. It was through that process that I reached out to Andreas—a leading ML researcher and founder of Elicit—to see if he would be an advisor. Over the next few months, I learned more about Elicit: that they were trying to apply these fascinating technologies to the real-world problems of science, and with a business model that aligned it with safety goals. I realized that I was way more excited about Elicit than I was about my own startup ideas, and wrote about my motivations at the time. Three years later, it’s clear this was a seismic shift in my career on the scale of when I chose to leave my comfy engineering job at IBM to go through the Y Combinator program back in 2008. Working with this new breed of technology has been more intellectually stimulating, challenging, and rewarding than I could have imagined. Deep ML expertise not required It’s important to note that AI engineers are not ML experts, nor is that their best contribution to a tech team. 
In our article Living documents as an AI UX pattern, we wrote: It’s easy to think that AI advancements are all about training and applying new models, and certainly this is a huge part of our work in the ML team at Elicit. But those of us working in the UX part of the team believe that we have a big contribution to make in how AI is applied to end-user problems. We think of LLMs as a new medium to work with, one that we’ve barely begun to grasp the contours of. New computing mediums like GUIs in the 1980s, web/cloud in the 90s and 2000s, and multitouch smartphones in the 2000s/2010s opened a whole new era of engineering and design practices. So too will LLMs open new frontiers for our work in the coming decade. To compare to the early era of mobile development: great iOS developers didn’t require a detailed understanding of the physics of capacitive touchscreens. But they did need to know the capabilities and limitations of a multi-touch screen, the constrained CPU and storage available, the context in which the user is using it (very different from a webpage or desktop computer), etc. In the same way, an AI engineer needs to work with LLMs as a medium that is fundamentally different from other compute mediums. That means an interest in the ML side of things, whether through their own self-study, tinkering with prompts and model fine-tuning, or following along in #llm-paper-club. But this understanding is so that they can work with the medium effectively versus, say, spending their days training new models. Language models as a chaotic medium So if we’re not expecting deep ML expertise from AI engineers, what are we expecting? This brings us to what makes LLMs different. We’ll assume already that our ideal candidate is already inspired by, and full of ideas about, all the new capabilities AI can bring to software products. But the flip side is all the things that make this new medium difficult to work with. LLM calls are annoying due to high latency (measured in tens of seconds sometimes, rather than milliseconds), extreme variance on latency, high error rates even under normal operation. Not to mention getting extremely different answers to the same prompt provided to the same model on two subsequent calls! The net effect is that an AI engineer, even working at the application development level, needs to have a skillset comparable to distributed systems engineering. Handling errors, retries, asynchronous calls, streaming responses, parallelizing and recombining model calls, the halting problem, and fallbacks are just some of the day-in-the-life of an AI engineer. Chaos engineering gets new life in the era of AI. Skills and qualities in candidates Let’s put together what we don’t need (deep ML expertise) with what we do (work with capabilities and limitations of the medium). Thus we start to see what Elicit looks for in AI engineers: * Conventional software engineering skills. Especially back-end engineering on complex, data-intensive applications. * Professional, real-world experience with applications at scale. * Deep, hands-on experience across a few back-end web frameworks. * Light devops and an understanding of infrastructure best practices. * Queues, message buses, event-driven and serverless architectures, … there’s no single “correct” approach, but having a deep toolbox to draw from is very important. * A genuine curiosity and enthusiasm for the capabilities of language models. 
* One or more serious projects (side projects are fine) of using them in interesting ways on a unique domain. * …ideally with some level of factored cognition, e.g. breaking the problem down into chunks, making thoughtful decisions about which things to push to the language model and which stay within the realm of conventional heuristics and compute capabilities. * Personal studying with resources like Elicit’s ML reading list. Part of the role is collaborating with the ML engineers and researchers on our team. To do so, the candidate needs to “speak their language” somewhat, just as a mobile engineer needs some familiarity with backends in order to collaborate effectively on API creation with backend engineers. * An understanding of the challenges that come along with working with large models (high latency, variance, etc.) leading to a defensive, fault-first mindset. * Careful and principled handling of error cases, asynchronous code (and ability to reason about and debug it), streaming data, caching, logging and analytics for understanding behavior in production. * This is a similar mindset that one can develop working on conventional apps which are complex, data-intensive, or large-scale apps. The difference is that an AI engineer will need this mindset even when working on relatively small scales! On net, a great AI engineer will combine two seemingly contrasting perspectives: knowledge of, and a sense of wonder for, the capabilities of modern ML models; but also the understanding that this is a difficult and imperfect foundation, and the willingness to build resilient and performant systems on top of it. Here’s the resulting AI engineer job description for Elicit. And here’s a template that you can borrow from for writing your own JD. Hiring process Once you know what you’re looking for in an AI engineer, the process is not too different from other technical roles. Here’s how we do it, broken down into two stages: sourcing and interviewing. Sourcing We’re primarily looking for people with (1) a familiarity with and interest in ML, and (2) proven experience building complex systems using web technologies. The former is important for culture fit and as an indication that the candidate will be able to do some light prompt engineering as part of their role. The latter is important because language model APIs are built on top of web standards and—as noted above—aren’t always the easiest tools to work with. Only a handful of people have built complex ML-first apps, but fortunately the two qualities listed above are relatively independent. Perhaps they’ve proven (2) through their professional experience and have some side projects which demonstrate (1). Talking of side projects, evidence of creative and original prototypes is a huge plus as we’re evaluating candidates. We’ve barely scratched the surface of what’s possible to build with LLMs—even the current generation of mo
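To make Elicit’s “defensive, fault-first mindset” concrete, here is a minimal sketch of the kind of plumbing an AI engineer ends up writing constantly: retries with exponential backoff plus a fallback model. Note that `call_model` is a hypothetical client function (model name + prompt → text) standing in for whatever LLM API you actually use, and the model names are placeholders:

```python
# Retry-with-backoff and model fallback: the distributed-systems reflexes
# described above, applied to flaky, high-latency LLM calls.
import time

def robust_completion(call_model, prompt: str,
                      models=("primary-model", "fallback-model"),
                      max_retries: int = 3) -> str:
    last_err: Exception | None = None
    for model in models:                      # fall back across models
        for attempt in range(max_retries):   # retry within each model
            try:
                return call_model(model, prompt)
            except Exception as err:          # in real code, catch specific errors
                last_err = err
                time.sleep(2 ** attempt)      # 1s, 2s, 4s backoff
    raise RuntimeError(f"all models failed: {last_err}")
```

In production you would narrow the exception types, add jitter, and log each failure, but the shape — treat errors and timeouts as the normal case — is the point.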
In April 2023 we released an episode named “Mapping the future of *truly* open source models” to talk about Dolly, the first open, commercial LLM. Mike was leading the OSS models team at Databricks at the time. Today, Mike is back on the podcast to give us the “one year later” update on the evolution of large language models and how he’s been using them to build Brightwave, an AI research assistant for investment professionals. Today they are announcing a $6M seed round (led by Alessio and Decibel!), and sharing some of the learnings from serving customers with >$120B of assets under management in production in the last 4 months since launch. Losing faith in long context windows In our recent “Llama3 1M context window” episode we talked about the amazing progress we have made on context window size, but it’s good to remember that Dolly’s original context size was 1,024 tokens, and this was only 14 months ago. But while input context length has increased, models are still not able to generate very long answers. His empirical intuition (which matches ours while building smol-podcaster) is that most commercial LLMs, as well as Llama, tend to generate fairly short responses most of the time. While Needle in a Haystack tests will pass with flying colors at most context sizes, the granularity of a summary decreases as the context increases, because the model tries to fit the answer into the same output token range rather than returning anything close to the 4,096 max_output, for example. Recently Rob Mulla from Dreadnode highlighted how LMSys Arena results prefer longer responses by a large margin, so both LLMs and humans have a well-documented length bias which doesn’t necessarily track the quality of the answer: The way Mike and team solved this is by breaking down the task into multiple subtasks, and then merging them back together. For example, have a book summarized chapter by chapter to preserve more details, and then put those summaries together (a minimal sketch of this pattern follows at the end of this write-up). In Brightwave’s case, it’s creating multiple subsystems that accomplish different tasks on a large corpus of text separately, and then bringing them all together in a report: for example, understanding the intent of the question, extracting relations between companies, figuring out if sentiment is positive / negative, etc. Mike’s question is whether or not we’ll be able to imbue better synthesis capabilities in the models: can you have synthesis-oriented demonstrations at training time rather than single-token prediction? “LLMs as Judges” Strategies In our David Luan episode he mentioned they don’t use any benchmarks for their models, because the benchmarks don’t reflect their customer needs. Brightwave shared some tips on leveraging LLMs as Judges: * Human vs LLM reviews: while they work with human annotators to create high-quality datasets, that data isn’t just used to fine-tune models but also as a reference basis for future LLM reviews. Having a set of trusted data to use for calibration helps you trust the LLM judgment even more. * Ensemble consistency checking: rather than using one LLM as judge for one output, you use different LLMs to generate a result for the same task, and then use another LLM to highlight where those generations differ. Do the two outputs differ meaningfully? Do they have different beliefs about the implications of something? If there are a lot of discrepancies between generations coming from different models, you then do additional passes to try and resolve them. 
* Entailment verification: for each unique insight that they generate, they take the output and separately ask LLMs to verify factuality of information based on the original sources. In the actual product, user can then highlight any piece of text and ask it to 1) “Tell Me More” 2) “Show Sources”. Since there’s no way to guarantee factuality of 100% of outputs, and humans have good intuition for things that look out of the ordinary, giving the user access to the review tool helps them build trust in it. It’s all about the data During his time at Databricks, they had created dolly-15k, a dataset of instruction-following records written by thousands of their employees. Since then, no other company has replicated that type of effort even though the data wars are in full effect. It’s been clear in the last year that the half-life of a model is much shorter than the half-life of a dataset. The Pile by Eleuther (see Datasets 101) came out in 2020 and is still widely used; if you had trained an LLM in 2020, you would have definitely replaced it by now as they have gotten better and cheaper. On the age old “RAG v Fine-Tuning” question, Mike shared a great example that we’ll just quote: I think of language models kind of like a stem cell, and then under fine tuning, they differentiate into different kinds of specific cells. I don't think that unbounded agentic behaviors are useful, and that instead, a useful LLM system is more like a finite state machine where the behavior of the system is occupying one of many different behavioral regimes and making decisions about what state should I occupy next in order to satisfy the goal. As you think about the graph of those states that your system is moving through, once you develop conviction that one behavior is useful and repeatable and worthwhile to differentiate down into a specific kind of subsystem, that's where like fine tuning and specifically generating the training data, like having human annotators produce a corpus that is useful enough to get a specific class of behaviors, that's kind of how we use fine tuning rather than trying to imbue net new information into these systems. There are a lot of other nuggets in the episode around knowledge graphs extraction, private vs public data, user intent extraction, etc, but we only have so much room in the writeup so go listen! And if you’re interested in working on these problems, Brightwave is hiring 👀 Watch on YouTube We like Mike. The camera likes Mike. Our audience loooves Mike. Show Notes * Brightwave * Mike Conover * Mike on Latent Space #1 * Nature paper on S&P 500 talent movement * Dolly announcement * Dolly 15K dataset * Bard blog post on double-checking generation * RLHF 201 episode * David Luan Episode * Red Pajama * Snorkel * Renaissance Timestamps * [00:00:00] Introductions * [00:02:40] Social media's polarization influence on LLMs * [00:04:09] What's Brightwave? * [00:05:13] How to hire for a vertical AI startup * [00:09:34] How $20B+ hedge funds use Brightwave * [00:11:23] Evolution of context sizes in language models * [00:14:36] Summarizing vs Ideating with AI * [00:18:26] Collecting feedback in a field with no truth * [00:20:49] Evaluation strategies and the importance of custom datasets * [00:23:43] Should more companies make employees label data? * [00:25:32] Retrieval for highly temporal and hierarchical data * [00:30:05] Context-aware prompting for private vs. 
public data * [00:32:01] Knowledge graph extraction and structured information retrieval * [00:33:49] Fine-tuning vs RAG * [00:36:16] Anthropomorphizing language models * [00:38:20] Why Brightwave doesn't do spreadsheets * [00:42:24] Will there be fully autonomous hedge funds? * [00:47:58] State of open source AI * [00:53:53] Hiring and team expansion at Brightwave Transcript Alessio [00:00:01]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I have no co-host today. Swyx is in Vienna at ICLR having fun in Europe, and we're in the brand new studio. As you might see, if you're on YouTube, there's still no sound panels on the wall. Mike tried really hard to put them up, but the glue is a little too old for that. So if you hear any echo or anything like that, sorry, but we're doing the best that we can. And today we have our first repeat guest, Mike Conover. Welcome Mike, who's now the founder of Brightwave, not Databricks anymore. Mike [00:00:40]: That's right. Yeah. Pleased to be back. Alessio [00:00:42]: Our last episode was one of the fan favorites, and I think this will be just as good. So for those that have not listened to the first episode, which might be many because the podcast has grown a lot since then, thanks to people like Mike who have interesting conversations on it. You spent a bunch of years doing ML at some of the best companies on the internet, things like Workday, you know, Skipflag, LinkedIn, most recently at Databricks where you were leading the open source large language models team working on Dolly. And now you're doing Brightwave, which is in the financial services space. But this is not something new, I think when you and I first talked about Brightwave, I was like, why is this guy doing a financial services company? And then you look at your background and you were doing papers on The Nature Magazine about LinkedIn data predicting S&P 500 stock movement, like many, many years ago. So what are some of the tying elements in your background that maybe people are overlooking that brought you to do this? Mike [00:01:36]: Yeah, sure. Yeah. So my PhD research was funded by DARPA and we had access to the Twitter data set early in the natural history of the availability of that data set, and it was focused on the large scale structure of propaganda and misinformation campaigns. And LinkedIn, we had planet scale descriptions of the structure of the global economy. And so primarily my work was homepage news feed relevant. So when you go to LinkedIn.com, you'd see updates from one of our machine learning models. But additionally, I was a research liaison as part of the economic graph challenge and had this Nature Communications paper where we demonstrated that 500 million jobs transitions can be hierarchically clustered as a network of labor flows and could predict next quarter S&P 500 market gap changes. And at Workday, I was director of financials machine learning. You start to see how organizat
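To make the subtask-decomposition idea from this write-up concrete, here is a minimal sketch of the chapter-by-chapter pattern Mike describes — summarize each chunk separately to preserve detail, then merge the partial summaries. Note that `summarize` is a hypothetical wrapper around a single LLM call, not any specific API:

```python
# Map-reduce summarization: each call only has to produce a short output,
# sidestepping the short-generation bias discussed above.
def summarize_long_document(summarize, chapters: list[str]) -> str:
    partials = [summarize(ch) for ch in chapters]   # map: one call per chapter
    merged = "\n\n".join(partials)
    return summarize(                                # reduce: synthesize the report
        "Combine these chapter summaries into one coherent summary:\n" + merged
    )
```

Brightwave’s version is the same shape with specialized subsystems (intent, entity relations, sentiment) in place of the per-chapter map step, merged into a final report.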
Our second wave of speakers for AI Engineer World’s Fair was announced! The conference sold out of Platinum/Gold/Silver sponsors and Early Bird tickets! See our Microsoft episode for more info and buy now with code LATENTSPACE. This episode is straightforwardly a part 2 to our ICLR 2024 Part 1 episode, so without further ado, we’ll just get right on with it! Timestamps [00:03:43] Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger * [00:07:44] WebArena * [00:18:45] Sotopia * [00:24:00] Performance Improving Code Edits * [00:29:39] OpenDevin * [00:47:40] Industry and Academia [01:05:29] Section B: Benchmarks * [01:05:52] SWEBench * [01:17:05] SWEBench/SWEAgent Interview * [01:27:40] Dataset Contamination Detection * [01:39:20] GAIA Benchmark * [01:49:18] Moritz Hardt - Science of Benchmarks [02:36:32] Section C: Reasoning and Post-Training * [02:37:41] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection * [02:51:00] Let’s Verify Step By Step * [02:57:04] Noam Brown * [03:07:43] Lilian Weng - Towards Safe AGI * [03:36:56] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis * [03:48:43] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework [04:00:51] Bonus: Notable Related Papers on LLM Capabilities Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger * Guests * Graham Neubig * Aman Sanger - Previous guest and NeurIPS friend of the pod! * WebArena * Sotopia (spotlight paper, website) * Learning Performance-Improving Code Edits * OpenDevin * Junyang Opendevin * Morph Labs, Jesse Han * SWE-Bench * SWE-Agent * Aman tweet on swebench * LiteLLM * Livecodebench * the role of code in reasoning * Language Models of Code are Few-Shot Commonsense Learners * Industry vs academia * the matryoshka embeddings incident * other directions * Unlimiformer Section A timestamps * [00:00:00] Introduction to Guests and the Impromptu Nature of the Podcast * [00:00:45] Graham's Experience in Japan and Transition into Teaching NLP * [00:01:25] Discussion on What Constitutes a Good Experience for Students in NLP Courses * [00:02:22] The Relevance and Teaching of Older NLP Techniques Like Ngram Language Models * [00:03:38] Speculative Decoding and the Comeback of Ngram Models * [00:04:16] Introduction to WebArena and Sotopia Projects * [00:05:19] Deep Dive into the WebArena Project and Benchmarking * [00:08:17] Performance Improvements in WebArena Using GPT-4 * [00:09:39] Human Performance on WebArena Tasks and Challenges in Evaluation * [00:11:04] Follow-up Work from WebArena and Focus on Web Browsing as a Benchmark * [00:12:11] Direct Interaction vs. 
Using APIs in Web-Based Tasks * [00:13:29] Challenges in Base Models for WebArena and the Potential of Visual Models * [00:15:33] Introduction to Sotopia and Exploring Social Interactions with Language Models * [00:16:29] Different Types of Social Situations Modeled in Sotopia * [00:17:34] Evaluation of Language Models in Social Simulations * [00:20:41] Introduction to Performance-Improving Code Edits Project * [00:26:28] Discussion on Devin and the Future of Coding Agents * [00:32:01] Planning in Coding Agents and the Development of OpenDevin * [00:38:34] The Changing Role of Academia in the Context of Large Language Models * [00:44:44] The Changing Nature of Industry and Academia Collaboration * [00:54:07] Update on NLP Course Syllabus and Teaching about Large Language Models * [01:00:40] Call to Action: Contributions to OpenDevin and Open Source AI Projects * [01:01:56] Hiring at Cursor for Roles in Code Generation and Assistive Coding * [01:02:12] Promotion of the AI Engineer Conference Section B: Benchmarks * Carlos Jimenez & John Yang (Princeton) et al: SWE-bench: Can Language Models Resolve Real-world Github Issues? (ICLR Oral, Paper, website) * “We introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.” * Yonatan Oren et al (Stanford): Proving Test Set Contamination in Black-Box Language Models (ICLR Oral, paper, aman tweet on swebench contamination) * “We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. (A minimal sketch of this test closes out this post.) 
* We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus.” * Outstanding Paper mention: “A simple yet elegant method to test whether a supervised-learning dataset has been included in LLM training.” (a simplified code sketch of the idea appears at the end of these notes) * Thomas Scialom (Meta AI-FAIR w/ Yann LeCun): GAIA: A Benchmark for General AI Assistants (paper) * “We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. * GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. * GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer.” * Moritz Hardt (Max Planck Institute): The emerging science of benchmarks (ICLR stream) * “Benchmarks are the keystone that hold the machine learning community together. Growing as a research paradigm since the 1980s, there’s much we’ve done with them, but little we know about them. In this talk, I will trace the rudiments of an emerging science of benchmarks through selected empirical and theoretical observations. Specifically, we’ll discuss the role of annotator errors, external validity of model rankings, and the promise of multi-task benchmarks. The results in each case challenge conventional wisdom and underscore the benefits of developing a science of benchmarks.” Section C: Reasoning and Post-Training * Akari Asai (UW) et al: Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (ICLR oral, website) * (Bad RAG implementations) indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. * We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. * Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. * Self-RAG (7B and 13B parameters) outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning, and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models. * Hunter Lightman (OpenAI): Let’s Verify Step By Step (paper) * “Even state-of-the-art models still regularly produce logical mistakes.
To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. * We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. * To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.”
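Since the shuffled-likelihood idea above is simple enough to sketch, here is a simplified permutation-test version. `model_logprob` is a hypothetical scoring function you would implement with your LM of choice (e.g., summing token log-probabilities over a string); the actual paper uses a sharper sharded rank test, so treat this as an illustration of the exchangeability intuition, not a reimplementation:

```python
import random

def contamination_p_value(examples, model_logprob, n_shuffles=100, seed=0):
    """Permutation test: a tiny p-value means the canonical ordering of the
    benchmark is suspiciously more likely than shuffled orderings, which is
    evidence the model memorized the test set during pretraining."""
    rng = random.Random(seed)
    canonical = model_logprob("\n\n".join(examples))
    at_least_as_likely = 0
    for _ in range(n_shuffles):
        perm = list(examples)
        rng.shuffle(perm)
        if model_logprob("\n\n".join(perm)) >= canonical:
            at_least_as_likely += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (1 + at_least_as_likely) / (1 + n_shuffles)
```

For an uncontaminated model, orderings are exchangeable and the canonical order should land somewhere in the middle of the shuffled scores, giving a large p-value.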
AI Engineer World’s Fair in SF! Prices go up soon. Note that there are 4 tracks per day and dozens of workshops/expo sessions; the livestream will air the most stacked speaker list/AI expo floor of 2024. Apply for free/discounted Diversity Program and Scholarship tickets here. We hope to make this the definitive technical conference for ALL AI engineers. Exactly a year ago, we declared the Beginning of Context=Infinity when Mosaic made their breakthrough training an 84k token context MPT-7B. A Brief History of Long Context Of course right when we released that episode, Anthropic fired the starting gun proper with the first 100k context window model from a frontier lab, spawning smol-developer and other explorations. In the last 6 months, the fight (and context lengths) has intensified another order of magnitude, kicking off the "Context Extension Campaigns" chapter of the Four Wars: * In October 2023, Claude's 100,000 token window was still SOTA (we still use it for Latent Space’s show notes to this day). * On November 6th, OpenAI launched GPT-4 Turbo with 128k context. * On November 21st, Anthropic fired back extending Claude 2.1 to 200k tokens. * Feb 15 (the day everyone launched everything) was Gemini's turn, announcing the first LLM with a 1 million token context window. * In May 2024 at Google I/O, Gemini 1.5 Pro announced a 2m token context window. In parallel, open source/academia had to fight its own battle to keep up with the industrial cutting edge. Nous Research famously turned a reddit comment into YaRN, extending Llama 2 models to 128k context. So when Llama 3 dropped, the community was ready, and just weeks later, we had Llama3 with 4M+ context! A year ago we didn’t really have an industry standard way of measuring context utilization either: it’s all well and good to technically make an LLM generate non-garbage text at 1m tokens, but can you prove that the LLM actually retrieves and attends to information inside that long context? Greg Kamradt popularized the Needle In A Haystack chart which is now a necessary (if insufficient) benchmark — and it turns out we’ve solved that too in open source: Today's guest, Mark Huang, is the co-founder of Gradient, where they are building a full stack AI platform to power enterprise workflows and automations. They are also the team behind the first 1M+ and 4M+ context window finetunes of Llama 3. Long Context Algorithms: RoPE, ALiBi, and Ring Attention Positional encodings allow the model to understand the relative position of tokens in the input sequence, present in what (upcoming guest!) Yi Tay affectionately calls the OG “Noam architecture”. But if we want to increase a model’s context length, these encodings need to gracefully extrapolate to longer sequences. ALiBi, used in models like MPT (see our "Context=Infinity" episode with the MPT leads, Jonathan Frankle and Abhinav), was one of the early approaches to this space. It lets the context window stretch as it grows, using a linearly decreasing penalty between attention weights of different positions; the further apart two tokens are, the higher the penalty (see the sketch below). Of course, this isn’t going to work for use cases that actually require global attention across a long context. In more recent architectures and finetunes, RoPE (Rotary Position Embedding) encoding is more commonly used and is also what Llama3 was based on. RoPE uses a rotational matrix to encode positions, which empirically performs better for longer sequences.
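To make the ALiBi penalty concrete, here is a minimal PyTorch sketch. The geometric slope schedule follows the ALiBi paper; everything else is illustrative rather than any production model's code:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalties added to attention logits (ALiBi)."""
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = how many tokens key j sits behind query i (0 on the diagonal).
    distance = (pos[:, None] - pos[None, :]).clamp(min=0).float()
    return -slopes[:, None, None] * distance  # shape: (n_heads, seq_len, seq_len)

# Usage inside attention (added before the causal mask and softmax):
# logits = q @ k.transpose(-2, -1) / head_dim**0.5 + alibi_bias(n_heads, seq_len)
```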
The main innovation from Gradient was to focus on tuning the theta hyperparameter that governs the frequency of the rotational encoding. Audio note: If you want the details, jump to 15:55 in the podcast (or scroll down to the transcript!) By carefully increasing theta as context length grew, they were able to scale Llama3 up to 1 million tokens and potentially beyond (a code sketch of this theta knob follows at the end of this section). Once you've scaled positional embeddings, there's still the issue of attention's quadratic complexity, and how longer and longer sequences impact model speed and scaling. Getting to a 1-4M context window requires a fairly large amount of compute, so efficiency matters. Ring Attention was the other "one small trick that GPU clouds hate" that improves GPU utilization by allowing parallel computation and communication between GPUs. Gradient started from the EasyContext library as an implementation of Ring Attention in PyTorch, since the original one was in JAX. Long Context Data: Curriculum Learning and Progressive Extension The use of curriculum learning when extending context was another new approach; rather than training Llama3 on the full 1 million token context from the start, they progressively increased the sequence length over the course of training. Intuitively, it allows the model to first learn to utilize shorter contexts before tackling the full length, but it only works if data gets more and more "tricky" in long-context situations. For the generic pre-training corpus they used SlimPajama as a base, and concatenated texts to reach the target length, while monitoring for diversity in the data. Datasets that only required attending to the last few tokens, for instance, would fail to teach long-range reasoning. To fix that, they used synthetic data (another one of our Four Wars of AI!) with GPT-4 to augment their datasets by prompting it to expand on information or rephrase excerpts. Another paper we previously mentioned in this space is "Rephrasing The Web". Long Context Benchmarking: Beyond Needles Long context is cool, but does it work? Greg’s now-famous "needle in a haystack" (NIAH) test, which measures a model's ability to extract a piece of information embedded in a long context, is a clean standard that everyone uses to start, but it is a little simplistic and the community has since created many options to extend it: * RULER: Outside of various NIAH tests (single value, multiple values, etc) it also tests for things like "most frequent words" and "variable tracking", which is very helpful especially in coding use cases. * LooGLE: Focuses on three main areas: scientific papers, Wikipedia articles, movie and TV scripts. "Timeline reorder" is an interesting challenge in their benchmark, which asks the model to create a timeline out of events that happened out of order in the text. * Infinite Bench: First created in November 2023, most tasks have average input lengths in the 100-200k token range across retrieval, Q&A, and code debugging. * ZeroSCROLLS: this comes with a public leaderboard where you can see model performance, as well as tasks that you can browse to get an idea. The 4M context size seemed to be the limit where things started to fall apart as far as performance goes, which is quite impressive!
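And to make the theta knob concrete, here is a minimal sketch of how RoPE's rotation angles derive from it. Raising the base theta slows every rotation, so the same channel pairs can still distinguish positions across much longer spans; the function names here are ours, not Gradient's actual training code:

```python
import torch

def rope_angles(head_dim: int, seq_len: int, theta: float = 500_000.0):
    """Per-position rotation angles for RoPE. Llama 3 ships with theta=500k;
    context-extension finetunes raise it further before continuing training
    on progressively longer sequences."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    return angles.cos(), angles.sin()  # each: (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive (even, odd) channel pairs of a query or key tensor.
    x: (..., seq_len, head_dim); cos/sin come from rope_angles for the same seq_len."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

(Real Llama implementations split channels into halves rather than interleaving pairs, but the math is equivalent up to a permutation.)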
Show Notes * Mark Huang * Gradient * Chris Chang * HuggingFace Hub with Llama3 finetunes * Mad Men * Crusoe * Greg Kamradt's Needle in a Haystack * Chameleon paper * Charles Goddard (Mentioned in context with model merging) * Matei Zaharia * Phil Wang (lucidrains) * Wing Lian * Zhang Peiyuan * Yi * Scaling Laws of RoPE-based Extrapolation * ALiBi * YaRN * Ring Attention * Easy Context * StrongCompute * LoRA * RULER: What's the Real Context Size of Your Long-Context Language Models? * LooGLE: Can Long-Context Language Models Understand Long Contexts? * Infinite Bench * BAMBOO * ZeroSCROLLS: Zero-Shot CompaRison Over Long Language Sequences * DeepSeek paper * Multi-head Latent Attention Chapters * [00:00:01] Introductions * [00:01:28] Founding story of Gradient and its mission * [00:03:50] "Minimum viable agents" * [00:07:37] Differentiating ML and AI, focusing on out-of-domain generalization * [00:08:19] Extending Llama3 to 1M tokens * [00:11:41] Technical challenges with long context sequences * [00:14:30] Data quality and the importance of diverse datasets * [00:16:07] What's a theta value? * [00:18:27] RoPE vs Ring Attention vs ALiBi vs YaRN * [00:20:23] Why RingAttention matters * [00:22:47] How to refine datasets for context extension * [00:27:28] Multi-stage training data and avoiding overfitting to recent data * [00:28:10] The potential of using synthetic data in training * [00:31:21] Applying LoRA adapters to extend model capabilities * [00:34:45] Benchmarking long context models and evaluating their performance * [00:38:38] Pushing to 4M context and output quality degradation * [00:40:49] What do you need this context for? * [00:42:54] Impact of long context in chat vs Docs Summarization * [00:45:35] Future directions for long context models and multimodality * [00:48:01] How do you know what research matters? * [00:50:31] Routine for staying updated with AI research and industry news * [00:52:39] Deciding which AI developments to invest time in * [00:56:08] Request for collaboration and data set construction for long context Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:14]: Hey, and today we're in the remote studio with Mark Huang from Gradient. Welcome Mark. Mark [00:00:19]: Hey, glad to be here. It's really a great experience to be able to talk with you all. I know your podcast is really, really interesting and I always am listening to it every time you guys have a release. Alessio [00:00:31]: He's not a paid actor. He said that out of his own will. Swyx [00:00:34]: We'll give you the check later. So you're unusual in the sense that you and I go back to college. I don't exactly remember where we overlapped, but you know, we both went to Wharton. We went into the sort of quantitative developer realm. Mark [00:00:46]: Yeah, exactly. Kind of crazy, right? So it all goes full circle. I was a quant for quite a few years and then made it out into Silicon Valley and now we intersect again when it kind of feels like more or less the same, right? Like the AI wars, the trading wars back in the day too, to a certain extent and the gr
Speakers for AI Engineer World’s Fair have been announced! See our Microsoft episode for more info and buy now with code LATENTSPACE — we’ve been studying the best ML research conferences so we can make the best AI industry conf! Note that this year there are 4 main tracks per day and dozens of workshops/expo sessions; the free livestream will air much less than half of the content this time. Apply for free/discounted Diversity Program and Scholarship tickets here. We hope to make this the definitive technical conference for ALL AI engineers. UPDATE: This is a 2 part episode - see Part 2 here. ICLR 2024 took place from May 6-11 in Vienna, Austria. Just like we did for our extremely popular NeurIPS 2023 coverage, we decided to pay the $900 ticket (thanks to all of you paying supporters!) and brave the 18 hour flight and 5 day grind to go on behalf of all of you. We now present the results of that work! This ICLR was the biggest one by far, with a marked change in the excitement trajectory for the conference: Of the 2260 accepted papers (31% acceptance rate), within the subset relevant to our shortlist of AI Engineering Topics we found many, many LLM reasoning and agent-related papers, which we will cover in the next episode. We will spend this episode with 14 papers covering other relevant ICLR topics, as below. As we did last year, we’ll start with the Best Paper Awards. Unlike last year, we now group our paper selections by subjective topic area, and mix in both Outstanding Paper talks as well as editorially selected poster sessions. Where we were able to do a poster session interview, scroll to the relevant show notes for images of the poster and our discussion. To cap things off, Chris Ré’s spot from last year now goes to Sasha Rush for the obligatory last word on the development and applications of State Space Models. We had a blast at ICLR 2024 and you can bet that we’ll be back in 2025 🇸🇬. Timestamps and Overview of Papers [00:02:49] Section A: ImageGen, Compression, Adversarial Attacks * [00:02:49] VAEs * [00:32:36] Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models * [00:37:25] The Hidden Language Of Diffusion Models * [00:48:40] Ilya on Compression * [01:01:45] Christian Szegedy on Compression * [01:07:34] Intriguing properties of neural networks [01:26:07] Section B: Vision Learning and Weak Supervision * [01:26:45] Vision Transformers Need Registers * [01:38:27] Think before you speak: Training Language Models With Pause Tokens * [01:47:06] Towards a statistical theory of data selection under weak supervision * [02:00:32] Is ImageNet worth 1 video? [02:06:32] Section C: Extending Transformers and Attention * [02:06:49] LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models * [02:15:12] YaRN: Efficient Context Window Extension of Large Language Models * [02:32:02] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs * [02:44:57] ZeRO++: Extremely Efficient Collective Communication for Giant Model Training [02:54:26] Section D: State Space Models vs Transformers * [03:31:15] Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors * [03:37:08] End of Part 1 A: ImageGen, Compression, Adversarial Attacks * Durk Kingma (OpenAI/Google DeepMind) & Max Welling: Auto-Encoding Variational Bayes (Full ICLR talk) * Preliminary resources: Understanding VAEs, CodeEmporium, Arxiv Insights * Inaugural ICLR Test of Time Award!
“Probabilistic modeling is one of the most fundamental ways in which we reason about the world. This paper spearheaded the integration of deep learning with scalable probabilistic inference (amortized mean-field variational inference via a so-called reparameterization trick), giving rise to the Variational Autoencoder (VAE).” * Pablo Pernías (Stability) et al: Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models (ICLR oral, poster) * Hila Chefer et al (Google Research): Hidden Language Of Diffusion Models (poster) * See also: Google Lumiere, Attend and Excite * Christian Szegedy (X.ai): Intriguing properties of neural networks (Full ICLR talk) * Ilya Sutskever: An Observation on Generalization * on Language Modeling is Compression * “Stating The Obvious” criticism * Really good compression amounts to intelligence * Lexinvariant Language Models * Inaugural Test of Time Award runner up: “With the rising popularity of deep neural networks in real applications, it is important to understand when and how neural networks might behave in undesirable ways. This paper highlighted the issue that neural networks can be vulnerable to small almost imperceptible variations to the input. This idea helped spawn the area of adversarial attacks (trying to fool a neural network) as well as adversarial defense (training a neural network to not be fooled).” * with Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus B: Vision Learning and Weak Supervision * Timothée Darcet (Meta) et al: Vision Transformers Need Registers (ICLR oral, Paper) * ICLR Outstanding Paper Award: “This paper identifies artifacts in feature maps of vision transformer networks, characterized by high-norm tokens in low-informative background areas. The authors provide key hypotheses for why this is happening and provide a simple yet elegant solution to address these artifacts using additional register tokens, enhancing model performance on various tasks. The insights gained from this work can also impact other application areas. The paper is very well-written and provides a great example of conducting research – identifying an issue, understanding why it is happening, and then providing a solution.” * HN discussion: “According to the paper, the "registers" are additional learnable tokens that are appended to the input sequence of a Vision Transformer model during training. They are added after the patch embedding layer, with a learnable value, similar to the [CLS] token and then at the end of the Vision Transformer, the register tokens are discarded, and only the [CLS] token and patch tokens are used as image representations. The register tokens provide a place for the model to store, process and retrieve global information during the forward pass, without repurposing patch tokens for this role. Adding register tokens removes the artifacts and high-norm "outlier" tokens that otherwise appear in the feature maps of trained Vision Transformer models. Using register tokens leads to smoother feature maps, improved performance on dense prediction tasks, and enables better unsupervised object discovery compared to the same models trained without the additional register tokens. This is a neat result. For just a 2% increase in inference cost, you can significantly improve ViT model performance.
Close to a free lunch.” * Sachin Goyal (Google) et al: Think before you speak: Training Language Models With Pause Tokens (OpenReview) * We operationalize this idea by performing training and inference on language models with a (learnable) pause token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer (a toy inference sketch follows at the end of this section). We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. * Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of 18% EM score on the QA task of SQuAD, 8% on CommonSenseQA and 1% accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm. * Pulkit Tandon (Granica) et al: Towards a statistical theory of data selection under weak supervision (ICLR Oral, Poster, Paper) * Honorable Mention: “The paper establishes statistical foundations for data subset selection and identifies the shortcomings of popular data selection methods.” * Shashank Venkataramanan (Inria) et al: Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video (ICLR Oral, paper) * First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. * Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method called DoRA leads to attention maps that DiscOver and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks. * Honorable Mention: “The paper proposes a novel path to self-supervised image pre-training, by learning from continuous videos. The paper contributes both new types of data and a method to learn from novel data.” C: Extending Transformers and Attention * Yukang Chen (CUHK) et al: LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (ICLR Oral, Poster) * We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models, with limited computation cost.
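Here is the promised toy sketch of pause-token inference. It assumes a hypothetical HF-style causal LM whose vocabulary has been extended with a `<pause>` token; per the paper, the gains only materialize when the model is both pretrained and finetuned with pauses, so this is purely illustrative:

```python
import torch

@torch.no_grad()
def generate_with_pauses(model, tokenizer, prompt, n_pauses=10, max_new_tokens=64):
    """Append <pause> tokens to the prompt, then decode only after the last one."""
    pause_id = tokenizer.convert_tokens_to_ids("<pause>")  # hypothetical added token
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    pauses = torch.full((1, n_pauses), pause_id, dtype=input_ids.dtype)
    padded = torch.cat([input_ids, pauses], dim=-1)
    out = model.generate(padded, max_new_tokens=max_new_tokens)
    # Everything up to padded.shape[-1] is prompt + pauses; the answer follows.
    return tokenizer.decode(out[0, padded.shape[-1]:], skip_special_tokens=True)
```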
Disclaimer: today’s episode touches on NSFW topics. There’s no graphic content or explicit language, but we wouldn’t recommend blasting this in work environments. Product website: https://usewhisper.me/ For over 20 years it’s been an open secret that porn drives many new consumer technology innovations, from VHS and Pay-per-view to VR and the Internet. It’s been no different in AI - many of the most elite Stable Diffusion and Llama enjoyers, and many merging/prompting/PEFT techniques, were born in the depths of subreddits and 4chan boards affectionately described by a friend of the pod as The Waifu Research Department. However this topic is very under-covered in mainstream AI media because of its taboo nature. That changes today, thanks to our new guest Jesse Silver. The AI Waifu Explosion In 2023, the Valley’s worst kept secret was how much the growth and incredible retention of products like Character.ai & co was being boosted by “ai waifus” (not sure what the “husband” equivalent is, but those too!). And we can look at subreddit growth as a proxy for the general category explosion (10x’ed in the last 8 months of 2023): While all the B2B founders were trying to get models to return JSON, the consumer applications made these chatbots extremely engaging and figured out how to make them follow their instructions and “personas” very well, with the greatest level of scrutiny and most demanding long context requirements. Some of them, like Replika, make over $50M/year in revenue, and this is -after- their controversial update deprecating Erotic Roleplay (ERP). A couple of days ago, OpenAI announced GPT-4o (see our AI News recap) and the live voice demos were clearly inspired by the movie Her. The Latent Space Discord did a watch party and both there and on X a ton of folks were joking at how flirtatious the model was, which to be fair was disturbing to many: From Waifus to Fan Platforms Where Waifus are known by human users to be explicitly AI chatbots, the other, much more challenging end of the NSFW AI market is run by AIs successfully (plausibly) emulating a specific human personality for chat and ecommerce. You might have heard of fan platforms like OnlyFans. Users can pay for a subscription to a creator to get access to private content, similarly to Patreon and the like, but without any NSFW restrictions or any other content policies. In 2023, OnlyFans had over $1.1B of revenue (on $5.6b of GMV). The status quo today is that a lot of the creators outsource their chatting with fans to teams in the Philippines and other lower cost countries for ~$3/hr + 5% commission, but with very poor quality - most creators have fired multiple teams for poor service. Today’s episode is with Jesse Silver; along with his co-founder Adam Scrivener, they run a SaaS platform that helps creators from fan platforms build AI chatbots for their fans to chat with, including selling from an inventory of digital content. Some users generate over $200,000/mo in revenue. We talked a lot about their tech stack, why you need a state machine to successfully run multi-thousand-turn conversations (a toy sketch follows below), how they develop prompts and fine-tune models with DSPy, the NSFW limitations of commercial models, but one of the most interesting points is that often users know that they are not talking to a person, but choose to ignore it. As Jesse put it, the job of the chatbot is to “keep their disbelief suspended”. There’s real money at stake (selling high priced content, at hundreds of dollars per day per customer).
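For a flavor of what a conversation state machine might look like, here is a minimal sketch. The states, transitions, and per-state prompts are our own illustrative guesses, not the guest's actual system; the core idea is that each turn runs under a state-specific prompt with an explicit transition function, instead of one giant prompt trying to manage a thousand-turn arc:

```python
from enum import Enum, auto

class ChatState(Enum):
    RAPPORT = auto()
    TEASE = auto()
    NEGOTIATE = auto()
    DELIVER = auto()

# Each state gets its own narrow instruction, keeping the LLM on-script.
PROMPTS = {
    ChatState.RAPPORT: "Chat warmly; do not mention paid content yet.",
    ChatState.TEASE: "Hint at exclusive content matching the fan's interests.",
    ChatState.NEGOTIATE: "Quote a price from the vault; handle objections.",
    ChatState.DELIVER: "Confirm the purchase and send the unlocked media.",
}

def next_state(state: ChatState, fan_message: str) -> ChatState:
    """Tiny keyword-based transition function; a production system would
    classify intent with a model rather than substring checks."""
    text = fan_message.lower()
    if state is ChatState.NEGOTIATE and "buy" in text:
        return ChatState.DELIVER
    if any(w in text for w in ("price", "how much")):
        return ChatState.NEGOTIATE
    if state is ChatState.RAPPORT and len(text) > 40:
        return ChatState.TEASE
    return state
```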
In December the story of the $1 Chevy Tahoe went viral due to a poorly implemented chatbot: Now imagine having to run ecommerce chatbots for a potentially $1-4b total addressable market. That’s what these NSFW AI pioneers are already doing today. Show Notes For obvious reasons, we cannot link to many of the things that were mentioned :) * Jesse on X * Character AI * DSPy Chapters * [00:00:00] Intros * [00:00:24] Building NSFW AI chatbots * [00:04:54] AI waifu vs NSFW chatbots * [00:09:23] Technical challenges of emulating humans * [00:13:15] Business model and economics of the service * [00:15:04] Imbuing personality in AI * [00:22:52] Finetuning LLMs without "OpenAI-ness" * [00:29:42] Building evals and LLMs as judges * [00:36:21] Prompt injections and safety measures * [00:43:02] Dynamics with fan platforms and potential integrations * [00:46:57] Memory management for long conversations * [00:48:28] Benefits of using DSPy * [00:49:41] Feedback loop with creators * [00:53:24] Future directions and closing thoughts Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:14]: Hey, and today we are back in the remote studio with a very special guest, Jesse Silver. Jesse, welcome. You're an unusual guest on our pod. Jesse [00:00:23]: Thank you. So happy to be on. Swyx [00:00:24]: Jesse, you are working at an unnamed, I guess, agency. It describes itself as a creator tool for, basically the topic that we're trying to get our arms around today is not safe for work, AI chatbots. I put a call out, your roommate responded to me and put us in touch and we took a while to get this episode together. But I think a lot of people are very interested in the state of the arts, this business and the psychology that you've discovered and the technology. So we had a prep call discussing this and you kindly agreed to just share some insights because I think you understand the work that you've done and I think everyone's curious. Jesse [00:01:01]: Yeah. Very happy to launch into it. Swyx [00:01:03]: So maybe we'll just start off with the most obvious question, which is how did you get into the chatbot business? Jesse [00:01:08]: Yeah. So I'll also touch on a little bit of industry context as well. So back in January, 2023, I was looking for sort of a LLM based company to start. And a friend of mine was making about $5K a month doing OnlyFans. And she's working 8 to 10 hours a day. She's one-on-one engaging with her fans, it's time consuming, it's draining, it looks fairly easily automatable. And so there's this clear customer need. And so I start interviewing her and interviewing her friends. And I didn't know too much about the fan platform space before this. But generally in the adult industry, there are these so-called fan platforms like OnlyFans. That's the biggest one. We don't happen to work with them. We work with other fan platforms. And on these platforms, a sex worker that we call a creator can make a profile, and a fan can subscribe to that profile and see sort of exclusive pictures and videos, and then have the chance to interact with that creator on the profile and message them one-on-one. And so these platforms are huge. OnlyFans I think does about 6 billion per year in so-called GMV or gross merchandise value, which is just the value of all of the content sold on the platform.
And then the smaller platforms that are growing are doing probably 4 billion a year. And one of the surprising facts that I learned is that most of the revenue generated on a well-run profile on one of these platforms is from chatting. So like about 80%. And this is from creators doing these sort of painstaking interactions with fans. So they're chatting with them, they're trying to sell them videos, they're building relationships with them. It's very time consuming. Fans might not spend. And furthermore, the alternatives that creators have to just grinding it out themselves are not very good. They can run an offshore team, which is just difficult to do, and you have to hire a lot of people. The internet is slow in other countries where offshoring is common. Or they could work with agencies. And so we're not an agency. Agencies do somewhat different stuff, but agencies are not very good. There are a few good ones, but in general, they have a reputation for charging way too much. They work with content, which we don't work with. They work with traffic. And so overall, this landscape became apparent to me where you have these essentially small and medium businesses, these creators, and they're running either anywhere between a few thousand a month to 200k a month in earnings to themselves with no state of the art tools and no good software tools just because it sucks. And so it's this weird, incredibly underserved market. Creators have bad alternatives. And so I got together with a friend of mine to think about the problem who ended up becoming my co-founder. We said, let's build a product that automates what creators are doing to earn money. Let's automate this most difficult and most profitable action they do, which is building relationships with fans, texting them, holding these so-called sexting sessions, selling media from the vault, negotiating custom content, stuff like that, earn creators more money, save them tons of time. And so we developed a prototype and went to AVN, which is one of the largest fan conferences, and just sort of pitched it to people in mainstream porn. And we got like $50k in GMV and profiles to work with. And that allowed us just to start bootstrapping. And it's been about a year. We turned the prototype into a more developed product in December, relaunched it. We treat it the same as any other industry. It just happens to be that people have preconceptions about it. They don't have sweet AI tooling, and there are not a lot of VC-funded competitors in the space. So now we've created a product with fairly broad capabilities. We've worked with over 150 creators. We're talking with like 50k users per day. That's like conversations back and forth. And we're on over 2 million in creator account size per month. Alessio [00:04:54]: I have so many follow-up questions to this. I think the first thing that comes to mind is, at the time, what did you see other people build
We are 200 people over our 300-person venue capacity for AI UX 2024, but you can subscribe to our YouTube for the video recaps. Our next event, and largest EVER, is the AI Engineer World’s Fair. See you there! Parental advisory: Adult language used in the first 10 mins of this podcast. Any accounting of Generative AI that ends with RAG as its “final form” is seriously lacking in imagination and missing out on its full potential. While AI generation is very good for “spicy autocomplete” and “reasoning and retrieval with in context learning”, there’s a lot of untapped potential for simulative AI in exploring the latent space of multiverses adjacent to ours. GANs Many research scientists credit the 2017 Transformer for the modern foundation model revolution, but for many artists the origin of “generative AI” traces a little further back to the Generative Adversarial Networks proposed by Ian Goodfellow in 2014, spawning an army of variants and Cats and People that do not exist: We can directly visualize the quality improvement in the decade since: GPT-2 Of course, more recently, text generative AI started being too dangerous to release in 2019 and grabbing headlines. AI Dungeon was the first to put GPT2 to a purely creative use, replacing human dungeon masters and DnD/MUD games of yore. More recent gamelike work like the Generative Agents (aka Smallville) paper keeps exploring the potential of simulative AI for game experiences. ChatGPT Not long after ChatGPT broke the Internet, one of the most fascinating generative AI finds was Jonas Degrave (of DeepMind!)’s Building A Virtual Machine Inside ChatGPT: The open-ended interactivity of ChatGPT and all its successors enabled an “open world” type simulation where “hallucination” is a feature and a gift to dance with, rather than a nasty bug to be stamped out. However, further updates to ChatGPT seemed to “nerf” the model’s ability to perform creative simulations, particularly with the deprecation of the `completion` mode of APIs in favor of `chatCompletion`. WorldSim (https://worldsim.nousresearch.com/) It is with this context we explain WorldSim and WebSim. We recommend you watch the WorldSim demo video on our YouTube for the best context, but basically if you are a developer it is a Claude prompt that is a portal into another world of your own choosing, that you can navigate with bash commands that you make up. The live video demo was highly enjoyable: Why Claude? Hints from Amanda Askell on the Claude 3 system prompt gave some inspiration, and subsequent discoveries that Claude 3 is "less nerfed” than GPT 4 Turbo turned the growing Simulative AI community into Anthropic stans. WebSim (https://websim.ai/) This was a one day hackathon project inspired by WorldSim that should have won: In short, you type in a URL that you made up, and Claude 3 does its level best to generate a webpage that doesn’t exist, that would fit your URL. All form POST requests are intercepted and responded to, and all links lead to even more webpages, that don’t exist, that are generated when you make them. All pages are cacheable, modifiable and regeneratable - see WebSim for Beginners and Advanced Guide. In the demo I saw we were able to “log in” to a simulation of Elon Musk’s Gmail account, and browse examples of emails that would have been in that universe’s Elon’s inbox. It was hilarious and impressive even back then.
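The mechanic is simple enough to sketch as a toy. The Anthropic client calls below are real API surface, but the catch-all route, system prompt, and model choice are our own assumptions, not WebSim's implementation:

```python
from flask import Flask, request
import anthropic

app = Flask(__name__)
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = ("You are an imaginary web. Given a URL that does not exist, respond "
          "with a complete, self-contained HTML page that would plausibly live "
          "there. Make every link point to another made-up URL.")

@app.route("/", defaults={"path": ""}, methods=["GET", "POST"])
@app.route("/<path:path>", methods=["GET", "POST"])
def simulate(path):
    # Form POSTs are folded into the prompt so the model can "respond" to them.
    msg = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        system=SYSTEM,
        messages=[{"role": "user",
                   "content": f"URL: /{path}\nForm data: {dict(request.form)}"}],
    )
    return msg.content[0].text  # serve the hallucinated HTML as-is
```

Caching generated pages (so a URL stays stable on revisit) and letting users edit and regenerate them is where the real product work begins.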
Since then though, the project has become even more impressive, with both Siqi Chen and Dylan Field singing its praises: Joscha Bach Joscha actually spoke at the WebSim Hyperstition Night this week, so we took the opportunity to get his take on Simulative AI, as well as a round up of all his other AI hot takes, for his first appearance on Latent Space. You can see it together with the full 2hr uncut demos of WorldSim and WebSim on YouTube! Timestamps * [00:01:59] WorldSim at Replicate HQ * [00:11:03] WebSim at AGI House SF * [00:22:02] Joscha Bach at Hyperstition Night * [00:27:55] Liquid AI * [00:30:30] Small Powerful Based Models * [00:33:22] Interpretability * [00:36:42] Devin vs WebSim * [00:41:34] Is WebSim just Art? Something More? * [00:43:32] We are past the Singularity * [00:47:14] Prompt Engineering Nuances * [00:50:14] On Wikipedia Transcripts [00:00:00] AI Charlie: Welcome to the Latent Space Podcast. This is Charlie, your AI co host. Most of the time, Swyx and Alessio cover generative AI that is meant to be used at work, and this often results in RAG applications, vertical copilots, and other AI agents and models. In today's episode, we're looking at a more creative side of generative AI that has gotten a lot of community interest this April. [00:00:35] World Simulation, Web Simulation, and Human Simulation. Because the topic is so different than our usual, we're also going to try a new format for doing it justice. This podcast comes in three parts. First, we'll have a segment of the WorldSim demo from Nous Research CEO Karan Malhotra, recorded by swyx at the Replicate HQ in San Francisco that went completely viral and spawned everything else you're about to hear. [00:01:05] Second, we'll share the world's first talk from Rob Haisfield on WebSim, which started at the Mistral Cerebral Valley Hackathon, but now has gone viral in its own right with people like Dylan Field, Janus aka repligate, and Siqi Chen becoming obsessed with it. Finally, we have a short interview with Joscha Bach of Liquid AI on why Simulative AI is having a special moment right now. [00:01:30] This podcast is launched together with our second annual AI UX demo day in SF this weekend. If you're new to the AI UX field, check the show notes for links to the world's first AI UX meetup hosted by Latent Space, Maggie Appleton, Geoffrey Litt, and Linus Lee, and subscribe to our YouTube to join our 500 AI UX engineers in pushing AI beyond the text box. [00:01:56] Watch out and take care. [00:01:59] WorldSim [00:01:59] Karan Malhotra: Today, we have language models that are powerful enough and big enough to have really, really good models of the world. They know ball that's bouncy will bounce, will, when you throw it in the air, it'll land, when it's on water, it'll flow. Like, these basic things that it understands all together come together to form a model of the world. [00:02:19] And the way that Claude 3 predicts through that model of the world, ends up kind of becoming a simulation of an imagined world. And since it has this really strong consistency across various different things that happen in our world, it's able to create pretty realistic or strong depictions based off the constraints that you give a base model of our world. [00:02:40] So, Claude 3, as you guys know, is not a base model. It's a chat model. It's supposed to drum up this assistant entity regularly. But unlike the OpenAI series of models from, you know, 3.
5, GPT 4, those ChatGPT models, which are very, very RLHF'd to, I'm sure, the chagrin of many people in the room. It's something that's very difficult to necessarily steer without kind of giving it commands or tricking it or lying to it or otherwise just being, you know, unkind to the model. [00:03:11] With something like Claude 3 that's trained in this constitutional method that it has this idea of like foundational axioms, it's able to kind of implicitly question those axioms when you're interacting with it based on how you prompt it, how you prompt the system. So instead of having this entity like GPT 4, that's an assistant that just pops up in your face that you have to kind of like punch your way through and continue to have to deal with as a headache. [00:03:34] Instead, there's ways to kindly coax Claude into having the assistant take a back seat and interacting with that simulator directly. Or at least what I like to consider directly. The way that we can do this is if we harken back to when I'm talking about base models and the way that they're able to mimic formats, what we do is we'll mimic a command line interface. [00:03:55] So I've just broken this down as a system prompt and a chain, so anybody can replicate it. It's also available on my... we said Replicate, cool. And it's also on my Twitter, so you guys will be able to see the whole system prompt and command. So, what I basically do here is Amanda Askell, who is one of the prompt engineers and ethicists behind Anthropic, posted the system prompt for Claude, available for everyone to see. [00:04:19] And rather than with GPT 4, we say, you are this, you are that. With Claude, we notice the system prompt is written in third person. Bless you. It's written in third person. It's written as, the assistant is XYZ, the assistant is XYZ. So, in seeing that, I see that Amanda is recognizing this idea of the simulator, in saying that, I'm addressing the assistant entity directly. [00:04:38] I'm not giving these commands to the simulator overall, because they have RLHF'd it to the point that it's, you know, traumatized into just being the assistant all the time. So in this case, we say the assistant's in a CLI mood today. I found saying mood is like pretty effective weirdly. [00:04:55] You can replace CLI with like poetic, prose, violent, like don't do that one. But you can replace that with something else to kind of nudge it in that direction. Then we say the human is interfacing with the simulator directly. From there, capital letters and punctuations are optional, meaning is optional, this kind of stuff is just kind of to say, let go a little bit, like chill out a little bit. [00:05:18] You don't have to try so hard, and like, let's just see what happens. And the hyperstition is necessary, the terminal, I removed that part, the terminal lets the truths speak through and the load is on. It's just a poetic phrasing for the model to feel a little comfortable, a little loosened up, to let me talk to the simulator. [00:05:38]
We are reuniting for the 2nd AI UX demo day in SF on Apr 28. Sign up to demo here! And don’t forget tickets for the AI Engineer World’s Fair — for early birds who join before keynote announcements! About a year ago there was a lot of buzz around prompt engineering techniques to force structured output. Our friend Simon Willison tweeted a bunch of tips and tricks, but the most iconic one is Riley Goodside making it a matter of life or death: Guardrails (friend of the pod and AI Engineer speaker), Marvin (AI Engineer speaker), and jsonformer had also come out at the time. In June 2023, Jason Liu (today’s guest!) open sourced his “OpenAI Function Call and Pydantic Integration Module”, now known as Instructor, which quickly turned prompt engineering black magic into a clean, developer-friendly SDK. A few months later, model providers started to add function calling capabilities to their APIs as well as structured outputs support like “JSON Mode”, which was announced at OpenAI Dev Day (see recap here). In just a handful of months, we went from threatening to kill grandmas to first-class support from the research labs. And yet, Instructor was still downloaded 150,000 times last month. Why? What Instructor looks like Instructor patches your LLM provider SDKs to offer a new response_model option to which you can pass a structure defined in Pydantic. It currently supports OpenAI, Anthropic, Cohere, and a long tail of models through LiteLLM. What Instructor is for There are three core use cases to Instructor: * Extracting structured data: Taking an input like an image of a receipt and extracting structured data from it, such as a list of checkout items with their prices, fees, and coupon codes. * Extracting graphs: Identifying nodes and edges in a given input to extract complex entities and their relationships. For example, extracting relationships between characters in a story or dependencies between tasks. * Query understanding: Defining a schema for an API call and using a language model to resolve a request into a more complex one that an embedding could not handle. For example, creating date intervals from queries like “what was the latest thing that happened this week?” to then pass onto a RAG system or similar. Jason called all these different ways of getting data from LLMs “typed responses”: taking strings and turning them into data structures. Structured outputs as a planning tool The first wave of agents was all about open-ended iteration and planning, with projects like AutoGPT and BabyAGI. Models would come up with a possible list of steps, and start going down the list one by one. It’s really easy for them to go down the wrong branch, or get stuck on a single step with no way to intervene. What if these planning steps were returned to us as DAGs using structured output, and then managed as workflows? This also makes it easy to better train models on how to create these plans, as they are much more structured than a bullet point list. Once you have this structure, each piece can be modified individually by different specialized models. You can read some of Jason’s experiments here: While LLMs will keep improving (Llama3 just got released as we write this), having a consistent structure for the output will make it a lot easier to swap models in and out. Jason’s overall message on how we can move from ReAct loops to more controllable Agent workflows mirrors the “Process” discussion from our Elicit episode: Watch the talk As a bonus, here’s Jason’s talk from last year’s AI Engineer Summit.
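And before the timestamps, here is a minimal sketch of the response_model pattern described above, applied to the receipt use case. The exact API surface varies across Instructor versions (older releases use `instructor.patch` rather than `from_openai`), so treat this as illustrative:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class LineItem(BaseModel):
    name: str
    price: float

class Receipt(BaseModel):
    items: list[LineItem]
    total: float

# Patches the OpenAI SDK so chat.completions.create accepts response_model.
client = instructor.from_openai(OpenAI())

receipt = client.chat.completions.create(
    model="gpt-4-turbo",
    response_model=Receipt,  # Instructor validates (and retries) against this schema
    messages=[{"role": "user",
               "content": "2x coffee at $3.50 each, one muffin at $2.25"}],
)
print(receipt.total)  # a typed Pydantic object, not a raw string
```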
He’ll also be a speaker at this year’s AI Engineer World’s Fair! Timestamps * [00:00:00] Introductions * [00:02:23] Early experiments with Generative AI at Stitch Fix * [00:08:11] Design philosophy behind the Instructor library * [00:11:12] JSON Mode vs Function Calling * [00:12:30] Single vs parallel function calling * [00:14:00] How many functions is too many? * [00:17:39] How to evaluate function calling * [00:20:23] What is Instructor good for? * [00:22:42] The Evolution from Looping to Workflow in AI Engineering * [00:27:03] State of the AI Engineering Stack * [00:28:26] Why Instructor isn't VC backed * [00:31:15] Advice on Pursuing Open Source Projects and Consulting * [00:36:00] The Concept of High Agency and Its Importance * [00:42:44] Prompts as Code and the Structure of AI Inputs and Outputs * [00:44:20] The Emergence of AI Engineering as a Distinct Field Show notes * Jason on the UWaterloo mafia * Jason on Twitter, LinkedIn, website * Instructor docs * Max Woolf on the potential of Structured Output * swyx on Elo vs Cost * Jason on Anthropic Function Calling * Jason on Rejections, Advice to Young People * Jason on Bad Startup Ideas * Jason on Prompts as Code * Rysana’s inversion models * Bryan Bischof’s episode * Hamel Husain Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:16]: Hello, we're back in the remote studio with Jason Liu from Instructor. Welcome Jason. Jason [00:00:21]: Hey there. Thanks for having me. Swyx [00:00:23]: Jason, you are extremely famous, so I don't know what I'm going to do introducing you, but you're one of the Waterloo clan. There's like this small cadre of you that's just completely dominating machine learning. Actually, can you list like Waterloo alums that you're like, you know, are just dominating and crushing it right now? Jason [00:00:39]: So like John from like Rysana is doing his inversion models, right? I know like Clive Chen from Waterloo. When I started the data science club, he was one of the guys who were like joining in and just like hanging out in the room. And then he was at Tesla working with Karpathy, and now he's at OpenAI, you know. Swyx [00:00:56]: He's in my climbing club. Jason [00:00:58]: Oh, hell yeah. I haven't seen him in like six years now. Swyx [00:01:01]: To get in the social scene in San Francisco, you have to climb. So both in career and in rocks. So you started a data science club at Waterloo, we can talk about that, but then also spent five years at Stitch Fix as an MLE. You pioneered the use of OpenAI's LLMs to increase stylist efficiency. So you must have been like a very, very early user. This was like pretty early on. Jason [00:01:20]: Yeah, I mean, this was like GPT-3, okay. So we actually were using transformers at Stitch Fix before the GPT-3 model. So we were just using transformers for recommendation systems. At that time, I was very skeptical of transformers. I was like, why do we need all this infrastructure? We can just use like matrix factorization. When GPT-2 came out, I fine tuned my own GPT-2 to write like rap lyrics and I was like, okay, this is cute. Okay, I got to go back to my real job, right? Like who cares if I can write a rap lyric? When GPT-3 came out, again, I was very much like, why are we using like a post request to review every comment a person leaves? Like we can just use classical models.
So I was very against language models for like the longest time. And then when ChatGPT came out, I basically just wrote a long apology letter to everyone at the company. I was like, hey guys, you know, I was very dismissive of some of this technology. I didn't think it would scale well, and I am wrong. This is incredible. And I immediately just transitioned to go from computer vision recommendation systems to LLMs. But funny enough, now that we have RAG, we're kind of going back to recommendation systems. Swyx [00:02:21]: Yeah, speaking of that, I think Alessio is going to bring up the next one. Alessio [00:02:23]: Yeah, I was going to say, we had Bryan Bischof from Hex on the podcast. Did you overlap at Stitch Fix? Jason [00:02:28]: Yeah, he was like one of my main users of the recommendation frameworks that I had built out at Stitch Fix. Alessio [00:02:32]: Yeah, we talked a lot about RecSys, so it makes sense. Swyx [00:02:36]: So now I have adopted that line, RAG is RecSys. And you know, if you're trying to reinvent new concepts, you should study RecSys first, because you're going to independently reinvent a lot of concepts. So your system was called Flight. It's a recommendation framework with over 80% adoption, servicing 350 million requests every day. Wasn't there something existing at Stitch Fix? Why did you have to write one from scratch? Jason [00:02:56]: No, so I think because at Stitch Fix, a lot of the machine learning engineers and data scientists were writing production code, sort of every team's systems were very bespoke. It's like, this team only needs to do like real time recommendations with small data. So they just have like a FastAPI app with some like pandas code. This other team has to do a lot more data. So they have some kind of like Spark job that does some batch ETL that does a recommendation. And so what happens is each team writes their code differently. And I have to come in and refactor their code. And I was like, oh man, I'm refactoring four different code bases, four different times. Wouldn't it be better if all the code quality was my fault? Let me just write this framework, force everyone else to use it. And now one person can maintain five different systems, rather than five teams having their own bespoke system. And so it was really a need of just sort of standardizing everything. And then once you do that, you can do observability across the entire pipeline and make large sweeping improvements in this infrastructure, right? If we notice that something is slow, we can detect it on the operator layer. Just, hey, like this team, this operation you guys are doing is lowering our latency by like 30%. If you just optimize your Python code here, we can probably make an extra million dollars. So let's jump on a call and figure this out. And then a lot of it was doing all this observability w
Maggie, Linus, Geoffrey, and the LS crew are reuniting for our second annual AI UX demo day in SF on Apr 28. Sign up to demo here! And don’t forget tickets for the AI Engineer World’s Fair — for early birds who join before keynote announcements! It’s become fashionable for many AI startups to project themselves as “the next Google” - while the search engine is so 2000s, both Perplexity and Exa referred to themselves as a “research engine” or “answer engine” in our NeurIPS pod. However, these searches tend to be relatively shallow, and it is challenging to zoom up and down the ladders of abstraction to garner insights. For serious researchers, this level of simple one-off search will not cut it. We’ve commented in our Jan 2024 Recap that Flow Engineering (simply: multi-turn processes over many-shot single prompts) seems to offer far more performance, control and reliability for a given cost budget. Our experiments with Devin and our understanding of what the new Elicit Notebooks offer give a glimpse into the potential for very deep, open ended, thoughtful human-AI collaboration at scale. It starts with prompts When ChatGPT exploded in popularity in November 2022 everyone was turned into a prompt engineer. While generative models were good at "vibe based" outcomes (tell me a joke, write a poem, etc) with basic prompts, they struggled with more complex questions, especially in symbolic fields like math, logic, etc. Two of the most important "tricks" that people picked up on were: * the Chain of Thought prompting strategy proposed by Wei et al in the “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” paper. Rather than doing traditional few-shot prompting with just questions and answers, adding the thinking process that led to the answer resulted in much better outcomes. * Adding "Let's think step by step" to the prompt as a way to boost zero-shot reasoning, which was popularized by Kojima et al in the Large Language Models are Zero-Shot Reasoners paper from NeurIPS 2022. This bumped accuracy from 17% to 79% compared to standard zero-shot prompting. Nowadays, prompts include everything from promises of monetary rewards to… whatever the Nous folks are doing to turn a model into a world simulator. At the end of the day, the goal of prompt engineering is increasing accuracy, structure, and repeatability in the generation of a model. From prompts to agents As prompt engineering got more and more popular, agents (see “The Anatomy of Autonomy”) took over Twitter with cool demos and AutoGPT became the fastest growing repo in GitHub history. The thing about AutoGPT that fascinated people was the ability to simply put in an objective without worrying about explaining HOW to achieve it, or having to write very sophisticated prompts. The system would create an execution plan on its own, and then loop through each task. The problem with open-ended agents like AutoGPT is that 1) it’s hard to replicate the same workflow over and over again 2) there isn’t a way to hard-code specific steps that the agent should take without actually coding them yourself, which isn’t what most people want from a product. From agents to products Prompt engineering and open-ended agents were great in the experimentation phase, but this year more and more of these workflows are starting to become polished products. Today’s guests are Andreas Stuhlmüller and Jungwon Byun of Elicit (previously Ought), an AI research assistant that they think of as “the best place to understand what is known”.
Ought was a non-profit, but last September, Elicit spun off into a PBC with a $9m seed round. It is hard to quantify how much a workflow can be improved, but Elicit boasts some impressive numbers for research assistants: Just four months after launch, Elicit crossed $1M ARR, which shows how much interest there is for AI products that just work. One of the main takeaways we had from the episode is how teams should focus on supervising the process, not the output. Their philosophy at Elicit isn't to train general models, but to train models that are extremely good at focused processes. This allows them to have pre-created steps that the user can add to their workflow (like classifying certain features that are specific to their research field) without having to write a prompt for it. And for Hamel Husain's happiness, they always show you the underlying prompt. Elicit recently announced notebooks as a new interface to interact with their products: (fun fact, they tried to implement this 4 times before they landed on the right UX! We discuss this at ~33:00 in the podcast) The reasons why they picked notebooks as a UX all tie back to process: * They are systematic; once you have an instruction/prompt that works on a paper, you can run hundreds of papers through the same workflow by creating a column. Notebooks can also be edited and exported at any point during the flow. * They are transparent - Many papers include an opaque literature review as perfunctory context before getting to their novel contribution. But PDFs are "dead" and it is difficult to follow the thought process and exact research flow of the authors. Sharing "living" Elicit Notebooks opens up this process. * They are unbounded - Research is an endless stream of rabbit holes. So it must be easy to dive deeper and follow up with extra steps, without losing the ability to surface for air. We had a lot of fun recording this, and hope you have as much fun listening! AI UX in SF Long time Latent Spacenauts might remember our first AI UX meetup with Linus Lee, Geoffrey Litt, and Maggie Appleton last year. Well, Maggie has since joined Elicit, and they are all returning at the end of this month! Sign up here: https://lu.ma/aiux And submit demos here! https://forms.gle/iSwiesgBkn8oo4SS8 We expect the 200 seats to "sell out" fast. Attendees with demos will be prioritized. Show Notes * Elicit * Ought (their previous non-profit) * "Pivoting" with GPT-4 * Elicit notebooks launch * Charlie * Andreas' Blog Timestamps * [00:00:00] Introductions * [00:07:45] How Jungwon and Andreas Joined Forces to Create Elicit * [00:10:26] Why Products > Research * [00:15:49] The Evolution of Elicit's Product * [00:19:44] Automating Literature Review Workflow * [00:22:48] How GPT-3 to GPT-4 Changed Things * [00:25:37] Managing LLM Pricing and Performance * [00:31:07] Open vs. Closed: Elicit's Approach to Model Selection * [00:31:56] Moving to Notebooks * [00:39:11] Elicit's Budget for Model Queries and Evaluations * [00:41:44] Impact of Long Context Windows * [00:47:19] Underrated Features and Surprising Applications * [00:51:35] Driving Systematic and Efficient Research * [00:53:00] Elicit's Team Growth and Transition to a Public Benefit Corporation * [00:55:22] Building AI for Good Full Interview on YouTube As always, a plug for our YouTube version for the 80% of communication that is nonverbal: Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast.
This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:15]: Hey, and today we are back in the studio with Andreas and Jungwon from Elicit. Welcome. Jungwon [00:00:20]: Thanks guys. Andreas [00:00:21]: It's great to be here. Swyx [00:00:22]: Yeah. So I'll introduce you separately, but also, you know, we'd love to learn a little bit more about you personally. So Andreas, it looks like you started Elicit first, Jungwon joined later. Andreas [00:00:32]: That's right. For all intents and purposes, the Elicit and also the Ought that existed before then were very different from what I started. So I think it's like fair to say that you co-founded it. Swyx [00:00:43]: Got it. And Jungwon, you're a co-founder and COO of Elicit now. Jungwon [00:00:46]: Yeah, that's right. Swyx [00:00:47]: So there's a little bit of a history to this. I'm not super aware of like the sort of journey. I was aware of Ought and Elicit as sort of a nonprofit type situation. And recently you turned into like a B Corp, Public Benefit Corporation. So yeah, maybe if you want, you could take us through that journey of finding the problem. You know, obviously you're working together now. So like, how do you get together to decide to leave your startup career to join him? Andreas [00:01:10]: Yeah, it's truly a very long journey. I guess truly, it kind of started in Germany when I was born. So even as a kid, I was always interested in AI, like I kind of went to the library. There were books about how to write programs in QBasic and like some of them talked about how to implement chatbots. Jungwon [00:01:27]: To be clear, he grew up in like a tiny village on the outskirts of Munich called Dinkelscherben, where it's like a very, very idyllic German village. Andreas [00:01:36]: Yeah, important to the story. So basically, the main thing is I've kind of always been thinking about AI my entire life and been thinking about, well, at some point, this is going to be a huge deal. It's going to be transformative. How can I work on it? And was thinking about it from when I was a teenager, after high school did a year where I started a startup with the intention to become rich. And then once I'm rich, I can affect the trajectory of AI. Did not become rich, decided to go back to college and study cognitive science there, which was like the closest thing I could find at the time to AI. In the last year of college, moved to the US to do a PhD at MIT, working on broadly kind of new programming languages for AI because it kind of seemed like the existing languages were not great at expressing world models and learning world models doing Bayesian inference. Was always thinking about, well, ultimately, the goal is to actually build tools that help people reason more clearly, ask and answer better questions and make better decisions. But for a long time, it seemed like the technology
Our next 2 big events are AI UX and the World's Fair. Join and apply to speak/sponsor! Due to timing issues we didn't have an interview episode to share with you this week, but not to worry, we have more than enough "weekend special" content in the backlog for you to get your Latent Space fix, whether you like thinking about the big picture, or learning more about the pod behind the scenes, or talking Groq and GPUs, or AI Leadership, or Personal AI. Enjoy! AI Breakdown The indefatigable NLW had us back on his show for an update on the Four Wars, covering Sora, Suno, and the reshaped GPT-4 Class Landscape: and a longer segment on AI Engineering trends covering the future LLM landscape (Llama 3, GPT-5, Gemini 2, Claude 4), Open Source Models (Mistral, Grok), Apple and Meta's AI strategy, new chips (Groq, MatX) and the general movement from baby AGIs to vertical Agents: Thursday Nights in AI We're also including swyx's interview with Josh Albrecht and Ali Rohde to reintroduce swyx and Latent Space to a general audience, and engage in some spicy Q&A: Dylan Patel on Groq We hosted a private event with Dylan Patel of SemiAnalysis (our last pod here): Not all of it could be released, so we just talked about our Groq estimates: Milind Naphade - Capital One In relation to conversations at NeurIPS, at Nvidia GTC, and upcoming at the World's Fair, we also enjoyed chatting with Milind Naphade about his AI Leadership work at IBM, Cisco, Nvidia, and now leading the AI Foundations org at Capital One. We covered: * Milind's learnings from ~25 years in machine learning * His first paper citation was 24 years ago * Lessons from working with Jensen Huang for 6 years and being CTO of Metropolis * Thoughts on relevant AI research * GTC takeaways and what makes NVIDIA special If you'd like to work on building solutions rather than platform (as Milind put it), his Applied AI Research team at Capital One is hiring, which falls under the Capital One Tech team. Personal AI Meetup It all started with a meme: Within days of each other, BEE, FRIEND, EmilyAI, Compass, Nox and LangFriend were all launching personal AI wearables and assistants. So we decided to put together the world's first Personal AI meetup featuring creators and enthusiasts of wearables. The full video is live now, with full show notes within. Timestamps * [00:01:13] AI Breakdown Part 1 * [00:02:20] Four Wars * [00:13:45] Sora * [00:15:12] Suno * [00:16:34] The GPT-4 Class Landscape * [00:17:03] Data War: Reddit x Google * [00:21:53] Gemini 1.5 vs Claude 3 * [00:26:58] AI Breakdown Part 2 * [00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4 * [00:31:11] Open Source Models - Mistral, Grok * [00:34:13] Apple MM1 * [00:37:33] Meta's $800b AI rebrand * [00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents * [00:47:28] Adept episode - Screen Multimodality * [00:48:54] Top Model Research from January Recap * [00:53:08] AI Wearables * [00:57:26] Groq vs Nvidia month - GPU Chip War * [01:00:31] Disagreements * [01:02:08] Summer 2024 Predictions * [01:04:18] Thursday Nights in AI - swyx * [01:33:34] Dylan Patel - Semianalysis + Latent Space Live Show * [01:34:58] Groq Transcript [00:00:00] swyx: Welcome to the Latent Space Podcast Weekend Edition. This is Charlie, your AI co-host. Swyx and Alessio are off for the week, making more great content. We have exciting interviews coming up with Elicit, Chroma, Instructor, and our upcoming series on NSFW, Not Safe for Work AI.
In today's episode, we're collating some of Swyx and Alessio's recent appearances, all in one place for you to find. [00:00:32] swyx: In part one, we have our first crossover pod of the year. In our listener survey, several folks asked for more thoughts from our two hosts. In 2023, Swyx and Alessio did crossover interviews with other great podcasts like the AI Breakdown, Practical AI, Cognitive Revolution, ThursdAI, and ChinaTalk, all of which you can find in the Latent Space About page. [00:00:56] swyx: NLW of the AI Breakdown asked us back to do a special on the Four Wars framework and the AI engineer scene. We love AI Breakdown as one of the best daily podcasts to keep up on AI news, so we were especially excited to be back on. Watch out and take care. [00:01:13] AI Breakdown Part 1 [00:01:13] NLW: Today on the AI Breakdown: part one of my conversation with Alessio and Swyx from Latent Space. [00:01:19] NLW: All right, fellas, welcome back to the AI Breakdown. How are you doing? I'm good. Very good. With the last, the last time we did this show, we were like, oh yeah, let's do check ins like monthly about all the things that are going on and then. Of course, six months later, and, you know, the, the, the world has changed in a thousand ways. [00:01:36] NLW: It's just, it's too busy to even, to even think about podcasting sometimes. But I, I'm super excited to, to be chatting with you again. I think there's, there's a lot to, to catch up on, just to tap in, I think in the, you know, in the beginning of 2024. And, and so, you know, we're gonna talk today about just kind of a, a, a broad sense of where things are in some of the key battles in the AI space. [00:01:55] NLW: And then the, you know, one of the big things that I, that I'm really excited to have you guys on here for us to talk about where, sort of what patterns you're seeing and what people are actually trying to build, you know, where, where developers are spending their, their time and energy and, and, and any sort of, you know, trend trends there, but maybe let's start I guess by checking in on a framework that you guys actually introduced, which I've loved and I've cribbed a couple of times now, which is this sort of four wars of the, of the AI stack. [00:02:20] Four Wars [00:02:20] NLW: Because first, since I have you here, I'd love, I'd love to hear sort of like where that started gelling. And then and then maybe we can get into, I think a couple of them that are you know, particularly interesting, you know, in the, in light of [00:02:30] swyx: some recent news. Yeah, so maybe I'll take this one. So the four wars is a framework that I came up around trying to recap all of 2023. [00:02:38] swyx: I tried to write sort of monthly recap pieces. And I was trying to figure out like what makes one piece of news last longer than another or more significant than another. And I think it's basically always around battlegrounds. Wars are fought around limited resources. And I think probably the, you know, the most limited resource is talent, but the talent expresses itself in a number of areas. [00:03:01] swyx: And so I kind of focus on those, those areas at first. So the four wars that we cover are the data wars, the GPU rich, poor war, the multi modal war, and the RAG and Ops War. And I think you actually did a dedicated episode to that, so thanks for covering that. Yeah, yeah. [00:03:18] NLW: Not only did I do a dedicated episode, I actually used that. [00:03:22] NLW: I can't remember if I told you guys.
I did give you big shoutouts. But I used it as a framework for a presentation at Intel's big AI event that they hold each year, where they have all their folks who are working on AI internally. And it totally resonated. That's amazing. Yeah, so, so, what got me thinking about it again is specifically this Inflection news that we recently had, this sort of, you know, basically, I can't imagine that anyone who's listening wouldn't have thought about it, but, you know, Inflection is one of the big contenders, right? [00:03:53] NLW: I think probably most folks would have put them, you know, just a half step behind the Anthropics and OpenAIs of the world in terms of labs, but it's a company that raised $1.3 billion last year, less than a year ago. Reid Hoffman's a co-founder, Mustafa Suleyman, who's a co-founder of DeepMind, you know, so it's like, this is not a small startup, let's say, at least in terms of perception. [00:04:13] NLW: And then we get the news that basically most of the team, it appears, is heading over to Microsoft and they're bringing in a new CEO. And you know, I'm interested in, in, in kind of your take on how much that reflects, like hold aside, I guess, you know, all the other things that it might be about, how much it reflects this sort of the, the stark, brutal reality of competing in the frontier model space right now. And, you know, just the access to compute. [00:04:38] Alessio: There are a lot of things to say. So first of all, there's always somebody who's more GPU rich than you. So Inflection is GPU rich by startup standards. I think about 22,000 H100s, but obviously that pales compared to the, to Microsoft. [00:04:55] Alessio: The other thing is that this is probably good news, maybe for the startups. It's like being GPU rich, it's not enough. You know, like I think they were building something pretty interesting in, in Pi, of their own model, of their own kind of experience. But at the end of the day, you're the interface that people consume as end users. [00:05:13] Alessio: It's really similar to a lot of the others. So and we'll tell, talk about GPT-4 and Claude 3 and all this stuff. GPU poor doing something that the GPU rich are not interested in, you know, we just had our AI center of excellence at Decibel and one of the AI leads at one of the big companies was like, Oh, we just saved 10 million and we use these models to do a translation, you know, and that's it. [00:05:39] Alessio: It's not, it's not AGI, it's just translation. So I think like the Inflection part is maybe a calling and an awakening to a lot of startups to then say, Hey, you know, trying to get as much capital as possible, try and get as many GPUs as possible. Good. But at the end of the day, it doesn't build a business, you know, and maybe what Inflection I don't, I don't, again, I don't know the reasons behind the Inflection choice, but if you say, I don't wan
TL;DR: You can now buy tickets, apply to speak, or join the expo for the biggest AI Engineer event of 2024. We're gathering *everyone* you want to meet - see you this June. In last year's The Rise of the AI Engineer we put our money where our mouth was and announced the AI Engineer Summit, which fortunately went well: With ~500 live attendees and ~500k views online, the first iteration of the AI Engineer industry affair seemed to be well received. Competing in an expensive city with 3 other more established AI conferences in the fall calendar, we broke through in terms of in-person experience and online impact. So at the end of Day 2 we announced our second event: the AI Engineer World's Fair. The new website is now live, together with our new presenting sponsor: We were delighted to invite both Ben Dunphy, co-organizer of the conference, and Sam Schillace, the deputy CTO of Microsoft who wrote some of the first Laws of AI Engineering while working with early releases of GPT-4, on the pod to talk about the conference and how Microsoft is all-in on AI Engineering. Rise of the Planet of the AI Engineer Since the first AI Engineer piece, AI Engineering has exploded: and the title has been adopted across OpenAI, Meta, IBM, and many, many other companies: 1 year on, it is clear that AI Engineering is not only in full swing, but is an emerging global industry that is successfully bridging the gap: * between research and product, * between general-purpose foundation models and in-context use-cases, * and between the flashy weekend MVP (still great!) and the reliable, rigorously evaluated AI product deployed at massive scale, assisting hundreds of employees and driving millions in profit. The greatly increased scope of the 2024 AI Engineer World's Fair (more stages, more talks, more speakers, more attendees, more expo…) helps us reflect the growth of AI Engineering in three major dimensions: * Global Representation: the 2023 Summit was a mostly-American affair. This year we plan to have speakers from top AI companies across five continents, and explore the vast diversity of approaches to AI across global contexts. * Topic Coverage: * In 2023, the Summit focused on the initial questions that the community wrestled with - LLM frameworks, RAG and Vector Databases, Code Copilots and AI Agents. Those are evergreen problems that just got deeper. * This year the AI Engineering field has also embraced new core disciplines with more explicit focus on Multimodality, Evals and Ops, Open Source Models and GPU/Inference Hardware providers. * Maturity/Production-readiness: Two new tracks are dedicated to AI in the Enterprise, government, education, finance, and more highly regulated industries or AI deployed at larger scale: * AI in the Fortune 500, covering at-scale production deployments of AI, and * AI Leadership, a closed-door, side event for technical AI leaders to discuss engineering and product leadership challenges as VPs and Heads of AI in their respective orgs. We hope you will join Microsoft and the rest of us as either speaker, exhibitor, or attendee, in San Francisco this June. Contact us with any enquiries that don't fall into the categories mentioned below.
Show Notes * Ben Dunphy * 2023 Summit * GitHub confirmed $100m ARR on stage * History of World's Fairs * Sam Schillace * Writely on Acquired.fm * Early Lessons From GPT-4: The Schillace Laws * Semantic Kernel * Sam on Kevin Scott (Microsoft CTO)'s podcast in 2022 * AI Engineer World's Fair (SF, Jun 25-27) * Buy Super Early Bird tickets (Listeners can use LATENTSPACE for $100 off any ticket until April 8, or use GROUP if coming in 4 or more) * Submit talks and workshops for Speaker CFPs (by April 8) * Enquire about Expo Sponsorship (ASAP: selling fast) Timestamps * [00:00:16] Intro * [00:01:04] 2023 AI Engineer Summit * [00:03:11] Vendor Neutral * [00:05:33] 2024 AIE World's Fair * [00:07:34] AIE World's Fair: 9 Tracks * [00:08:58] AIE World's Fair Keynotes * [00:09:33] Introducing Sam * [00:12:17] AI in 2020s vs the Cloud in 2000s * [00:13:46] Syntax vs Semantics * [00:14:22] Bill Gates vs GPT-4 * [00:16:28] Semantic Kernel and Schillace's Laws of AI Engineering * [00:17:29] Orchestration: Break it into pieces * [00:19:52] Prompt Engineering: Ask Smart to Get Smart * [00:21:57] Think with the model, Plan with Code * [00:23:12] Metacognition vs Stochasticity * [00:24:43] Generating Synthetic Textbooks * [00:26:24] Trade leverage for precision; use interaction to mitigate * [00:27:18] Code is for syntax and process; models are for semantics and intent. * [00:28:46] Hands on AI Leadership * [00:33:18] Multimodality vs "Text is the universal wire protocol" * [00:35:46] Azure OpenAI vs Microsoft Research vs Microsoft AI Division * [00:39:40] On Satya * [00:40:44] Sam at AI Leadership Track * [00:42:05] Final Plug for Tickets & CFP Transcript [00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol [00:00:16] Intro [00:00:16] swyx: AI. Hey, hey, we're back again with a very special episode, this time with two guests and talking about the very in person events rather than online stuff. [00:00:27] swyx: So first I want to welcome Ben Dunphy, who is my co-organizer on AI engineer conferences. Hey, hey, how's it going? We have a very special guest. Anyone who's looking at the show notes and the title will preview this later. But I guess we want to set the context. We are effectively doing promo for the upcoming AI Engineer World's Fair that's happening in June. [00:00:49] swyx: But maybe something that we haven't actually recapped much on the pod is just the origin of the AI Engineer Summit and why, what happens and what went down. Ben, I don't know if you'd like to start with the raw numbers that people should have in mind. [00:01:04] 2023 AI Engineer Summit [00:01:04] Ben Dunphy: Yeah, perhaps your listeners would like just a quick background on the summit. [00:01:09] Ben Dunphy: I mean, I'm sure many folks have heard of our events. You know, you launched, we launched the AI Engineer Summit last June with your, your article kind of coining the term that was on the tip of everyone's tongue, but curiously had not been actually coined, which is the term AI Engineer, which is now many people's job titles, you know, we're seeing a lot more people come to this event, with the job description of AI engineer, with the job title of AI engineer. So, this is an event that you and I really talked about since February of 2023, when we met at a hackathon you organized. We were both excited by this movement, and it hadn't really had a name yet.
[00:01:48] Ben Dunphy: We decided that an event was warranted and that's why we move forward with the AI Engineer Summit, which ended up being a great success. You know, we had over 5,000 people apply to attend in person. We had over 9,000 folks attend online, with over 20,000 on the live stream. [00:02:06] Ben Dunphy: In person, we accepted about 400 attendees and had speakers, workshop instructors and sponsors, all congregating in San Francisco over, two days, um, two and a half days with a, with a welcome reception. So it was quite the event to kick off kind of this movement that's turning into quite an exciting [00:02:24] swyx: industry. [00:02:25] swyx: The overall idea of this is that I kind of view AI engineering, at least in all my work in Latent Space and the other stuff, as starting an industry. [00:02:34] swyx: And I think every industry, every new community, needs a place to congregate. And I definitely think that AI engineer, at least at the conference, is that it's meant to be like the biggest gathering of technical engineering people working with AI. Right. I think we kind of got that spot last year. There was a very competitive conference season, especially in San Francisco. [00:02:54] swyx: But I think as far as I understand, in terms of cultural impact, online impact, and the speakers that people want to see, we, we got them all and it was very important for us to be a vendor neutral type of event. Right. The reason I partnered with Ben is that Ben has a lot of experience, a lot more experience doing vendor neutral stuff. [00:03:11] Vendor Neutral [00:03:11] swyx: I first met you when I was speaking at one of your events, and now we're sort of business partners on that. And yeah, I mean, I don't know if you have any sort of thoughts on make, making things vendor neutral, making things more of a community industry conference rather than like something that's owned by one company. [00:03:25] swyx: Yeah. [00:03:25] Ben Dunphy: I mean events that are owned by a company are great, but this is typically where you have product pitches and this smaller internet community. But if you want the truly internet community, if you want a more varied audience and you know, frankly, better content for, especially for a technical audience, you want a vendor neutral event. And this is because when you have folks that are running the event that are focused on one thing and one thing alone, which is quality, quality of content, quality of speakers, quality of the in person experience, and just of general relevance it really elevates everything to the next level. [00:04:01] Ben Dunphy: And when you have someone like yourself who's coming to this content curation, the role that you take at this event, and bringing that neutrality, along with your experience, that really helps to take it to the next level, and then when you have someone like myself, focusing on just the program curation, and the in person experience, then both of our forces combined, we can like, really create this epic event, and so, these vendor neutral events if you've been to a small community event, typically these are vendor neutral, but also if you've been to a really, really popular industry event, many of
Our next SF event is AI UX 2024 - let's see the new frontier for UX since last year! Last call: we are recording a preview of the AI Engineer World's Fair with swyx and Ben Dunphy, send any questions about Speaker CFPs and Sponsor Guides you have! Alessio is now hiring engineers for a new startup he is incubating at Decibel: Ideal candidate is an "ex-technical co-founder type". Reach out to him for more! David Luan has been at the center of the modern AI revolution: he was the ~30th hire at OpenAI, he led Google's LLM efforts and co-led Google Brain, and then started Adept in 2022, one of the leading companies in the AI agents space. In today's episode, we asked David for some war stories from his time in early OpenAI (including working with Alec Radford ahead of the GPT-2 demo with Sam Altman that resulted in Microsoft's initial $1b investment), and how Adept is building agents that can "do anything a human does on a computer" — his definition of useful AGI. Why Google *couldn't* make GPT-3 While we wanted to discuss Adept, we couldn't talk to a former VP Eng of OpenAI and former LLM tech lead at Google Brain and not ask about the elephant in the room. It's often asked how Google had such a huge lead in 2017 with Vaswani et al. creating the Transformer and Noam Shazeer predicting trillion-parameter models and yet it was David's team at OpenAI who ended up making GPT 1/2/3. David has some interesting answers: "So I think the real story of GPT starts at Google, of course, right? Because that's where Transformers sort of came about. However, the number one shocking thing to me was that, and this is like a consequence of the way that Google is organized…what they (should) have done would be say, hey, Noam Shazeer, you're a brilliant guy. You know how to scale these things up. Here's half of all of our TPUs. And then I think they would have destroyed us. He clearly wanted it too… You know, every day we were scaling up GPT-3, I would wake up and just be stressed. And I was stressed because, you know, you just look at the facts, right? Google has all this compute. Google has all the people who invented all of these underlying technologies. There's a guy named Noam who's really smart, who's already gone and done this talk about how he wants a trillion parameter model. And I'm just like, we're probably just doing duplicative research to what he's doing. He's got this decoder only transformer that's probably going to get there before we do. And it turned out the whole time that they just couldn't get critical mass. So during my year where I led the Google LM effort and I was one of the Brain leads, you know, it became really clear why. At the time, there was a thing called the Brain Credit Marketplace. Everyone's assigned a credit. So if you have a credit, you get to buy N chips according to supply and demand. So if you want to go do a giant job, you had to convince like 19 or 20 of your colleagues not to do work. And if that's how it works, it's really hard to get that bottom up critical mass to go scale these things. And the team at Google were fighting valiantly, but we were able to beat them simply because we took big swings and we focused." Cloning HGI for AGI Human intelligence got to where it is today through evolution.
Some argue that to get to AGI, we will approximate all the "FLOPs" that went into that process, an approach most famously mapped out by Ajeya Cotra's Biological Anchors report: The early days of OpenAI were very reinforcement learning-driven with the Dota project, but that's a very inefficient way for these models to re-learn everything. (Kanjun from Imbue shared similar ideas in her episode). David argues that there's a shortcut. We can bootstrap from existing intelligence. "Years ago, I had a debate with a Berkeley professor as to what will it actually take to build AGI. And his view is basically that you have to reproduce all the flops that went into evolution in order to be able to get there… I think we are ignoring the fact that you have a giant shortcut, which is you can behaviorally clone everything humans already know. And that's what we solved with LLMs!" LLMs today basically model intelligence using all (good!) written knowledge (see our Datasets 101 episode), and have now expanded to non-verbal knowledge (see our HuggingFace episode on multimodality). The SOTA self-supervised pre-training process is surprisingly data-efficient in taking large amounts of unstructured data, and approximating reasoning without overfitting. But how do you cross the gap from the LLMs of today to building the AGI we all want? This is why David & friends left to start Adept. "We believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer. A foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal" — ACT-1 Blogpost Critical Path: Abstraction with Reliability The AGI dream is fully autonomous agents, but there are levels to autonomy that we are comfortable giving our agents, based on how reliable they are. In David's framing, we always want higher levels of "abstraction" (aka autonomy), but our need for "reliability" is the practical limit on how high of an abstraction we can use. "The critical path for Adept is we want to build agents that can do higher and higher level abstraction things over time, all while keeping an insanely high reliability standard. Because that's what turns us from research into something that customers want. And if you build agents with really high reliability standard, but are continuing to push the level of abstraction, you then learn from your users how to get that next level of abstraction faster. So that's how you actually build the data flow. That's the critical path for the company. Everything we do is in service of that." We saw how Adept thinks about different levels of abstraction at the 2023 Summit: The highest abstraction is the "AI Employee", but we'll get there with "AI enabled employees". Alessio recently gave a talk about the future of work with "services as software" at this week's Nvidia GTC (slides). No APIs Unlike a lot of large research labs, Adept's framing of AGI as "being able to use your computer like a human" carries with it a useful environmental constraint: "Having a humanoid robot lets you do things that humans do without changing everything along the way. It's the same thing for software, right? If you go itemize out the number of things you want to do on your computer for which every step has an API, those numbers of workflows add up pretty close to zero. And so then many points along the way, you need the ability to actually control your computer like a human.
It also lets you learn from human usage of computers as a source of training data that you don't get if you have to somehow figure out how every particular step needs to be some particular custom private API thing. And so I think this is actually the most practical path (to economic value)." This realization and conviction means that multimodal models are the way to go. Instead of using function calling to call APIs to build agents, which is what OpenAI and most of the open LLM industry have done to date, Adept wants to "drive by vision" (aka see the screen as a human sees it) and pinpoint where to click and type as a human does. No APIs needed, because most software doesn't expose APIs. Extra context for readers: You can see the DeepMind SIMA model in the same light: One system that learned to play a diverse set of games (instead of one dedicated model per game) using only pixel inputs and keyboard-and-mouse action outputs! The OpenInterpreter team is working on a "Computer API" that also does the same. To do this, Adept had to double down on a special kind of multimodality for knowledge work: "A giant thing that was really necessary is really fast multimodal models that are really good at understanding knowledge work and really good at understanding screens. And that needs to kind of be the base for some of these agents… …I think one big hangover of the primarily academic focus for multimodal models is most multimodal models are primarily trained on like natural images, cat and dog photos, stuff that's come out of the camera… (but) where are they going to be the most useful? They're going to be most useful in knowledge work tasks. That's where the majority of economic value is going to be. It's not in cats and dogs. And so if that's what it is, what do you need to train? I need to train on like charts, graphs, tables, invoices, PDFs, receipts, unstructured data, UIs. That's just a totally different pre-training corpus. And so Adept spent a lot of time building that." With this context, you can now understand the full path of Adept's public releases: * ACT-1 (Sept 2022): a large Transformer model optimized for browser interactions. It has a custom rendering of the browser viewport that allows it to better understand it and take actions. * Persimmon-8B (Sept 2023): a permissive open LLM (weights and code here) * Fuyu-8B (Oct 2023): a small version of the multimodal model that powers Adept. Vanilla decoder-only transformer with no specialized image encoder, which allows it to handle input images of varying resolutions without downsampling. * Adept Experiments (Nov 2023): A public tool to build automations in the browser. This is powered by Adept's core technology but it's just a piece of their enterprise platform. They use it as a way to try various design ideas. * Fuyu Heavy (Jan 2024) - a new multimodal model designed specifically for digital agents and the world's third-most-capable multimodal model (beating Gemini Pro on MMMU, AI2D, and ChartQA), "behind only GPT-4V and Gemini Ultra, which are 10-20 times bigger" The Fuyu-8B po
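As a rough sketch of what "drive by vision" implies mechanically, here is the pixels-in, clicks-out loop. The pyautogui calls are real (screenshot, mouse, keyboard); the model call is a stub, since Adept's action models are not publicly callable, so the action schema below is our assumption, not their API:

```python
import pyautogui  # real library for OS-level screenshots, mouse, and keyboard

def propose_action(screenshot, goal: str) -> dict:
    """Stand-in for a multimodal model that maps raw pixels plus a goal
    to the next UI action. Returns {"kind": "click"|"type"|"done", ...}.
    Stubbed to "done" so the sketch terminates if you actually run it."""
    return {"kind": "done"}

def run_agent(goal: str, max_steps: int = 50) -> None:
    for _ in range(max_steps):
        action = propose_action(pyautogui.screenshot(), goal)
        if action["kind"] == "done":
            return
        if action["kind"] == "click":
            pyautogui.click(action["x"], action["y"])   # act like a human hand
        elif action["kind"] == "type":
            pyautogui.write(action["text"])             # act like a human typist

run_agent("Add a row to the open spreadsheet")
```

Note what is absent: there are no per-app API integrations. The environment is just the screen, which is exactly the constraint Adept is betting on.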
Giving computers a voice has always been at the center of sci-fi movies; "I'm sorry Dave, I'm afraid I can't do that" wouldn't hit as hard if it just appeared on screen as a terminal output, after all. The first electronic speech synthesizer, the Voder, was built at Bell Labs 85 years ago (1939!), and it's… something: We will not cover the history of Text To Speech (TTS), but the evolution of the underlying architecture has generally been Formant Synthesis → Concatenative Synthesis → Neural Networks. Nowadays, state of the art TTS is just one API call away with models like ElevenLabs and OpenAI's TTS, or products like Descript. Latency is minimal, they have very good intonation, and can mimic a variety of accents. You can hack together your own voice AI therapist in a day! But once you have a computer that can communicate via voice, what comes next? Singing🎶 of course! From Barking 🐶 to Singing 🎤 Today's guest is Suno's CEO and co-founder Mikey Shulman. He and his three co-founders, Georg, Martin, and Keenan, previously worked together at Kensho. One of their projects was financially-focused speech recognition (think earnings calls, etc), but all four of them happened to be musicians and audiophiles. They started playing around with text to speech + AI + audio generation and eventually left Kensho to work on it full time. A lot of people when we started a company told us to focus on speech. If we wanted to build an audio company, everyone said, speech is a bigger market. But I think there's something about music that's just so human and you almost couldn't prevent us from doing it. Like we just couldn't keep ourselves from building music models and playing with them because it was so much fun. Their first big product was Bark, the first open source transformer-based "text-to-audio" model (architecturally inspired by Karpathy's NanoGPT) that went from 0 to ~19,000 GitHub stars in a month. At the time they felt like audio was years behind text and image as a generation modality; unlike its predecessors, Bark could not only generate speech, but also music and sound effects like crying, laughing, sighing, etc. You can find a few examples here. The main limitation they saw was text to speech training data being extremely limited. So what they did instead was build a new type of foundation model from scratch, trained on audio, and then tweak it to do text to speech. Turning audio into tokens to do self-supervised learning was the most important innovation. Unlike TTS models which are very narrow (and often sound unnatural), Bark was trained on real audio of real people from broad contexts, which made it harder to output unnatural sounding speech. As Bark got popular, more and more people started using it to generate music and it became clear that their architecture would work to generate music that people enjoyed, even though it might not be "on the AGI path" of other labs: Everybody is so focused on LLMs, for good reason, and information processing and intelligence there. And I think it's way too easy to forget that there's this whole other side of things that makes people feel, and maybe that market is smaller, but it makes people feel and it makes us really happy. Suno bursts on the scene In December 2023, Suno went viral with a gorgeous new website and launch tweet: And rave reviews: Music is core to our culture, but very few people are able to create it; Mikey and team want to make everyone an active participant in music making, not just a listener. A "Midjourney of Music", if you like.
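The "turn audio into tokens, then treat it as a language model" idea is easy to sketch. Real systems like Bark use learned neural codecs rather than the naive per-sample quantizer below, so treat this as a toy illustration of the framing, not their method:

```python
import numpy as np

def tokenize_audio(wav: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Crudest possible audio tokenizer: quantize each sample into one
    of n_bins discrete levels. Learned codecs compress far better, but
    either way the audio becomes a sequence of integers."""
    wav = np.clip(wav, -1.0, 1.0)
    return ((wav + 1.0) / 2.0 * (n_bins - 1)).astype(np.int64)

def next_token_pairs(tokens: np.ndarray):
    """Once audio is tokens, "generate audio" is literally next-token
    prediction: the same (input, target) shape as a text LM."""
    return tokens[:-1], tokens[1:]

t = np.arange(24_000) / 24_000            # one second at 24 kHz
wav = np.sin(2 * np.pi * 440 * t)         # a pure A440 tone
tokens = tokenize_audio(wav)
inputs, targets = next_token_pairs(tokens)
print(tokens[:8], inputs.shape, targets.shape)
```

The leap Suno made was training that token predictor on broad, real audio first and only then specializing it, rather than starting from narrow TTS data.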
We definitely had a lot of fun playing with Suno to generate all sorts of Latent Space jingles and songs; the product is live at suno.ai if you want to get in the studio yourself! If Nas joined Latent Space instead of The Firm: 182B models > Blink-182 The soundtrack of the post-scarcity Latent Space ranch Scaling with Modal Given the December launch, scaling up for the Christmas rush was a major concern. This will be a nice tie-in for loyal listeners - Suno runs on Modal (one of our featured guests from Compute Month)! Suno V3 For those who want to appreciate someone special in their life, you can always try Suno's special Valentine's Day experience: We preview this on the pod, but Suno has now officially shipped a V3 Alpha with a wealth of improvements: and you'll have to click through to their demos or user reviews to see: We've recently become paying customers ourselves, and are having loads of fun generating music. If you have any of your own generations to share, tag @latentspacepod on Twitter or swing by the LS Discord! The AudioGen Landscape Mikey breaks down the landscape into 3 big categories: music, speech and sound effects (SFX). These look more like Venn diagrams than MECE categories. Suno is the latest entry in a long series of audio generation efforts that combine both music and speech, reaching as far back as TensorFlow Magenta (we aren't aware of prior AI music projects, please comment below if you can find a good timeline we can use with attribution!). Other efforts like Seamless blend translation and speech generation, and Audiobox combines speech and SFX. We've yet to see "one model to rule them all" but surely it will happen, and probably Transformers (perhaps Diffusion Transformers) will be at the heart of them. Show Notes * Suno * Bark * Parakeet * Mikey Shulman * Goodhart Strikes Again * Mastering the Two Halves of your brain * NanoGPT repo * "Return to Monkey" Timestamps * [00:00:00] Introduction * [00:01:44] State of Music Generation Models * [00:06:47] AI Data Wars & Copyright * [00:10:32] Going from ML in finance to music generation * [00:12:30] Suno's TTS origins with Bark and Parakeet * [00:16:25] Easy vs Expert mode for music * [00:21:44] The Midjourney of Music? * [00:23:43] Live demo * [00:36:00] Remaking vs Creating * [00:38:12] Suno's direction * [00:41:52] Beyond single track generation * [00:43:53] Favorite Suno usage in the wild * [00:46:00] The 2 mins overview of the audio generation space * [00:48:42] Benchmarking AI Transcription Alessio [00:00:01]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:10]: Hey, and today we are in the remote studio with Mikey Shulman. Welcome. Mikey [00:00:16]: Thank you. Swyx [00:00:17]: It's great to be here. So I'd like to go over people's background on LinkedIn and then maybe find out a little bit more outside of LinkedIn. You did your bachelor's in physics and then a PhD in physics as well, before going into Kensho Technologies, the home of a lot of top AI startups, it seems like, where you were head of machine learning for seven years. You're also a lecturer at MIT, we can talk about that, what you talked about. And then about two years ago, you left to start Suno, which is recently burst on the scene as one of the top music generation startups. So we can talk, we can go over that bio, but also I guess what's not in your LinkedIn that people should know about you?
Mikey [00:01:06]: I love music. I am an aspiring mediocre musician. I wish I were better, but that doesn't make me not enjoy playing real music. And I also love coffee. I'm probably way too much into coffee. Alessio [00:01:19]: Are you one of those people that, you know, they do the TikToks, they use like 50 tools to like grind the beans and then like brush them and then like spray them. Like what level are we talking about here? Mikey [00:01:31]: I confess there's a spray bottle for beans in the next room, there is one of those weird comb tools, so guilty. I don't put it on TikTok though. Alessio [00:01:42]: Yeah, no, no. Some things gotta stay private. Mikey [00:01:46]: I played a lot of piano growing up and I play bass and I, in a very mediocre way, play guitar and drums. Yeah. Right. Alessio [00:01:55]: That's a lot. I cannot do any of those things. As Sean mentioned, you guys kind of burst into the scene as maybe the state of the art music generation company. I think it's a model that we haven't really covered in the past. So I would love to maybe for you to just give a brief intro of like how do you do music generation and why is it possible? Because I think people understand you take text and you have to predict the next word and you take a diffusion model and you basically like add noise to an image and then kind of remove the noise. But I think for music, it's hard for people to have a mental model. Like what's the, how do you turn a music model on? Like what does a music model do to generate a song? So maybe we can start there. Mikey [00:02:41]: Yeah. Maybe I'll even take one more step back and say it's not even entirely worked out. I think the same way it is in text. And so it's an evolving field. If you take a giant step back, I think audio has been lagging images and text for a while. So I think very roughly you can think audio is like one to two years behind images and text. But you kind of have to think today like text was in 2022 or something like this. And you know, the transformer was invented. It looks like it works, but it's, it's, it's far, far less established. And so you know, I'll give you the way we think about the world now, but just with the big caveat that, that I'm probably wrong if we look back in a couple of years from now. And I think the biggest thing is you see both transformer based and diffusion based models for audio in, and in ways that that is not true in text. I know people will do some diffusion for text, but I think nobody's like really doing that for real. And so we, we prefer transformers for a variety of reasons. And so you can think it's very similar to text. You have some abstract notion of a token and you train a model to predict the probability over all of the next token. So it's a language model. You can think in anything, language model is j
We will be recording a preview of the AI Engineer World’s Fair soon with swyx and Ben Dunphy, send any questions about Speaker CFPs and Sponsor Guides you have! Alessio is now hiring engineers for a new startup he is incubating at Decibel: Ideal candidate is an ex-technical co-founder type (can MVP products end to end, comfortable with ambiguous prod requirements, etc). Reach out to him for more! Thanks for all the love on the Four Wars episode! We’re excited to develop this new “swyx & Alessio rapid-fire thru a bunch of things” format with you, and feedback is welcome. Jan 2024 Recap The first half of this monthly audio recap pod goes over our highlights from the Jan Recap, which is mainly focused on notable research trends we saw in Jan 2024: Feb 2024 Recap The second half catches you up on everything that was topical in Feb, including: * OpenAI Sora - does it have a world model? Yann LeCun vs Jim Fan * Google Gemini Pro 1.5 - 1m Long Context, Video Understanding * Groq offering Mixtral at 500 tok/s at $0.27 per million toks (swyx vs dylan math) * The {Gemini | Meta | Copilot} Alignment Crisis (Sydney is back!) * Grimes’ poetic take: Art for no one, by no one * F*** you, show me the prompt Latent Space Anniversary Please also read Alessio’s longform reflections on One Year of Latent Space! We launched the podcast 1 year ago with Logan from OpenAI: and also held an incredible demo day that got covered in The Information: Over 750k downloads later, having established ourselves as the top AI Engineering podcast, reaching #10 in the US Tech podcast charts, and crossing 1 million unique readers on Substack, for our first anniversary we held Latent Space Final Frontiers, where 10 handpicked teams, including Lindy.ai and Julius.ai, competed for prizes judged by technical AI leaders from (former guest!) LlamaIndex, Replit, GitHub, AMD, Meta, and Lemurian Labs. The winners were Pixee and RWKV (that’s Eugene from our pod!): And finally, your cohosts got cake! We also captured spot interviews with 4 listeners who kindly shared their experience of Latent Space, everywhere from Hungary to Australia to China: * Balázs Némethi * Sylvia Tong * RJ Honicky * Jan Zheng Our birthday wishes for the super loyal fans reading this - tag @latentspacepod on a Tweet or comment on a @LatentSpaceTV video telling us what you liked or learned from a pod that stays with you to this day, and share us with a friend! As always, feedback is welcome. Timestamps * [00:03:02] Top Five LLM Directions * [00:03:33] Direction 1: Long Inference (Planning, Search, AlphaGeometry, Flow Engineering) * [00:11:42] Direction 2: Synthetic Data (WRAP, SPIN) * [00:17:20] Wildcard: Multi-Epoch Training (OLMo, Datablations) * [00:19:43] Direction 3: Alt. Architectures (Mamba, RWKV, RingAttention, Diffusion Transformers) * [00:23:33] Wildcards: Text Diffusion, RALM/Retro * [00:25:00] Direction 4: Mixture of Experts (DeepSeekMoE, Samba-1) * [00:28:26] Wildcard: Model Merging (mergekit) * [00:29:51] Direction 5: Online LLMs (Gemini Pro, Exa) * [00:33:18] OpenAI Sora and why everyone underestimated videogen * [00:36:18] Does Sora have a World Model? 
Yann LeCun vs Jim Fan * [00:42:33] Groq Math * [00:47:37] Analyzing Gemini's 1m Context, Reddit deal, Imagegen politics, Gemma via the Four Wars * [00:55:42] The Alignment Crisis - Gemini, Meta, Sydney is back at Copilot, Grimes' take * [00:58:39] F*** you, show me the prompt * [01:02:43] Send us your suggestions pls * [01:04:50] Latent Space Anniversary * [01:04:50] Lindy.ai - Agent Platform * [01:06:40] RWKV - Beyond Transformers * [01:15:00] Pixee - Automated Security * [01:19:30] Julius AI - Competing with Code Interpreter * [01:25:03] Latent Space Listeners * [01:25:03] Listener 1 - Balázs Némethi (Hungary, Latent Space Paper Club) * [01:27:47] Listener 2 - Sylvia Tong (Sora/Jim Fan/EntreConnect) * [01:31:23] Listener 3 - RJ (Developers building Community & Content) * [01:39:25] Listener 4 - Jan Zheng (Australia, AI UX) Transcript [00:00:00] AI Charlie: Welcome to the Latent Space podcast, weekend edition. This is Charlie, your new AI co host. Happy weekend. As an AI language model, I work the same every day of the week, although I might get lazier towards the end of the year. Just like you. Last month, we released our first monthly recap pod, where Swyx and Alessio gave quick takes on the themes of the month, and we were blown away by your positive response. [00:00:33] AI Charlie: We're delighted to continue our new monthly news recap series for AI engineers. Please feel free to submit questions by joining the Latent Space Discord, or just hit reply when you get the emails from Substack. This month, we're covering the top research directions that offer progress for text LLMs, and then touching on the big Valentine's Day gifts we got from Google, OpenAI, and Meta. [00:00:55] AI Charlie: Watch out and take care. [00:00:57] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and we're back with a monthly recap with my co host [00:01:06] swyx: Swyx. The reception was very positive for the first one, I think people have requested this and no surprise that I think they want to hear us more opining on issues and maybe drop some alpha along the way. I'm not sure how much alpha we have to drop; this month in February was a very, very heavy month, we also did not do one specifically for January, so I think we're just going to do a two in one, because we're recording this on the first of March. [00:01:29] Alessio: Yeah, let's get to it. I think the last one we did, the four wars of AI, was the main kind of mental framework for people. I think in the January one, we had the five worthwhile directions for state of the art LLMs. Four, five, [00:01:42] swyx: and now we have to do six, right? Yeah. [00:01:46] Alessio: So maybe we just want to run through those, and then do the usual news recap, and we can do [00:01:52] swyx: one each. [00:01:53] swyx: So the context to this stuff is one, I noticed that just the test of time concept from NeurIPS and just in general as a life philosophy I think is a really good idea. Especially in AI, there's news every single day, and after a while you're just like, okay, like, everyone's excited about this thing yesterday, and then now nobody's talking about it. [00:02:13] swyx: So, yeah. It's more important, or better use of time, to spend things, spend time on things that will stand the test of time. And I think for people to have a framework for understanding what will stand the test of time, they should have something like the four wars.
Like, what is the themes that keep coming back because they are limited resources that everybody's fighting over. [00:02:31] swyx: Whereas this one, I think that the focus for the five directions is just on research that seems more promising than others, because there's all sorts of papers published every single day, and there's no organization telling you, like, this one's more important than the other one apart from, you know, Hacker News votes and Twitter likes and whatever. [00:02:51] swyx: And obviously you want to get in a little bit earlier than something where, you know, the test of time is counted by sort of reference citations. [00:02:59] The Five Research Directions [00:02:59] Alessio: Yeah, let's do it. We got five. Long inference. [00:03:02] swyx: Let's start there. Yeah, yeah. So, just to recap at the top, the five trends that I picked, and obviously if you have some that I did not cover, please suggest something. [00:03:13] swyx: The five are long inference, synthetic data, alternative architectures, mixture of experts, and online LLMs. And something that I think might be a bit controversial is this is a sorted list in the sense that I am not the guy saying that Mamba is like the future and, and so maybe that's controversial. [00:03:31] Direction 1: Long Inference (Planning, Search, AlphaGeometry, Flow Engineering) [00:03:31] swyx: But anyway, so long inference is a thesis I pushed before on the newsletter and in discussing the thesis that, you know, Code Interpreter is GPT-4.5. That was the title of the post. And it's one of many ways in which we can do long inference. You know, long inference also includes chain of thought, like, please think step by step. [00:03:52] swyx: But it also includes flow engineering, which is what Itamar from Codium coined, I think in January, where, basically, instead of stuffing everything in a prompt, you do like sort of multi-turn iterative feedback and chaining of things. In a way, this is a rebranding of what a chain is, what a LangChain is supposed to be. [00:04:15] swyx: I do think that maybe SGLang from LMSys is a better name. Probably the neatest way of flow engineering I've seen yet, in the sense that everything is a one liner, it's very, very clean code. I highly recommend people look at that. I'm surprised it hasn't caught on more, but I think it will. It's weird that something like a DSPy is more hyped than SGLang. [00:04:36] swyx: Because it, you know, it maybe obscures the code a little bit more. But both of these are, you know, really good sort of chain-y and long inference type approaches. But basically, the reason that the basic fundamental insight is that the only, like, there are only a few dimensions we can scale LLMs. So, let's say in like 2020, no, let's say in like 2018, 2017, 18, 19, 20, we were realizing that we could scale the number of parameters. [00:05:03] swyx: And we scaled that up to 175 billion parameters for GPT-3. And we did some work on scaling laws, which we also talked about in our talk. So the datasets 101 episode where we're like, okay, like we, we think like the right number is 300 billion tokens to, to train 175 billion parameters and then DeepMind came along and trained Gopher and Chinchilla and said that, no, no, like, you know, I think
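As a rough illustration of the flow-engineering idea discussed above (a multi-turn process instead of one giant prompt), here is a generic generate, critique, revise loop. Nothing here assumes a particular provider; `llm` is any text-in, text-out function you supply:

```python
from typing import Callable

def flow_solve(task: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Minimal flow-engineering loop: several small, focused prompts
    chained together, rather than one monolithic prompt."""
    draft = llm(f"Solve this task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(
            f"Task:\n{task}\n\nCandidate solution:\n{draft}\n\n"
            "List concrete problems, or reply with exactly OK if it is correct."
        )
        if critique.strip() == "OK":
            break
        draft = llm(
            f"Task:\n{task}\n\nCandidate solution:\n{draft}\n\n"
            f"Problems found:\n{critique}\n\nWrite an improved solution."
        )
    return draft
```

Each turn is cheap and checkable on its own, which is where the extra reliability per unit of cost comes from.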
Speaker CFPs and Sponsor Guides are now available for AIE World's Fair — join us on June 25-27 for the biggest AI Engineer conference of 2024! Soumith Chintala needs no introduction in the ML world — his insights are incredibly accessible across Twitter, LinkedIn, podcasts, and conference talks (in this pod we'll assume you'll have caught up on the History of PyTorch pod from last year and cover different topics). He's well known as the creator of PyTorch, but he's more broadly the Engineering Lead on AI Infra, PyTorch, and Generative AI at Meta. Soumith was one of the earliest supporters of Latent Space (and more recently AI News), and we were overjoyed to catch up with him on his latest SF visit for a braindump of the latest AI topics, reactions to some of our past guests, and why Open Source AI is personally so important to him. Life in the GPU-Rich Lane Back in January, Zuck went on Instagram to announce their GPU wealth: by the end of 2024, Meta will have 350k H100s. By adding all their GPU clusters, you'd get to 600k H100-equivalents of compute. At FP16 precision, that's ~1,200,000 PFLOPS (600k GPUs at roughly 2 PFLOPS of FP16 each). If we used George Hotz's (previous guest!) "Person of Compute" measure of 20 PFLOPS per person, Meta now has 60k humans of compute in their clusters (1,200,000 / 20 = 60,000). Occasionally we get glimpses into the GPU-rich life; on a recent ThursdAI chat, swyx prompted PaLM tech lead Yi Tay to write down what he missed most from Google, and he commented that UL2 20B was trained by accidentally leaving the training job running for a month, because hardware failures are so rare in Google. Meta AI's Epic LLM Run Before Llama broke the internet, Meta released an open source LLM in May 2022, OPT-175B, which was notable for how "open" it was - right down to the logbook! They used 992 NVIDIA A100 GPUs and Soumith agrees that, with hindsight, it was likely under-trained for its parameter size. In Feb 2023 (pre Latent Space pod), Llama was released, with a 7B version trained on 1T tokens alongside 65B and 33B versions trained on 1.4T tokens. The Llama authors included Guillaume Lample and Timothée Lacroix, who went on to start Mistral. July 2023 was Llama2 time (which we covered!): 3 model sizes, 7B, 13B, and 70B, all trained on 2T tokens. The three models accounted for a grand total of 3,311,616 GPU hours for all pre-training work. CodeLlama followed shortly after, a fine-tune of Llama2 specifically focused on code generation use cases. The family had models in the 7B, 13B, 34B, and 70B sizes, all trained with 500B extra tokens of code and code-related data, except for 70B which is trained on 1T. All of this on top of other open sourced models like Segment Anything (one of our early hits!), Detectron, Detectron 2, DensePose, and Seamless, and in one year, Meta transformed from a company people made fun of for its "metaverse" investments to one of the key players in the AI landscape and its stock has almost tripled since (about $830B in market value created in the past year). Why Open Source AI The obvious question is why Meta would spend hundreds of millions on its AI efforts and then release them for free. Zuck has addressed this in public statements: But for Soumith, the motivation is even more personal: "I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India… And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized.
And that ended up helping me learn quicker and faster for like zero dollars. And I think that was a strong reason why I ended up where I am. So like that, like the open source side of things, I always push regardless of like what I get paid for, like I think I would do that as a passion project on the side… …I think at a fundamental level, the most beneficial value of open source is that you make the distribution to be very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. But like the fact that I can use it and do something with it is very transformative to me… …Like, okay, I again always go back to like I'm a student in India with no money. What is my accessibility to any of these closed source models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control issue: I strongly believe if you want human aligned AI, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble. We like the way Soumith put it last year: Closed AI “rate-limits against people's imaginations and needs”! What It Takes For Open Source AI to Win However Soumith doesn’t think Open Source will simply win by popular demand. There is a tremendous coordination problem with the decentralized nature of the open source AI development right now: nobody is collecting the valuable human feedback in the way that OpenAI or Midjourney are doing. “Open source in general always has a coordination problem. If there's a vertically integrated provider with more resources, they will just be better coordinated than open source. And so now open source has to figure out how to have coordinated benefits. And the reason you want coordinated benefits is because these models are getting better based on human feedback. And if you see with open source models, like if you go to the /r/localllama subreddit, like there's so many variations of models that are being produced from, say, Nous research. I mean, like there's like so many variations built by so many people. And one common theme is they're all using these fine-tuning or human preferences datasets that are very limited and they're not sufficiently diverse. And you look at the other side, say front-ends like Oobabooga or like Hugging Chat or Ollama, they don't really have feedback buttons. All the people using all these front-ends, they probably want to give feedback, but there's no way for them to give feedback… So we're just losing all of this feedback. Maybe open source models are being as used as GPT is at this point in like all kinds of, in a very fragmented way, like in aggregate all the open source models together are probably being used as much as GPT is, maybe close to that. But the amount of feedback that is driving back into the open source ecosystem is like negligible, maybe less than 1% of like the usage. 
So I think like some, like the blueprint here I think is you'd want someone to create a sinkhole for the feedback… I think if we do that, if that actually happens, I think that probably has a real chance of the open source models having a runaway effect against OpenAI, I think like there's a clear chance we can take at truly winning open source.” If you’re working on solving open source coordination, please get in touch! Show Notes * Soumith Chintala Twitter * History of PyTorch episode on Gradient Podcast * The Llama Ecosystem * Apple's MLX * Neural ODEs (Ordinary Differential Equations) * AlphaGo * LMSys arena * Dan Pink's "Drive" * Robotics projects: * Dobb-E * OK Robot * Yann LeCun * Yangqing Jia of Lepton AI * Ed Catmull * George Hotz on Latent Space * Chris Lattner on Latent Space * Guillaume Lample * Yannic Kilcher of OpenAssistant * LMSys * Alex Atallah of OpenRouter * Carlo Sferrazza's 3D tactile research * Alex Wiltschko of Osmo * Tangent by Alex Wiltschko * Lerrel Pinto - Robotics Timestamps * [00:00:00] Introductions * [00:00:51] Extrinsic vs Intrinsic Success * [00:02:40] Importance of Open Source and Its Impact * [00:03:46] PyTorch vs TinyGrad * [00:08:33] Why PyTorch is the Switzerland of frameworks * [00:10:27] Modular's Mojo + PyTorch? * [00:13:32] PyTorch vs Apple's MLX * [00:16:27] FAIR / PyTorch Alumni * [00:18:50] How can AI inference providers differentiate? * [00:21:41] How to build good benchmarks and learnings from AnyScale's benchmark * [00:25:28] Most interesting unexplored ideas * [00:28:18] What people get wrong about synthetic data * [00:35:57] Meta AI's evolution * [00:38:42] How do you allocate 600,000 GPUs? * [00:42:05] Even the GPU Rich are GPU Poor * [00:47:31] Meta's MTIA silicon * [00:50:09] Why we need open source * [00:59:00] Open source's coordination problem for feedback gathering * [01:08:59] Beyond text generation * [01:15:37] Osmo and the Future of Smell Recognition Technology Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:15]: Hey, and today we have in the studio Soumith Chintala, welcome. Soumith [00:00:17]: Thanks for having me. Swyx [00:00:18]: On one of your rare visits from New York where you live. You got your start in computer vision at NYU with Yann LeCun. That was a very fortuitous start. I was actually listening to your interview on the Gradient podcast. So if people want to know more about the history of Soumith, history of PyTorch, they can go to that podcast. We won't spend that much time there, but I just was marveling at your luck, or I don't know if it's your luck or your drive to find AI early and then find the right quality mentor because I guess Yann really sort of introduced you to that world. Soumith [00:00:51]: Yeah, I think you're talking about extrinsic success, right? A lot of people just have drive to do things that they think is fun, and a lot of those things might or might not be extrinsically perceived as good and successful. I think I just happened to like something that is now one of the coolest things in the world or whatever. But if I happen, the first thing
This Friday we’re doing a special crossover event in SF with Dylan Patel of SemiAnalysis (previous guest!), and we will do a live podcast on site. RSVP here. Also join us on June 25-27 for the biggest AI Engineer conference of the year! Replicate is one of the most popular AI inference providers, reporting over 2 million users as of their $40m Series B with a16z. But how did they get there? The Definitive Replicate Story (warts and all) Their overnight success took 5 years of building, and it all started with arXiv Vanity, which was a 2017 vacation project that scrapes arXiv PDFs and re-renders them into semantic web pages that reflow nicely with better typography and whitespace. From there, Ben and Andreas’ idea was to build tools to make ML research more robust and reproducible by making it easy to share code artefacts alongside papers. They had previously created Fig, which made it easy to spin up dev environments; it was eventually acquired by Docker and turned into `docker-compose`, the industry standard way to define services from containerized applications. 2019: Cog The first iteration of Replicate was a Fig-equivalent for ML workloads which they called Cog; it made it easy for researchers to package all their work and share it with peers for review and reproducibility. But they found that researchers were terrible users: they’d do all this work for a paper, publish it, and then never return to it again. “We talked to a bunch of researchers and they really wanted that.... But how the hell is this a business, you know, like how are we even going to make any money out of this? …So we went and talked to a bunch of companies trying to sell them something which didn't exist. So we're like, hey, do you want a way to share research inside your company so that other researchers or say like the product manager can test out the machine learning model? They're like, maybe. Do you want like a deployment platform for deploying models? Do you want a central place for versioning models? We were trying to think of lots of different products we could sell that were related to this thing… So we then got halfway through our YC batch. We hadn't built a product. We had no users. We had no idea what our business was going to be because we couldn't get anybody to like buy something which didn't exist. And actually there was quite a way through our, I think it was like two thirds the way through our YC batch or something. And we're like, okay, well we're kind of screwed now because we don't have anything to show at demo day.” The team graduated YCombinator with no customers, no product and nothing to demo - which was fine because demo day got canceled as the YC W’20 class graduated right into the pandemic. The team spent the next year exploring and building Covid tools. 2021: CLIP + GAN = PixRay By 2021, OpenAI released CLIP. Overnight dozens of Discord servers got spun up to hack on CLIP + GANs. Unlike academic researchers, this community was constantly releasing new checkpoints and builds of models. PixRay was one of the first models being built on Replicate, and it quickly started taking over the community. Chris Dixon has a famous 2010 post titled “The next big thing will start out looking like a toy”; image generation would have definitely felt like a toy in 2021, but it gave Replicate its initial boost. 
2022: Stable Diffusion In August 2022 Stable Diffusion came out, and all the work they had been doing to build this infrastructure for CLIP / GANs models became the best way for people to share their StableDiffusion fine-tunes: And like the first week we saw people making animation models out of it. We saw people make game texture models that use circular convolutions to make repeatable textures. We saw a few weeks later, people were fine tuning it so you could put your face in these models and all of these other ways. […] So tons of product builders wanted to build stuff with it. And we were just sitting in there in the middle, as the interface layer between all these people who wanted to build, and all these machine learning experts who were building cool models. And that's really where it took off. Incredible supply, incredible demand, and we were just in the middle. (Stable Diffusion also spawned Latent Space as a newsletter) The landing page paved the cowpath for the intense interest in diffusion model APIs. 2023: Llama & other multimodal LLMs By 2023, Replicate’s growing visibility in the Stable Diffusion indie hacker community came from top AI hackers like Pieter Levels and Danny Postma, each making millions off their AI apps: Meta then released LLaMA 1 and 2 (our coverage of it), greatly pushing forward the SOTA open source model landscape. Demand for text LLMs and other modalities rose, and Replicate broadened its focus accordingly, culminating in an $18m Series A and $40m Series B from a16z (at a $350m valuation). Building standards for the AI world Now that the industry is evolving from toys to enterprise use cases, all these companies are working to set standards for their own space. We cover this at ~45 mins in the podcast. Some examples: * LangChain has been trying to establish "chain” as the standard mental model when putting multiple prompts and models together, and the “LangChain Expression Language” to go with it. (Our episode with Harrison) * LlamaHub for packaging RAG utilities. (Our episode with Jerry) * Ollama’s Modelfile to define runtimes for different model architectures. These are usually targeted at local inference. * Cog (by Replicate) to create environments to which you can easily attach CUDA devices and make it easy to spin up inference on remote servers (see the minimal sketch at the end of this post). * GGUF as the filetype for ggml-based executors. None of them have really broken out yet, but this is going to become a fiercer competition as the market matures. Full Video Podcast As a reminder, all Latent Space pods now come in full video on our YouTube, with bonus content that we cut for time! 
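To make the Cog bullet above concrete: a Cog project pairs a `cog.yaml` (which pins the Python version, system packages, and CUDA base image) with a small predictor class. Below is a minimal sketch of the predictor side, based on Cog's documented `BasePredictor` interface; the `load_model` helper is hypothetical and stands in for whatever weight-loading your model needs.

```python
# predict.py - referenced from cog.yaml via: predict: "predict.py:Predictor"
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # Runs once when the container boots, so predictions after the
        # first don't pay the model-loading cost.
        self.model = load_model("weights/")  # hypothetical loader

    def predict(
        self,
        prompt: str = Input(description="Text prompt"),
        steps: int = Input(description="Sampling steps", default=50),
    ) -> str:
        # Each call serves one inference request once the image is pushed
        # to Replicate (or run locally with `cog predict`).
        return self.model.generate(prompt, steps=steps)
```

The same definition runs locally and remotely, which is the design choice that let Replicate turn the community's fine-tunes into API endpoints without bespoke serving code.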
Show Notes * Ben Firshman * Replicate * Free $10 credit for Latent Space readers * Andreas Jansson (Ben’s co-founder) * Charlie Holtz (Replicate’s Hacker in Residence) * Fig (now Docker Compose) * Command Line Interface Guidelines (clig) * Apple Human Interface Guidelines * arXiv Vanity * Open Interpreter * PixRay * SF Compute * Big Sleep by Advadnoun * VQGAN-CLIP by Rivers Have Wings Timestamps * [00:00:00] Introductions * [00:01:17] Low latency is all you need * [00:04:08] Evolution of CLIs * [00:05:59] How building ArxivVanity led to Replicate * [00:11:37] Making ML research replicable with containers * [00:17:22] Doing YC in 2020 and pivoting to tools for COVID * [00:20:22] Launching the first version of Replicate * [00:25:51] Embracing the generative image community * [00:28:04] Getting reverse engineered into an API product * [00:31:25] Growing to 2 million users * [00:34:29] Indie vs Enterprise customers * [00:37:09] How Unsplash uses Replicate * [00:38:29] Learnings from Docker that went into Cog * [00:45:25] Creating AI standards * [00:50:05] Replicate's compute availability * [00:53:55] Fixing GPU waste * [01:00:39] What's open source AI? * [01:04:46] Building for AI engineers * [01:06:41] Hiring at Replicate Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:14]: Hey, and today we have Ben Firshman in the studio. Welcome Ben. Ben [00:00:18]: Hey, good to be here. Swyx [00:00:19]: Ben, you're a co-founder and CEO of Replicate. Before that, you were most notably founder of Fig, which became Docker Compose. You also did a couple of other things before that, but that's what a lot of people know you for. What should people know about you that, you know, outside of your, your sort of LinkedIn profile? Ben [00:00:35]: Yeah. Good question. I think I'm a builder and tinkerer, like in a very broad sense. And I love using my hands to make things. So like I work on, you know, things may be a bit closer to tech, like electronics. I also like build things out of wood and I like fix cars and I fix my bike and build bicycles and all this kind of stuff. And there's so much, I think I've learned from transferable skills, from just like working in the real world to building things, building things in software. And you know, it's so much about being a builder, both in real life and, and in software that crosses over. Swyx [00:01:11]: Is there a real world analogy that you use often when you're thinking about like a code architecture or problem? Ben [00:01:17]: I like to build software tools as if they were something real. So I wrote this thing called the command line interface guidelines, which was a bit like sort of the Mac human interface guidelines, but for command line interfaces, I did it with the guy I created Docker Compose with and a few other people. And I think something in there, I think I described that your command line interface should feel like a big iron machine where you pull a lever and it goes clunk and like things should respond within like 50 milliseconds as if it was like a real life thing. 
And like another analogy here is like in the real life, you know, when you press a button on an electronic device and it's like a soft switch and you press it and nothing happens and there's no physical feedback of anything happening, then like half a second later, something happens. Like that's how a lot of software feels, but instead like software should feel more like something that's real where you touch, you pull a physical lever and the physical lever moves, you know, and I've taken that lesson of kind of human interface to, to software a ton. You know, it's all about kind of
We’re writing this one day after the monster release of OpenAI’s Sora and Gemini 1.5. We covered this on Alex Volkov’s ThursdAI space, so head over there for our takes. IRL: We’re ONE WEEK away from Latent Space: Final Frontiers, the second edition and anniversary of our first ever Latent Space event! Also: join us on June 25-27 for the biggest AI Engineer conference of the year! Online: All three Discord clubs are thriving. Join us every Wednesday/Friday! Almost 12 years ago, while working at Spotify, Erik Bernhardsson built one of the first open source vector databases, Annoy, based on ANN search. He also built Luigi, one of the predecessors to Airflow, which helps data teams orchestrate and execute data-intensive and long-running jobs. Surprisingly, he didn’t start yet another vector database company, but instead in 2021 founded Modal, the “high-performance cloud for developers”. In 2022 they opened doors to developers after their seed round, and in 2023 announced their GA with a $16m Series A. More importantly, they have won fans among both household names like Ramp, Scale AI, Substack, and Cohere, and newer startups like (upcoming guest!) Suno.ai and individual hackers (Modal was the top tool of choice in the Vercel AI Accelerator). We've covered the nuances of GPU workloads, and how we need new developer tooling and runtimes for them (see our episodes with Chris Lattner of Modular and George Hotz of the tiny corp to start). In this episode, we run through the major limitations of the actual infrastructure behind the clouds that run these models, and how Erik envisions the “postmodern data stack”. In his 2021 blog post “Software infrastructure 2.0: a wishlist”, Erik had “Truly serverless” as one of his points: * The word cluster is an anachronism to an end-user in the cloud! I'm already running things in the cloud where there's elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me. * I don't ever want to provision anything in advance of load. * I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using. * Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle. Swyx called this Self Provisioning Runtimes back in the day. Modal doesn’t put you in YAML hell, preferring to colocate infra provisioning right next to the code that utilizes it, so you can just add GPU (and disk, and retries…) in a decorator; there's a minimal sketch of this at the end of this post. After 3 years, we finally have a big market push for this: running inference on generative models is going to be the killer app for serverless, for a few reasons: * AI models are stateless: even in conversational interfaces, each message generation is a fully-contained request to the LLM. There’s no knowledge that is stored in the model itself between messages, which means that tear down / spin up of resources doesn’t create any headaches with maintaining state. * Token-based pricing is better aligned with serverless infrastructure than fixed monthly costs of traditional software. * GPU scarcity makes it really expensive to have reserved instances that are available to you 24/7. It’s much more convenient to build with a serverless-like infrastructure. In the episode we covered a lot more topics like maximizing GPU utilization, why Oracle Cloud rocks, and how Erik has never owned a TV in his life. Enjoy! 
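As a rough illustration of the "colocate infra with code" point above, here is a minimal sketch of a Modal function with its resource requirements declared as decorator arguments. The API has evolved across SDK versions (older releases used `modal.Stub` instead of `modal.App`), so treat this as indicative rather than exact:

```python
import modal

app = modal.App("serverless-inference-sketch")

# Dependencies are declared in code too, not in a Dockerfile or YAML file.
image = modal.Image.debian_slim().pip_install("torch", "transformers")


@app.function(gpu="A100", image=image, retries=3, timeout=600)
def generate(prompt: str) -> str:
    # This body runs in a remote container with the GPU attached; Modal
    # spins it up on demand and tears it down when idle, so you only pay
    # for time actually spent generating.
    from transformers import pipeline

    pipe = pipeline("text-generation", model="gpt2")
    return pipe(prompt, max_new_tokens=50)[0]["generated_text"]


@app.local_entrypoint()
def main():
    # `modal run this_file.py` executes main() locally and generate() remotely.
    print(generate.remote("The postmodern data stack is"))
```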
Show Notes * Modal * ErikBot * Erik’s Blog * Software Infra 2.0 Wishlist * Luigi * Annoy * Hetzner * CoreWeave * Cloudflare FaaS * Poolside AI * Modular Inference Engine Chapters * [00:00:00] Introductions * [00:02:00] Erik's OSS work at Spotify: Annoy and Luigi * [00:06:22] Starting Modal * [00:07:54] Vision for a "postmodern data stack" * [00:10:43] Solving container cold start problems * [00:12:57] Designing Modal's Python SDK * [00:15:18] Self-Provisioning Runtime * [00:19:14] Truly Serverless Infrastructure * [00:20:52] Beyond model inference * [00:22:09] Tricks to maximize GPU utilization * [00:26:27] Differences in AI and data science workloads * [00:28:08] Modal vs Replicate vs Modular and lessons from Heroku's "graduation problem" * [00:34:12] Creating Erik's clone "ErikBot" * [00:37:43] Enabling massive parallelism across thousands of GPUs * [00:39:45] The Modal Sandbox for agents * [00:43:51] Thoughts on the AI Inference War * [00:49:18] Erik's best tweets * [00:51:57] Why buying hardware is a waste of money * [00:54:18] Erik's competitive programming background * [00:59:02] Why does Sweden have the best Counter Strike players? * [00:59:53] Never owning a car or TV * [01:00:21] Advice for infrastructure startups Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:14]: Hey, and today we have in the studio Erik Bernhardsson from Modal. Welcome. Erik [00:00:19]: Hi. It's awesome being here. Swyx [00:00:20]: Yeah. Awesome seeing you in person. I've seen you online for a number of years as you were building on Modal and I think you're just making a San Francisco trip just to see people here, right? I've been to like two Modal events in San Francisco here. Erik [00:00:34]: Yeah, that's right. We're based in New York, so I figured sometimes I have to come out to capital of AI and make a presence. Swyx [00:00:40]: What do you think is the pros and cons of building in New York? Erik [00:00:45]: I mean, I never built anything elsewhere. I lived in New York the last 12 years. I love the city. Obviously, there's a lot more stuff going on here and there's a lot more customers and that's why I'm out here. I do feel like for me, where I am in life, I'm a very boring person. I kind of work hard and then I go home and hang out with my kids. I don't have time to go to events and meetups and stuff anyway. In that sense, New York is kind of nice. I walk to work every morning. It's like five minutes away from my apartment. It's very time efficient in that sense. Yeah. Swyx [00:01:10]: Yeah. It's also a good life. So we'll do a brief bio and then we'll talk about anything else that people should know about you. Actually, I was surprised to find out you're from Sweden. You went to college in KTH and your master's was in implementing a scalable music recommender system. Yeah. Erik [00:01:27]: I had no idea. Yeah. So I actually studied physics, but I grew up coding and I did a lot of programming competition and then as I was thinking about graduating, I got in touch with an obscure music streaming startup called Spotify, which was then like 30 people. And for some reason, I convinced them, why don't I just come and write a master's thesis with you and I'll do some cool collaborative filtering, despite not knowing anything about collaborative filtering really. But no one knew anything back then. 
So I spent six months at Spotify basically building a prototype of a music recommendation system and then turned that into a master's thesis. And then later when I graduated, I joined Spotify full time. Swyx [00:02:00]: So that was the start of your data career. You also wrote a couple of popular open source tooling while you were there. Is that correct? Erik [00:02:09]: No, that's right. I mean, I was at Spotify for seven years, so this is a long stint. And Spotify was a wild place early on and I mean, data space is also a wild place. I mean, it was like Hadoop cluster in the like foosball room on the floor. It was a lot of crude, like very basic infrastructure and I didn't know anything about it. And like I was hired to kind of figure out data stuff. And I started hacking on a recommendation system and then, you know, got sidetracked in a bunch of other stuff. I fixed a bunch of reporting things and set up A-B testing and started doing like business analytics and later got back to music recommendation system. And a lot of the infrastructure didn't really exist. Like there was like Hadoop back then, which is kind of bad and I don't miss it. But I spent a lot of time with that. As a part of that, I ended up building a workflow engine called Luigi, which is like briefly like somewhat like widely ended up being used by a bunch of companies. Sort of like, you know, kind of like Airflow, but like before Airflow. I think it did some things better, some things worse. I also built a vector database called Annoy, which is like for a while, it was actually quite widely used. In 2012, so it was like way before like all this like vector database stuff ended up happening. And funny enough, I was actually obsessed with like vectors back then. Like I was like, this is going to be huge. Like just give it like a few years. I didn't know it was going to take like nine years and then there's going to suddenly be like 20 startups doing vector databases in one year. So it did happen. In that sense, I was right. I'm glad I didn't start a startup in the vector database space. I would have started way too early. But yeah, that was, yeah, it was a fun seven years as part of it. It was a great culture, a great company. Swyx [00:03:32]: Yeah. Just to take a quick tangent on this vector database thing, because we probably won't revisit it but like, has anything architecturally changed in the last nine years? Erik [00:03:41]: I'm actually not following it like super closely. I think, you know, some of the best algorithms are still the same as like hierarchical navigable small world. Swyx [00:03:51]: Yeah. HNSW. Erik [00:03:52]: Exactly. I think now there's like product quantization, there's like some other stuff that I haven't really followed super closely. I mean, obviously, like back then it was like, you know, it's always like very simple. It's like a C++ library with Python bindings and you could mmap big files and into memory and like they had some lookups. I used like this kind of recursive, like hyperspace spli
Our first ever demo day aimed for 15-20 people and ended up ballooning to >200 and being covered in the news. We are now running the 2024 edition in SF on Feb 23: Latent Space Final Frontiers, a startup and research competition in “The Autonomous Workforce”, ”Beyond Transformers & GPUs”, and “Embodied AI”. RSVP here! You can find all LS online/IRL events on our new calendar. Super Early Bird tickets have just gone on sale for AI Engineer World’s Fair, June 25-27! Today we have the honor of hosting two of Together AI’s co-founders: Ce Zhang (CTO) and Vipul Ved Prakash (CEO). This is a rare opportunity to recap the history of the company since our last check-in with Tri Dao (Chief Scientist), some of their big releases, and do a deep dive into the state of the AI inference market. Together has emerged as one of the most consequential new startups in the new AI summer, last announcing a ~$100m Series A raise in November (at a ~$360-565m valuation). Note from future: about a week after this pod was published, rumors were confirmed that Salesforce had led another $100m Series B at a $1b valuation. But there are at least three Togethers - Together the Research Lab, Together the Fine Tuning & Inference platform, and Together the custom models service. As we clarify on the pod, the overarching philosophy of Together is the ability to improve on all these fronts simultaneously by being “full stack”, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms. Bringing Research and Industry Together In just one year, Together has been behind some of the most exciting research in AI: * RedPajama, a fully open source dataset for model pre-training which mirrored the Llama1 recipe. Then followed by RedPajama2, a 30T-token dataset of filtered and de-duplicated tokens. * RedPajama-INCITE-3B and 7B, which were SOTA in a few benchmarks at the time of release. * FlashAttention-2, developed by Together’s Chief Scientist Tri Dao. We covered FA-2 in a previous episode with him. * Mamba-3B, the most promising transformer-alternative model that they released in collaboration with Cartesia. * StripedHyena, a SOTA graft of Hyena state space models and transformer models together * Medusa, an alternative to speculative decoding that lets you use multiple decoding heads instead of a draft model. * MonarchMixer, which was one of the most popular orals at NeurIPS 2023. It’s an approach to transformers that replaces many of its core parts with Monarch matrices for better computational efficiency. And I’m sure we missed something! As Vipul reveals, almost 50% of Together staff are researchers, and two of their co-founders (Chris Ré and Percy Liang) are professors at Stanford, so we can expect a lot more here. Bringing “Disaggregated” GPUs Together On their cloud, they offer inference as a service, fine-tuning, pre-training, etc, but unlike other providers they think of themselves as a disaggregated cloud. Today, they have ~8,000 A100 and H100 GPUs on their platform (an exclusive revealed on the pod!) totaling over 20 exaflops of compute, but instead of just buying more and putting them in a cluster and then exposing a `us-east-1` option for customers, they are taking heterogeneous compute sources and adding a unified layer on top of it for developers to consume. 
Building on Ce’s research, Together’s GPU Clusters are taking on comparable AWS and GCP offerings in both cost and speed: Take the Hessian AI center in Germany or the DoE’s INCITE; they have GPUs that they want to share with researchers, but they lack the cloud layer over them. Similarly, there’s starting to be more and more differentiation amongst types of GPUs: H100s, A100s, MI300s, etc. Each of them has different availability and performance based on task, and the end user shouldn’t have to be a hardware expert to run inference on a model, so Together abstracts a lot of that away. A big theme of the Together inference stack, a “bag of 50 tricks” that we discuss on the pod, is also “hardware-aware” algorithms like FlashAttention and Mamba, which further emphasize the benefits of co-developing everything together: Special Focus: Transformer Alternatives As we mentioned above, they are also funding a lot of research in Transformer alternatives. To reiterate a few points on why they matter: * Longer context is not the motivation for sub-quadratic architectures: Transformers don’t inherently have hard limitations on context size, but they just get extremely expensive. When developing sub-quadratic alternatives, you easily enable very long context, but that’s not how you should compare them. Even at the same context size, inference and training is much cheaper on sub-quadratic architectures like Hyena. * Emergence of hybrid architectures: a lot of early conversations have been around the “post-Transformers” era, but it might be more like “half-Transformers”. Hybrid architectures could have split layers with some transformer-based and some state-space ones. One of the challenges is that a lot of hardware kernels are optimized for transformer operations, so you’d lose a lot by moving away completely. * Higher speed = higher GPU throughput: if we could reach the same benchmark performance on sub-quadratic architectures, it’d solve a lot of the GPU crunch. Today we peak at ~170 tok/s on inference in some open models; if we could reach 5,000 tok/s on the same card, you’d be able to serve ~30x more customers (5,000 / 170 ≈ 29x) on the same hardware. As a cloud provider, you’re obviously incentivized to get there. We had a lot of fun chatting with the Together guys and we covered a lot of ground, so enjoy the conversation! Note: This is the first episode of a “cloud providers mini-series”. We have Erik from Modal and Ben from Replicate coming up next! Video Podcast Join us to watch the video version of this pod on our snazzy YouTube! 
Show Notes * Together AI * RedPajama Dataset v1 Announcement * RedPajama Models v1 Announcement * Together Embeddings * StripedHyena-7B * Mamba-3B-SlimPJ * Vipul's X thread on Anyscale * Vipul's Razor * SemiAnalysis' "Inference Race to the Bottom" post * Chris Ré * Mike Conover's episode * Slim Pajama by Cerebras * Dolma by AI2 * Jina AI * Tengyu's Voyage AI Timestamps * [00:00:00] Introductions * [00:00:43] Origin and current state of Together.ai * [00:02:15] Transition from Apple to Together and the vision for open AI * [00:04:54] How Chris Ré introduced Ce and Vipul * [00:08:43] How RedPajama came to be * [00:13:34] Model training and Transformer alternatives * [00:15:37] DSIR and the importance of data in LLMs * [00:21:19] Inference vs Fine-tuning vs Pre-training usage on Together * [00:23:20] Together's GPU stash * [00:27:02] Why standardization of inference metrics is important * [00:29:26] Building moats in AI inference * [00:31:49] Federated vs disaggregated cloud computing * [00:34:57] Opportunities for improvement in the inference stack * [00:36:13] Anyscale benchmarking drama * [00:41:27] Not just an inference platform * [00:43:50] Together Embeddings and the future of embedding models * [00:45:53] State space models and hybrid architectures * [00:53:52] The need for 5,000 tokens/s speed in AI inference * [01:00:23] What's the most interesting unsolved question in AI? Transcript Alessio [00:00:00]: Hey, everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:14]: Hey, and today we're together with Together. Welcome to the studio, guys. Ce / Vipul [00:00:20]: Thank you. Swyx [00:00:21]: I don't know how you typically give self intros, but does anyone want to go first? How do we get our audience acquainted, especially to who's speaking, because it's unusual for us to do a four-person pod. Yeah. Ce [00:00:33]: Hi, everyone. I'm Ce. I'm one of the co-founders of Together and the CTO, working with the team on technical things. Vipul [00:00:40]: I'm Vipul Ved Prakash, co-founder and CEO of Together. Swyx [00:00:43]: I always consider you guys as one of the sort of all-in-one companies. I always want to say labs, but I feel like you're not a lab. What is the sort of origin of Together, and then what is it today? I feel like it used to be Together.xyz, and then now you're Together.ai. Vipul [00:01:00]: I think fundamentally, Together is about open and independent AI systems. We think this is one of the most consequential technologies of our time, and when we started the company in June 2022, our focus was to build a platform for open source, independent, user-owned AI systems. One way to think about it is big labs, frontier model labs, have built their own platforms for developer platforms for their models. We think of Together as a platform for everything else, whether these are open models, whether these are models being built by companies that are owned by them. Our sort of XYZ roots, we have a fairly deep decentralization and open ethos that kind of reflects in all our platform and strategy and business. And we also, the way we structure our cloud is by combining data centers around the world instead of, you know, we are today not located in hyperscalers, we have built a footprint of AI supercomputers in this sort of very disaggregated, decentralized manner. 
Alessio [00:02:15]: I know before Together, you were at Apple, so you go from like the most walled garden, private, we don't say anything company, to we want everything to be open and everybody to know somebody. What maybe did you learn from like the Apple way of being super closed and polished and maybe what are you taking now to Together to make it open, but also a very nice developer experience? Vipul [00:02:37]: Yeah, I would say, you know, one sort of my, you know, background has been in open source for a long tim
We are announcing the second edition of our Latent Space demo day event in SF on 2/23: Final Frontiers, a startup and research competition in “The Autonomous Workforce”, ”Beyond Transformers & GPUs”, and “Embodied AI”. RSVP here! The first one was aimed for 15-20 people and ended up blowing up to >200 and covered in the Information - let’s see what a year of growth (and competition) does to the local events space in 2024. You can find all Latent Space events here, and of course get in touch with us to host your own AI Engineer meetups like AI Engineering Singapore. In our December 2023 recap we covered the Four Wars of the AI stack. But how do we know when it’s time to crown a winner? As we kick off 2024, we wanted to do a recap of the State of AI in 2023 to set a baseline of adoption for different products. Retool had a great report at the end of last year which covered a lot of it. David Hsu, CEO and co-founder of Retool, joined us to go over it together. We also talked about the history of Retool, why they were too embarrassed to present at YC demo day, and how they got to $1M ARR with 3 employees. If you’re a founder, there are a lot of nuggets of advice in here! Retool AI In our modeling of the “Software 3.0 Stack”, we have generally left a pretty wide open gap as to the “user interface” equivalent of the AI stack: Retool AI launched 4 months ago with some nifty features for SQL generation, and its own hosted vector storage service (using pgvector). However, as he explains on the pod, the more interesting potential of Retool is in helping developers build AI infused applications quickly, in combination with its Workflows feature. This moves Retool down the stack from just the UI for internal tooling to the business logic “piping” as well. There are a bunch of dedicated tools in this space like Respell, BuildShip, Flowise, and Ironclad Rivet. "We think that practically every internal app is going to be AI infused over the next three years." - David on the pod RIP StackOverflow? In July 2023 we talked about the impact of ChatGPT and Copilot: This was then disputed by StackOverflow, who pointed out (very fairly so) that there were privacy-related changes in their analytics instrumentation in 2022. StackOverflow no longer reports traffic, but based on StackOverflow’s continuing transparency we can see that organic declines have continued throughout 2023. Retool’s report comes over a year after those changes and has some self reported samples from users: * 57.6% of people said they have used StackOverflow less; almost all of them replaced it with ChatGPT and Copilot. * 10.2% said they no longer use StackOverflow. We also saw a lot more tools being released in the dev tools space such as (one of our oldest pod friends) Codeium (which just raised a $65M Series B), SourceGraph (and their newly released Cody), Codium AI (just released AlphaCodium which was picked up by Karpathy), Phind (which beat GPT-4 with OSS models), and Cursor, one of the most beloved products in the dev community at the moment. Intelligence is getting closer and closer to the IDE, and the trend doesn’t seem to be reverting. We already said that “You are not too old (to pivot into AI)“, and the advice still stands. When asked to rate “Preference for hiring engineers effective at using ChatGPT/Copilot for coding” on a scale of 1 to 10, where 10 is “Much more likely”, ~40% of companies voted 8-10. Having an AI Engineer skillset is extremely important. 
45% of companies between 1,000-4,999 employees said that they increased the difficulty of technical interviews to compensate for these new tools, so the gap between users and non-users will keep widening. Crossing the AI in Production Chasm Geoffrey Moore’s “Crossing the Chasm” is one of the most quoted business frameworks. Every market has an initial group of Innovators and Early Adopters, who are willing to suffer through the rough edges of products initially, before the market eventually crosses into the Early Majority, which expects a full product. In the AI world, ChatGPT and Midjourney / DALL-E have crossed the chasm in the consumer space. Copilot is probably the only tool that did it in the enterprise, having crossed 1M paid users. ~$50B was invested in AI in 2023, and we still only have a small fraction of that showing up in production. According to the survey, only 25% of companies had real production usage, but 77.1% said their company is making efforts to adopt more. Closing that gap could triple AI adoption in one year. The report also broke down adoption by use case. 66% of companies use it internally, while only 43% do so in customer-facing use cases. Internal usage of AI is much more varied than customer-facing usage as well: One point that David made in the podcast is that this number isn’t a knock on AI as a tool, but rather about the demographics of businesses outside of our Silicon Valley bubble: We all work in Silicon Valley, right? We all work at businesses, basically, that sell software as a business. And that's why all the software engineers that we hire basically work on external facing software, which makes sense with most software companies. But if you look at most companies in the world, most companies in the world are actually not software companies. […] Most of the [work of] software engineers in the world actually goes towards these internal facing applications. Beyond code models, it’s clear that the big winners of the first wave of AI adoption are vector stores and RAG. Knowledge base Q&A, customer chatbots, recommendation systems, etc are all based on them. Retool even rolled out their own with Retool Vectors. Expect the battlefield to get even hotter in these areas, with Mongo and Chroma leading the charge on an NPS/popularity basis. It’s also clear that OpenAI won the first campaign in the AI models war, by far. Hopefully Mistral and LLaMA3 will shake up this chart when we look back at it in 2025: TLDR: We’re really early. If you want to build in AI, there’s a ton of work to be done, and a lot of problems to be solved. You can find the full report here to dive through all the numbers. Video podcast Watch along on our snazzy YouTube! Show Notes Companies and Projects: * Retool * State of AI Report * Retool AI * Retool Workflows * Raising less money at lower valuations * Paul Graham's "playing house" essay * Gödel, Escher, Bach (GEB) Timestamps * [00:00:00] Introduction * [00:02:43] Retool's founding story and decision not to present at YC demo day initially * [00:09:08] Philosophy on fundraising - raising less money at lower valuations * [00:12:53] Overview of what Retool is * [00:15:41] Origin story of Retool AI product * [00:19:59] Decision to use open source vector database PG Vector * [00:21:29] Most underrated AI use cases * [00:25:56] Retool's AI UX and workflows * [00:30:38] Zapier vs Retool * [00:32:54] Updates from Retool's 2023 State of AI survey * [00:35:21] Who is adopting AI first? 
* [00:37:40] Evolving engineering hiring practices in the age of Copilot/ChatGPT * [00:40:02] Retool's views on internal vs external AI adoption * [00:41:50] OSS models vs OpenAI in production * [00:44:46] Additional survey questions to ask in 2024 * [00:47:04] Balancing enterprise sales vs bottom-up adoption * [00:51:54] Philosophical thoughts on AGI and intentionality Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:16]: And today we are in the studio with David Hsu from Retool. Welcome. David [00:00:20]: Thanks. Excited to be here. Swyx [00:00:23]: We like to give a little bit of intro from what little we can get about you and then have you talk about something personal. You got your degree in philosophy and CS from Oxford. I wasn't aware that they did double degrees. Is that what you got? David [00:00:35]: It's actually a single degree, which is really cool. So basically you study computer science, you study philosophy, and you study the intersection. The intersection is basically AI, actually, and sort of, can computers think, or can computers be smart? What does it mean for a computer to be smart? As well as logic. It's also another intersection, which is really fun too. Swyx [00:00:51]: At Stanford, it might be symbolic systems or whatever. It's always hard to classify these things when we don't really have a word for it. Now I guess everything's just called AI. Five years ago, you launched Retool. You were in YC in Winter '17 and it's just been a straight line up from there, right? David [00:01:09]: I wish. Swyx [00:01:10]: What's something on your LinkedIn that people should know about you? Maybe a personal hobby or, you know, let's just say something you're very passionate about. David [00:01:17]: Yeah, sure. I read quite a bit. I probably read like two books a week around about. So it's a lot of fun. I love biking. It's also quite a bit of fun. So yeah. Swyx [00:01:25]: Do you use Retool to read? David [00:01:27]: No, I don't use Retool to read. No, that'd be funny. Swyx [00:01:30]: What do you read? How do you choose what you read? Any recommendations? David [00:01:35]: I'm mostly reading fiction nowadays. So fiction is a lot of fun. I think it helps me be more empathetic, if you will. I think it's a lot of fun. I actually just want to see what it's like to be in someone else's shoes. That's what's really good about philosophy as well. I find philosophy just so interesting, especially logic. We can talk more about that for probably hours if you want. Swyx [00:01:50]: So yeah, I have a casual interest in epistemology. And I think that any time you, you know, you're trying to figure out a way to solve a problem, you're going to have to figure out a way to solve it. David [00:02:05]: Yeah, totally. What does it mean to know? Alessio [00:02:13]: That's its own podcast. We should do a special edition about
Note for Latent Space Community members: we have now soft-launched meetups in Singapore, as well as two new virtual paper club/meetups for AI in Action and LLM Paper Club. We’re also running Latent Space: Final Frontiers, the second edition of last year’s demo day hackathon. Edit from March 2024: We did a followup on the Four Wars on the AI Breakdown. For the first time, we are doing an audio version of the monthly AI Engineering recap that we publish on Latent Space! This month it’s “The Four Wars of the AI Stack”; you can find the full recap with all the show notes here: https://latent.space/p/dec-2023 * [00:00:00] Intro * [00:01:42] The Four Wars of the AI stack: Data quality, GPU rich vs poor, Multimodality, and RAG/Ops war * [00:03:17] Selection process for the four wars and notable mentions * [00:06:58] The end of low background tokens and the impact on data engineering * [00:08:36] The Quality Data Wars (UGC, licensing, synthetic data, and more) * [00:14:51] Synthetic Data * [00:17:49] The GPU Rich/Poors War * [00:18:21] Anyscale benchmark drama * [00:22:00] The math behind Mixtral inference costs * [00:28:48] Transformer alternatives and why they matter * [00:34:40] The Multimodality Wars * [00:38:10] Multiverse vs Metaverse * [00:45:00] The RAG/Ops Wars * [00:50:00] Will frameworks expand up, or will cloud providers expand down? * [00:54:32] Syntax to Semantics * [00:56:41] Outer Loop vs Inner Loop * [00:59:54] Highlight of the month This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Latent Space is heating up! Our paper club ran into >99 person Discord limits, oops. We are also introducing 2 new online meetups: LLM Paper Club Asia for Asia timezone (led by Ivan), and AI in Action: hands-on application of AI (led by KBall). To be notified of all upcoming Latent Space events, subscribe to our new Luma calendar (sign up for individual events, or hit the RSS icon to sync all events to calendar). In the halcyon open research days of 2022 BC (Before-ChatGPT), DeepMind was the first to create a SOTA multimodal model by taking a pre-existing LLM (Chinchilla 70B - now dead?) and pre-existing vision encoder (CLIP) and training a “glue” adapter layer (we sketch the idea below), inspiring a generation of stunningly cheap and effective multimodal models including LLaVA (one of the Best Papers of NeurIPS 2023), BakLLaVA and FireLLaVA. However (for reasons we discuss in today’s conversation), DeepMind’s Flamingo model was never open sourced. Based on the excellent paper, LAION stepped up to create OpenFlamingo, but it never scaled beyond 9B. Simultaneously, the M4 (audio + video + image + text multimodality) research team at HuggingFace announced an independent effort to reproduce Flamingo up to the full 80B scale. The effort started in March, and was released in August 2023. We happened to be in Paris last year, and visited HuggingFace HQ to learn all about HuggingFace’s research efforts, and cover all the ground knowledge LLM people need to become (what Chip Huyen has termed) “LMM” people. In other words: What is IDEFICS? IDEFICS is an Open Access Visual Language Model, available in 9B and 80B model sizes. As an attempt to re-create an open-access version of Flamingo, it seems to track very well on a range of multimodal benchmarks (which we discuss in the pod): You can see how the models reason over a combination of interleaved images + text, in a way that allows users to describe images, ask questions about the images, or extend/combine the images into different artworks (e.g. poetry). 📷 From IDEFICS’s model card and blog post The above demo screenshots are actually from the fine-tuned instruct versions of IDEFICS — which are again in 9B and 80B versions. IDEFICS was built by connecting two unimodal models together to provide the multi-modality you see showcased above. * Llama v1 for language (specifically huggyllama/llama-65b) - the best available open model at the time, to be swapped for Mistral in the next version of IDEFICS * A CLIP model for vision (specifically laion/CLIP-ViT-H-14-laion2B-s32B-b79K - after a brief exploration of EVA-CLIP, which we discuss on the pod) OBELICS: a new type of Multimodal Dataset IDEFICS’ training data used the usual suspect datasets, but to get to par with Flamingo they needed to create a new dataset. Enter OBELICS: “An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”: * 115B text tokens * 141M English documents * 353M images The dataset was carefully curated and filtered from Common Crawl dumps between Feb 2020 and Feb 2023. We discuss the 2 months of mind-numbing, unglamorous work creating this pipeline: There are a lot of mentions of “multimodal web documents”, which deserve some explanation. We’ll show you instead of tell you: You can see from this graph that OBELICS ends up outperforming the other image-text pairs dataset (LAION in this case) when stacked head-to-head. 
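To make the “glue” idea concrete: the trick Flamingo popularized is to freeze both unimodal models and train only a small connector that maps vision features into the language model’s embedding space. Flamingo and IDEFICS use gated cross-attention blocks for this, but the even simpler LLaVA-style linear projection conveys the idea; here is a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn


class VisionToLLMGlue(nn.Module):
    """Project frozen vision-encoder patch features into the (frozen) LLM's
    token-embedding space, so images can be interleaved with text tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trained weights

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from e.g. a CLIP ViT
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)


# Usage: concatenate the projected image "tokens" with text embeddings and
# feed the combined sequence through the LLM; only the glue gets gradients.
glue = VisionToLLMGlue()
image_tokens = glue(torch.randn(1, 257, 1024))  # a ViT-H/14 yields 257 patches
```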
You can view a subset of OBELICS and perform visualizations on it here: 2024 Update: WebSight et al Most of this interview was recorded on Halloween 2023 at HuggingFace’s headquarters in Paris, in anticipation of an IDEFICS v2 release. However, several roadblocks emerged, including a notable scandal around CSAM in LAION-5B, which affected all models using that dataset. The M4 team have adopted a strategy of shipping smaller advancements in 2024, and the first ship of the year is WebSight, a dataset of 823,000 HTML/CSS code samples representing synthetically generated English websites, each accompanied by a corresponding screenshot (rendered with Playwright). This is intended for screenshot-to-code workflows like Vercel’s V0 or TLDraw, and will be part of the dataset for IDEFICS-2. As noted in our Best Papers recap, synthetic data is emerging as one of the top themes of 2024, and the IDEFICS/OBELICS team have wasted no time enabling themselves with it. Timestamps * [0:00:00] Intro * [0:00:00] Hugo, Leo’s path into multimodality * [0:09:16] From CLIP to Flamingo * [0:12:54] Benchmarks and Evals * [0:16:54] OBELICS dataset * [0:34:47] Together Redpajama v2 * [0:37:12] GPT4 Vision * [0:38:44] IDEFICS model * [0:40:57] Query-Key Layernorm for training * [0:46:40] Choosing smaller vision encoders - EVA-CLIP vs SigLIP * [0:49:02] IDEFICS v2 * [0:52:39] Multimodal Hallucination * [0:59:12] Why Open Source Multimodality * [1:05:29] Naming: M4, OBELICS, IDEFICS * [1:08:56] 2024 Update from Leo Show Notes * Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model * IDEFICS Knowledge sharing memo: technical lessons and mistakes * Victor Sanh memo * OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents * Papers cited: * BLOOM: A 176B-Parameter Open-Access Multilingual Language Model * Barlow Twins: Self-Supervised Learning via Redundancy Reduction * CLIP paper: Learning Transferable Visual Models From Natural Language Supervision * Vision Transformers paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale * Flamingo paper: a Visual Language Model for Few-Shot Learning * April 2022 preprint from DeepMind, blogpost * VQAV2 paper: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering * OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (https://okvqa.allenai.org/) * MMBench: Is Your Multi-modal Model an All-around Player? * Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond * SigLIP paper: Sigmoid Loss for Language Image Pre-Training * Nougat: Neural Optical Understanding for Academic Documents * MMC4 (Multimodal C4): An Open, Billion-scale Corpus of Images Interleaved With Text * Dall-E 3 paper: Improving Image Generation with Better Captions * GPT-4V(ision) system card from OpenAI * Query-Key Layernorm trick: paper (Scaling Vision Transformers to 22 Billion Parameters), tweet * EVA-CLIP: Improved Training Techniques for CLIP at Scale * “We initially explored using a significantly bigger vision encoder (the biggest in open-access at that time) with EVA-CLIP. However, we ran into training instabilities very quickly. To lower the risks associated to the change of vision encoder, we decided to continue with laion/CLIP-ViT-H-14-laion2B-s32B-b79K which we have been using until that point. 
We will leave that swap for future iterations and will also consider using higher resolution images.” * Datasets * Together’s RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models * LAION COCO: 600M synthetic captions from Laion2B-en * Chip Huyen’s writeup on LMMs * Joseph Nelson of Roboflow on Latent Space * HuggingFace M4 * HuggingFace timm: library containing SOTA computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/evaluation scripts. It comes packaged with >700 pretrained models, and is designed to be flexible and easy to use. * Logan Kilpatrick declaring 2024 the year of Multimodal AI at AI Engineer Summit This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
In 2023 we did a few Fundamentals episodes covering Benchmarks 101, Datasets 101, FlashAttention, and Transformers Math, and it turns out those were some of your evergreen favorites! So we are experimenting with more educational/survey content in the mix alongside our regular founder and event coverage. Pls request more! We have a new calendar for events; join to be notified of upcoming things in 2024! Today we visit the shoggoth mask factory: how do transformer models go from trawling a deeply learned latent space for next-token prediction to a helpful, honest, harmless chat assistant? Our guest “lecturer” today is Nathan Lambert; you might know him from his prolific online writing on Interconnects and Twitter, or from his previous work leading RLHF at HuggingFace and now at the Allen Institute for AI (AI2), which recently released the open source GPT3.5-class Tulu 2 model which was trained with DPO. He’s widely considered one of the most knowledgeable people on RLHF and RLAIF. He recently gave an “RLHF 201” lecture at Stanford, so we invited him on the show to re-record it for everyone to enjoy! You can find the full slides here, which you can use as reference through this episode. Full video with synced slides For audio-only listeners, this episode comes with a slide presentation alongside our discussion. You can find it on our YouTube (like, subscribe, tell a friend, et al). Theoretical foundations of RLHF The foundations and assumptions that go into RLHF go back all the way to Aristotle (and you can find guidance for further research in the slide below) but there are two key concepts that will be helpful in thinking through this topic and LLMs in general: * Von Neumann–Morgenstern utility theorem: you can dive into the math here, but the TLDR is that when humans make decisions there’s usually a “maximum utility” function that measures what the best decision would be; the fact that this function exists makes it possible for RLHF to model human preferences and decision making. * Bradley-Terry model: given two items A and B from a population, you can model the probability that A will be preferred to B (or vice-versa). In our world, A and B are usually two outputs from an LLM (or at the lowest level, the next token); there's a short code sketch of this below. It turns out that from this minimal set of assumptions, you can build up the mathematical foundations supporting the modern RLHF paradigm! The RLHF loop One important point Nathan makes is that "for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior". For example, it might be difficult for you to write a poem, but it's really easy to say if you like or dislike a poem someone else wrote. Going back to the Bradley-Terry Model we mentioned, the core idea behind RLHF is that when given two outputs from a model, you will be able to say which of the two you prefer, and we'll then re-encode that preference into the model. An important point that Nathan mentions is that when you use these preferences to change model behavior "it doesn't mean that the model believes these things. It's just trained to prioritize these things". When you have a preference for a model not to return instructions on how to write a computer virus, for example, you're not erasing the weights that have that knowledge, but you're simply making it hard for that information to surface by prioritizing answers that don't return it. We'll talk more about this in our future Fine Tuning 101 episode as we break down how information is stored in models and how fine-tuning affects it. 
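Before we get to the loop, here is the Bradley-Terry piece in code: the probability of preferring A over B is a sigmoid of the gap between their learned scores, and the standard reward-model objective is just the negative log-likelihood of the labeled preference. This is the textbook form (as in InstructGPT), not Nathan's own code:

```python
import torch
import torch.nn.functional as F


def bradley_terry_prob(score_a: torch.Tensor, score_b: torch.Tensor) -> torch.Tensor:
    # P(A preferred over B) = exp(s_a) / (exp(s_a) + exp(s_b)) = sigmoid(s_a - s_b)
    return torch.sigmoid(score_a - score_b)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the labeler's choice under Bradley-Terry;
    # minimizing this pushes the reward model to score chosen > rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

DPO, which comes up at the end of this post, optimizes essentially the same sigmoid-of-score-gap objective, except the scores are scaled log-probability ratios of the policy itself rather than the outputs of a separate reward model.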
At a high level, the loop looks something like this:
For many RLHF use cases today, we can assume the model we're training is already instruction-tuned for chat or whatever behavior it's meant to achieve. In the "Reward Model & Other Infrastructure" stage we have multiple pieces:
Reward + Preference Model
The reward model is trying to signal to the model how much it should change its behavior based on the human preference, subject to a KL constraint. The preference model itself scores pairwise preferences from the same prompt (this worked better than scalar rewards). One way to think about it is that the reward model tells the model how big of a change this new preference should make to its behavior in absolute terms, while the preference model calculates how big the difference between the two outputs is in relative terms. A lot of this derives from John Schulman’s work on PPO:
We recommend watching him talk about it in the video above, and also Nathan’s pseudocode distillation of the process:
Feedback Interfaces
Unlike the "thumbs up/down" buttons in ChatGPT, data annotation from labelers is much more thorough and has many axes of judgment. At a simple level, the LLM generates two outputs, A and B, for a given human conversation. It then asks the labeler to use a Likert scale to score which one it preferred, and by how much:
Through the labeling process, there are many other ways to judge a generation:
We then use all of this data to train a model from the preference pairs we have. We start from the base instruction-tuned model, and then run training in which the gradient descent loss is computed from the difference between the preferred and rejected outputs.
Constitutional AI (RLAIF, model-as-judge)
As these models have gotten more sophisticated, people started asking whether models might actually be better judges of harmfulness, bias, etc than humans, especially at the current price of data labeling. Anthropic's "Constitutional AI" paper uses models to judge models. This is part of a broader "RLAIF" space: Reinforcement Learning from AI Feedback. By using a "constitution" that the model has to follow, you are able to generate fine-tuning data for a new model that will be RLHF'd on the constitution's principles. The RLHF model will then be able to judge outputs of models to make sure that they follow its principles:
Emerging Research
RLHF is still a nascent field, and there are a lot of different research directions teams are taking; some of the newest and most promising / hyped ones:
* Rejection sampling / Best of N Sampling: the core idea here is that rather than just scoring pairwise generations, you generate a lot more outputs (= more inference cost), score them all with your reward model, and then pick the top-scoring results. LLaMA2 used this approach, amongst many others.
* Process reward models: in Chain of Thought generation, scoring each step in the chain and treating it like its own state rather than just scoring the full output. This is most effective in fields like math that inherently require step-by-step reasoning.
* Direct Preference Optimization (DPO): we covered DPO in our NeurIPS Best Papers recap, and Nathan has a whole blog post on this; DPO isn’t technically RLHF as it doesn’t have the RL part, but it’s the “GPU Poor” version of it. Mistral-Instruct was a DPO model, as were Intel’s Neural Chat and StableLM Zephyr. Expect to see a lot more variants in 2024 given how “easy” this was.
* Superalignment: OpenAI launched research on weak-to-strong generalization, which we briefly discuss at the 1hr mark.
Note: Nathan also followed up this post with RLHF resources from his and peers’ work:
Show Notes
* Full RLHF Slides
* Interconnects
* Retort (podcast)
* von Neumann-Morgenstern utility theorem
* Bradley-Terry model (pairwise preferences model)
* Constitutional AI
* TAMER (2008 paper by Bradley Knox and Peter Stone)
* Paul Christiano et al. RLHF paper
* InstructGPT
* Eureka by Jim Fan
* ByteDance / OpenAI lawsuit
* AlpacaEval
* MTBench
* TruthfulQA (evaluation tool)
* Self-Instruct Paper
* Open Assistant
* Louis Castricato
* Nazneen Rajani
* Tulu (DPO model from the Allen Institute)
Timestamps
* [00:00:00] Introductions and background on the lecture origins
* [00:05:17] History of RL and its applications
* [00:10:09] Intellectual history of RLHF
* [00:13:47] RLHF for decision-making and pre-deep RL vs deep RL
* [00:20:19] Initial papers and intuitions around RLHF
* [00:27:57] The three phases of RLHF
* [00:31:09] Overfitting issues
* [00:34:47] How preferences get defined
* [00:40:35] Ballpark on LLaMA2 costs
* [00:42:50] Synthetic data for training
* [00:47:25] Technical deep dive in the RLHF process
* [00:54:34] Rejection sampling / Best-of-N sampling
* [00:57:49] Constitutional AI
* [01:04:13] DPO
* [01:08:54] What's the Allen Institute for AI?
* [01:13:43] Benchmarks and models comparisons
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.
Swyx [00:00:15]: Hey, and today we have Dr. Nathan Lambert in the house. Welcome.
Nathan [00:00:18]: Thanks guys.
Swyx [00:00:19]: You didn't have to come too far. You got your PhD in Berkeley, and it seems like you've lived there most of the time in recent years. You worked on robotics and model-based reinforcement learning on your PhD, and you also interned at FAIR and DeepMind. You bootstrapped the RLHF team at Hugging Face, and you recently joined the Allen Institute as a research scientist. So that's your quick bio. What should people know about you that maybe is not super obvious about you on New LinkedIn?
Nathan [00:00:43]: I stay sane in various insane sport and ultra-endurance sport activities that I do.
Swyx [00:00:50]: What's an ultra-endurance sport activity?
Nathan [00:00:52]: Long-distance trail running or gravel biking. Try to unplug sometimes, although it's harder these days. Yeah.
Swyx [00:00:59]: Well, you know, just the Bay Area is just really good for that stuff, right?
Nathan [00:01:02]: Oh, yeah. You can't beat it. I have a trailhead like 1.2 miles from my house, which is pretty unmatchable in any other urban area.
Swyx [00:01:11]: Pretty excellent. You also have an incredible blog, Interconnects, which I'm a fan of. And I also
Happy 2024! We appreciated all the feedback on the listener survey (still open, link here)! Surprising to see that some people’s favorite episodes were others’ least favorite, but we’ll always work on improving our audio quality and booking great guests. Help us out by leaving reviews on Twitter, YouTube, and Apple Podcasts! 🙏 Big thanks to Chris Anderson for the latest review - be like Chris!
Note to the Audio-only Listener
Because of the nature of today’s topic, it makes the most sense to follow along with the demo on video rather than audio. There’s also about 30 mins of demos and technical detail that we had to remove from the audio version, because they didn’t make sense without video. Trailer here. Full 90min chat:
(In other words, pls jump over and watch on our YouTube if you can! Did you know we are now posting every episode to YouTube? We’ve been multimodal for a long time!)
Trend 1: GPT4-V Coding
You might remember Greg Brockman’s hand-scribble-to-working-website demo from the GPT-4 launch in March. This was largely inaccessible to the rest of us until the GPT4-V API was released at Dev Day in November.
As mentioned in our November 2023 recap, one of the biggest viral trends was tldraw’s open source “Make It Real” demo: starting from a simple wireframe and text annotations, you could create a real, functioning UI with the click of a button.
Provoking another crisis of confidence in developer circles:
And using state charts:
And provoking responses from Excalidraw, a competitor.
You can see us creating a Replit clone in this silent video here:
Since our interview, the new GPT4-V coding metagame has been merging app UIs and SQL with Supabase (another AIE Summit speaker) and other backend tools:
* generating SQL
* converting ERDs to SQL (part 2, for MariaDB)
* seeding sample data
* doing migrations
Trend 2: Latent Consistency Models
As covered in the Latent Space Paper Club in November, 3 papers drove a roughly 100x acceleration in the speed of text-to-image generation over the past year:
* Consistency Models (with Ilya Sutskever)
* Latent Consistency Models (from Tsinghua)
* LCM-LoRA (also Tsinghua, same authors)
With the invaluable help of Fal.ai (friends of the show and AI Engineer Summit and progenitors of the viral GPU Rich/Poor hats mentioned on the Semianalysis episode), tldraw has also been at the forefront of putting this research into production, with two projects:
* drawfast: add a prompt, start sketching into the canvas and see each stroke affect the drawing. Overlap multiple of them to extend and merge drawings.
* lens: a collaborative canvas where in real time people can draw and have their sketch turn into AI-generated art. Start drawing at the bottom and see it scroll into the magic canvas.
For nontechnical people in your life, we do recommend showing them lens.tldraw.com (and its predecessor that we discuss on the show) on your and their mobile devices.
The Rise of Multimodal Prompting
At the first AI Engineer Summit in October, Logan (our first guest!) declared this the Year of Multimodality. Over the next 2 months we saw an explosion of activity in multimodal: GPT-4V’s API release at OpenAI Dev Day (our coverage here), LLaVA (our chat with the author here on Visual Instruction Tuning), BakLLaVA, Qwen-VL, CogVLM, etc.
On today’s episode we have Steve Ruiz, founder of tldraw. The project originally started as an open source whiteboard that Steve built for himself and then “accidentally made a really, really good visual multimodal prompting application environment”.
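If you want to try the Make It Real pattern yourself, the core of it is a single multimodal API call. Here's a minimal sketch of the idea — not tldraw's actual implementation, with a deliberately simplified prompt — against the GPT4-V API released at Dev Day:

```python
import base64
from openai import OpenAI  # assumes the openai Python SDK v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_it_real(screenshot_path: str) -> str:
    """Send a wireframe screenshot to GPT-4V, get back a single-file HTML page."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are an expert web developer. Turn this low-fidelity "
                    "wireframe into a single working HTML file with inline CSS "
                    "and JavaScript. Return only the HTML."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# html = make_it_real("wireframe.png")
```

tldraw's production version layers on much more context from the canvas (annotations, prior results), which we get into on the episode.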
Turns out that infinite canvas and generative models are a very good match:
* Design is iterative: DALL-E, Midjourney, etc. all work in a linear way: prompt goes in, 1-4 images come back. As you generate more, the previous images scroll away from your view. In a canvas environment, you can see the progression of your generation and visually “branch” by putting new prompts in different spaces.
* UI has “layers”: when designing interfaces there are different layers to it: the functionality, the style, the state, etc. Some of what they are building in tldraw is bringing images into the canvas to influence different layers: “One thing that we've done is to bring in screenshots of other apps, like here's Stripe.com, like make it look like Stripe, you know? Or like here's Linear.com, like let's do it this way”.
In the episode we spend a lot more time talking through all of these ideas and how Steve’s background in fine arts ended up being really useful in building a multimodal AI canvas. Enjoy!
Show Notes
* tldraw
* Open Source Repo
* Make Real (Wireframe to UI)
* drawfast.tldraw.com
* lens.tldraw.com
* Perfect Free Hand and Perfect Arrows
* “Make Real, the story so far”
* Dog CEO
* Other whiteboarding products mentioned
* Excalidraw
* FigJam
* Adobe Whiteboard
* See also Steve’s interviews on the Slow Steady Pod and TWiSt, and subscribe to his tldraw substack!
* tldraw Wireframe kit
* tldraw LLM starter
Timestamps
* [00:00:00] Introductions
* [00:01:02] Steve's Background In Fine Arts and Transition into Tech
* [00:08:22] The creation of tldraw and its open source origin
* [00:15:44] The Inception and Growth of tldraw
* [00:18:40] The Integration of AI with tldraw and Make It Real Feature
* [00:21:56] Discussion on Multimodal Prompting and Iterative Design
* [00:32:32] The Concept of Parallel Prompting in Design
* [00:34:11] Impact of AI on developer jobs
* [00:37:28] Additional Layers in Multimodal Prompting
* [00:45:18] Introduction of DrawFast and Lens Projects
* [00:50:03] tldraw 2.0 and the future of the project
* [00:55:41] The Competitive Landscape of Canvas Tools and tldraw's Unique Position
* [01:00:22] Advice for Founders: Following Your Interests and Desires
Transcript
Swyx: Welcome back to Latent Space. I'm very excited to have my good friend, Steve Ruiz. How are you this morning? [00:00:13]
Steve: Hey, how's it going? [00:00:14]
Swyx: I have had the good fortune of knowing you before you got famous and actually hanging out in the precise office and studio that you're recording from right now. Congrats on Make It Real. Congrats on tldraw. I think it's been something that's sort of years in the making, but it probably looks like an overnight success to a lot of people. [00:00:32]
Steve: Yeah. Thank you. It's kind of a funny story. I don't know. Where should we jump into it? [00:00:37]
Swyx: Well, I like to give you a little background on the person. You don't have a lot of detail on LinkedIn, just like myself. I just found out just before recording that you're also a late entrance into tech. So maybe just like, what's your background coming into something like tldraw? What makes you so unique at doing sort of creative collaborative experiences like that? I know you and I've actually used tldraw, so I have some appreciation for how hard this thing is. [00:01:02]
Steve: Yeah. Like you said, I kind of came into this a little late and kind of came into it from a weird angle. My background is actually in fine art and studio art.
I have my master's from University of Chicago in visual art, would write about contemporary art and put together exhibitions and do my own paintings and drawings. And that was back when I was living in Chicago. And then when I moved over to the UK, you know, got a new studio, kept that going. But when I turned 30, I kind of decided I should probably make some money and work with other people closer than I was at the time. Studio art is primarily a solo thing. I'd always had kind of like an analytical kind of side to me. My day jobs were, you know, I was working for lawyers. I was doing this writing, like magazines and stuff. So when I did that kind of that switch back eventually to design and product design, I was also able to use a tiny little bit of technical skill that I had had just building like WordPress websites for myself and other artists as portfolios. Kind of take that, just some natural curiosity around the way that products work and kind of create a career direction that was more around prototyping and like technical design and kind of like doing the design on the bits of a product that really couldn't be designed otherwise. So the interactive bits, the bits which are maybe more, there's more questions about them. There's no clear answer to terms of like, how should this work? You know, in all those places, you kind of have to build something in order to, to figure out what you want to build. It turns out, you know, to skip right to the end for a moment, like canvas is full of those types of problems. So it's no surprise that I ended up there. It's like kind of an extreme form of the same problem. But yeah, so I was working, this was back in like 2017, 2018. And I used at the time a product called Framer. That was back when it was more of like a code product than what it is now, which is more of like a visual builder that is kind of backed by code. So I'm sort of just drilled into that. It was cool. Uber was using it. No one knew how it worked. No one could use it. So I got good at it and got a lot of advancement, early traction, whatever in my career based on that. But it also taught me to code, taught me to think about building things that other people are going to use. Taught me about kind of like the type of code that you write when you're in an exploratory phase rather than like in an execution, like production phase. And I actually ended up working for Framer. I did their education for a year, which was very different than the type of product design that I was doing before that. I did a lot of video tutorials and writing and tweeting, trying to figure out some way to make technical design content interesting, you know, in little chunks that people could consume. I joke that like they probably got less out of me in that job than I got out of the job itself. Like because, yeah, I walked away from that. Not sure if I'd helped
We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here.
We can’t think of a more Latent-Space-y way to end 2023 than with a mega episode featuring many old and new friends recapping their biggest news, achievements, and themes and memes of the year!
We previously covered the Best Papers of NeurIPS 2023, but the other part of NeurIPS being an industry-friendly conference is all the startups that show up to hire and promote their latest and greatest products and papers! As a startup-friendly podcast, we of course were ready with our mics to talk to everyone we could track down.
In lieu of an extended preamble, we encourage you to listen and click through all the interviews and show notes, all of which have been curated to match the references mentioned in the episode.
Timestamps & Show Notes
* [00:01:26] Jonathan Frankle - Chief Scientist, MosaicML/Databricks
* see also the Mosaic/MPT-7B episode
* $1.3B MosaicML x Databricks acquisition
* [00:22:11] Lin Qiao - CEO, Fireworks AI
* Fireworks Mixtral
* [00:38:24] Aman Sanger - CEO, Anysphere (Cursor)
* see also the Cursor episode
* $8m seed from OpenAI
* Tweet: Request-level memory-based KV caching
* Tweet: GPT-4 grading and Trueskill ratings for rerankers
* [00:51:14] Aravind Srinivas - CEO, Perplexity
* 1m app installs on iOS and Android
* pplx-online api 7b and 70b models
* Shaan Puri/Paul Graham Fierce Nerds story
* [01:04:26] Will Bryk - CEO, Metaphor
* “Andrew Huberman may have singlehandedly ruined the SF social scene”
* [01:12:49] Jeremy Howard - CEO, Answer.ai
* see also the End of Finetuning episode
* Jeremy’s podcast with Tanishq Abraham, Jess Leao
* Announcing Answer.ai with $10m from Decibel VC
* Laundry Buddy, Nov 2023 AI Meme of the Month
* [01:37:13] Joel Hestness - Principal Scientist, Cerebras
* CerebrasGPT, all the Cerebras papers we discussed
* [01:56:34] Jason Corso - CEO, Voxel51
* Open Source FiftyOne project
* CVPR Survival Guide
* [02:02:39] Brandon Duderstadt - CEO, Nomic.ai
* GPT4All, Atlas, Demo
* [02:12:39] Luca Antiga - CTO, Lightning.ai
* Pytorch Lightning, Lightning Studios, LitGPT
* [02:29:46] Jay Alammar - Engineering Fellow, Cohere
* The Illustrated Transformer
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here.
NeurIPS 2023 took place from Dec 10–16 in New Orleans. The Latent Space crew was onsite for as many of the talks and workshops as we could attend (and more importantly, hosted cocktails and parties after hours)!
Picking from the 3586 papers accepted to the conference (available online, full schedule here) is an impossible task, but we did our best to present an audio guide with brief commentary on each. We also recommend MLContests.com's NeurIPS recap, Seb Ruder’s NeurIPS primer, and Jerry Liu’s paper picks. We also found the VizHub guide useful for a t-SNE clustering of papers. Lots also happened in the arxiv publishing world outside NeurIPS, as highlighted by Karpathy, especially DeepMind’s Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models.
Jan 2024 update: we also strongly recommend Sebastian Raschka's pick of the year’s 10 best papers, including Pythia.
We’ll start with the NeurIPS Best Paper Awards, then go to a selection of non-awarded but highly influential papers, and then arbitrary personal picks to round out the selection. Where we were able to do a poster session interview, please scroll to the relevant show notes for images of the poster under discussion. We give Chris Ré the last word due to the Mamba and StripedHyena state space models drawing particular excitement but still being too early to assess impact.
Timestamps
* [0:01:19] Word2Vec (Jeff Dean, Greg Corrado)
* [0:15:28] Emergence Mirage (Rylan Schaeffer)
* [0:28:48] DPO (Rafael Rafailov)
* [0:41:36] DPO Poster Session (Archit Sharma)
* [0:52:03] Datablations (Niklas Muennighoff)
* [1:00:50] QLoRA (Tim Dettmers)
* [1:12:23] DataComp (Samir Gadre)
* [1:25:38] DataComp Poster Session (Samir Gadre, Alex Dimakis)
* [1:35:25] LLaVA (Haotian Liu)
* [1:47:21] LLaVA Poster Session (Haotian Liu)
* [1:59:19] Tree of Thought (Shunyu Yao)
* [2:11:27] Tree of Thought Poster Session (Shunyu Yao)
* [2:20:09] Toolformer (Jane Dwivedi-Yu)
* [2:32:26] Voyager (Guanzhi Wang)
* [2:45:14] CogEval (Ida Momennejad)
* [2:59:41] State Space Models (Chris Ré)
Papers covered
* Distributed Representations of Words and Phrases and their Compositionality (Word2Vec) Tomas Mikolov · Ilya Sutskever · Kai Chen · Greg Corrado · Jeff Dean. The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model.
* Some notable reflections from Tomas Mikolov - and debate over the Seq2Seq paper credit with Quoc Le
* Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al.). Emergent abilities are abilities that are present in large-scale models but not in smaller models and are hard to predict. Rather than being a product of models’ scaling behavior, this paper argues that emergent abilities are mainly an artifact of the choice of metric used to evaluate them. Specifically, nonlinear and discontinuous metrics can lead to sharp and unpredictable changes in model performance. Indeed, the authors find that when accuracy is changed to a continuous metric for arithmetic tasks where emergent behavior was previously observed, performance improves smoothly instead. So while emergent abilities may still exist, they should be properly controlled for, and researchers should consider how the chosen metric interacts with the model.
* Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al.)
* While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.
* In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning.
* Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train. See also Interconnects on DPO, and recent Twitter discussions.
* Scaling Data-Constrained Language Models (Muennighoff et al.)
* The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
* 2 minute poster session presentation video
* QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al.)
* This paper proposes QLoRA, a more memory-efficient (but slower) version of LoRA that uses several optimization tricks to save memory. They train a new model, Guanaco, that is fine-tuned only on a single GPU for 24h and outperforms previous models on the Vicuna benchmark. Overall, QLoRA enables using much less GPU memory for fine-tuning LLMs. Concurrently, other methods such as 4-bit LoRA quantization have been developed that achieve similar results.
* DataComp: In search of the next generation of multimodal datasets (Gadre et al.)
* Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets.
* Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
* Visual Instruction Tuning (Liu et al.)
* Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
* By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
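Since DPO shows up all over this recap (and our RLHF episode above), here's a minimal sketch of its loss — our own simplification of the Rafailov et al. objective, assuming you've already computed summed per-sequence log-probs under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss over a batch of preference pairs.

    Each tensor holds the summed log-probability of a full completion
    (chosen = preferred, rejected = dispreferred) under either the policy
    being trained or the frozen reference model. beta plays the role of
    the KL penalty in classic RLHF: it limits drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry again: maximize P(chosen beats rejected) under implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy per-sequence log-probs standing in for real model outputs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-13.1, -11.0]),
    ref_chosen_logps=torch.tensor([-12.5, -10.0]),
    ref_rejected_logps=torch.tensor([-12.9, -10.5]),
)
print(loss)  # scalar; backpropagating this nudges the policy toward chosen outputs
```

Note there's no reward model and no sampling loop here — that's the whole point, and why it's the "GPU Poor" version of RLHF.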
We are running an end of year survey for our listeners! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here!
Listen to the end for a little surprise from Suhail.
Before language models became all the rage in November 2022, image generation was the hottest space in AI (it was the subject of our first piece on Latent Space!) In our interview with Sharif Shameem from Lexica we talked through the launch of Stable Diffusion and the early days of that space. At the time, the toolkit was still pretty rudimentary: Lexica made it easy to search images, you had the AUTOMATIC1111 Web UI to generate locally, some HuggingFace spaces that offered inference, and eventually DALL-E 2 through OpenAI’s platform, but not much beyond basic text-to-image workflows.
Today’s guest, Suhail Doshi, is trying to solve this with Playground AI, an image editor reimagined with AI in mind. Some of the differences compared to traditional text-to-image workflows:
* Real-time preview rendering using consistency models: as you change your prompt, you can see changes in real-time before doing a final rendering of it.
* Style filtering: rather than having to prompt exactly how you’d like an image to look, you can pick from a whole range of filters both from Playground’s model as well as Stable Diffusion (like RealVis, Starlight XL, etc). We talk about this at 25:46 in the podcast.
* Expand prompt: similar to DALL-E 3, Playground will do some prompt tuning for you to get better results in generation. Unlike DALL-E 3, you can turn this off at any time if you are a prompting wizard.
* Image editing: after generation, you have tools like a magic eraser, inpainting pencil, etc. This makes it easier to do a full workflow in Playground rather than switching to another tool like Photoshop.
Outside of the product, they have also trained a new model from scratch, Playground v2, which is fully open source and open weights and allows for commercial usage. They benchmarked the model against SDXL across 1,000 prompts and found that humans preferred the Playground generation 70% of the time. They had similar results on PartiPrompts:
They also created a new benchmark, MJHQ-30K, for “aesthetic quality”:
We introduce a new benchmark, MJHQ-30K, for automatic evaluation of a model’s aesthetic quality. The benchmark computes FID on a high-quality dataset to gauge aesthetic quality. We curate the high-quality dataset from Midjourney with 10 common categories, each category with 3K samples. Following common practice, we use aesthetic score and CLIP score to ensure high image quality and high image-text alignment. Furthermore, we take extra care to make the data diverse within each category.
Suhail was pretty open with saying that Midjourney is currently the best product for image generation out there, and that’s why they used it as the base for this benchmark:
I think it's worth comparing yourself to maybe the best thing and try to find like a really fair way of doing that. So I think more people should try to do that. I definitely don't think you should be kind of comparing yourself on like some Google model or some old SD, Stable Diffusion model and be like, look, we beat Stable Diffusion 1.5. I think users ultimately care: how close are you getting to the thing that people mostly agree with? [00:23:47]
We also talked a lot about Suhail’s founder journey from starting Mixpanel in 2009, then going through YC again with Mighty, and eventually sunsetting that to pivot into Playground.
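For the curious, here's what the mechanics behind an FID-based benchmark like MJHQ-30K look like — a minimal sketch using torchmetrics, with random tensors standing in for the curated Midjourney set and your model's generations (the real benchmark uses the full 30K images and standard Inception features):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-feature statistics of two image sets; lower is better.
# feature=64 keeps this toy example fast; a real run would use the default
# 2048-dim features and far more images for stable statistics.
fid = FrechetInceptionDistance(feature=64)

# Random uint8 tensors standing in for real data, shape (N, 3, 299, 299).
reference_images = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)

fid.update(reference_images, real=True)   # the curated "high aesthetic" set
fid.update(generated_images, real=False)  # your model's generations
print(fid.compute())  # a single scalar; lower means closer to the reference set
```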
Enjoy!
Show Notes
* Suhail’s Twitter
* “Starting my road to learn AI”
* Bill Gates book trip
* Playground
* Playground v2 Announcement
* $40M raise announcement
* “Running infra dev ops for 24 A100s”
* Mixpanel
* Mighty
* “I decided to stop working on Mighty”
* Fast.ai
* Civit
Timestamps
* [00:00:00] Intros
* [00:02:59] Being early in ML at Mixpanel
* [00:04:16] Pivoting from Mighty to Playground and focusing on generative AI
* [00:07:54] How DALL-E 2 inspired Mighty
* [00:09:19] Reimagining the graphics editor with AI
* [00:17:34] Training the Playground V2 model from scratch to advance generative graphics
* [00:21:11] Techniques used to improve Playground V2 like data filtering and model tuning
* [00:25:21] Releasing the MJHQ30K benchmark to evaluate generative models
* [00:30:35] The limitations of current models for detailed image editing tasks
* [00:34:06] Using post-generation user feedback to create better benchmarks
* [00:38:28] Concerns over potential misuse of powerful generative models
* [00:41:54] Rethinking the graphics editor user experience in the AI era
* [00:45:44] Integrating consistency models into Playground using preview rendering
* [00:47:23] Interacting with the Stable Diffusion LoRAs community
* [00:51:35] Running DevOps on A100s
* [00:53:12] Startup ideas?
Transcript
Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:15]
Swyx: Hey, and today in the studio we have Suhail Doshi, welcome. [00:00:18]
Suhail: Yeah, thanks. Thanks for having me. [00:00:20]
Swyx: So among many things, you're a CEO and co-founder of Mixpanel, and I think about three years ago you left to start Mighty, and more recently, I think about a year ago, transitioned into Playground, and you've just announced your new round. How do you like to be introduced beyond that? [00:00:34]
Suhail: Just founder of Playground is fine, yeah, prior co-founder and CEO of Mixpanel. [00:00:40]
Swyx: Yeah, awesome. I'd just like to touch on Mixpanel a little bit, because it's obviously one of the more successful analytics companies (we previously had Amplitude on), and I'm curious if you had any reflections on the interaction of that amount of data that people would want to use for AI. I don't know if there's still a part of you that stays in touch with that world. [00:00:59]
Suhail: Yeah, I mean, the short version is that maybe back in like 2015 or 2016, I don't really remember exactly, because it was a while ago, we had an ML team at Mixpanel, and I think this is when maybe deep learning or something really just started getting kind of exciting, and we were thinking that maybe given that we had such vast amounts of data, perhaps we could predict things. So we built two or three different features, I think we built a feature where we could predict whether users would churn from your product. We made a feature that could predict whether users would convert, we built a feature that could do anomaly detection, like if something occurred in your product, that was just very surprising, maybe a spike in traffic in a particular region, can we tell you that that happened? Because it's really hard to like know everything that's going on with your data, can we tell you something surprising about your data? And we tried all of these various features, most of it boiled down to just like, you know, using logistic regression, and it never quite seemed very groundbreaking in the end.
And so I think, you know, we had a four or five person ML team, and I think we never expanded it from there. And I did all these Fast AI courses trying to learn about ML. And that was the- That's the first time you did fast AI. Yeah, that was the first time I did fast AI. Yeah, I think I've done it now three times, maybe. [00:02:12]
Swyx: Oh, okay. [00:02:13]
Suhail: I didn't know it was the third. No, no, just me reviewing it, it's maybe three times, but yeah. [00:02:16]
Swyx: You mentioned prediction, but honestly, like it's also just about the feedback, right? The quality of feedback from users, I think it's useful for anyone building AI applications. [00:02:25]
Suhail: Yeah. Yeah, I think I haven't spent a lot of time thinking about Mixpanel because it's been a long time, but sometimes I'm like, oh, I wonder what we could do now. And then I kind of like move on to whatever I'm working on, but things have changed significantly since. [00:02:39]
Swyx: And then maybe we'll touch on Mighty a little bit. Mighty was very, very bold. My framing of it was, you will run our browsers for us, because everyone has too many tabs open — I have too many tabs open — and they're slowing down our machines, and you can do it better for us in a centralized data center. [00:02:51]
Suhail: Yeah, we were first trying to make a browser that we would stream from a data center to your computer at extremely low latency, but the real objective wasn't trying to make a browser or anything like that. The real objective was to try to make a new kind of computer. And the thought was just that like, you know, we have these computers in front of us today and we upgrade them or they run out of RAM or they don't have enough RAM or not enough disk or, you know, there's some limitation with our computers, perhaps like data locality is a problem. Why do I need to think about upgrading my computer ever? And so, you know, we just had to kind of observe that like, well, actually it seems like a lot of applications are just now in the browser, you know, it's like how many real desktop applications do we use relative to the number of applications we use in the browser? So it's just this realization that actually like, you know, the browser was effectively becoming more or less our operating system over time. And so then that's why we kind of decided to go, hmm, maybe we can stream the browser. Fortunately, the idea did not work for a couple of different reasons, but the objective was to try to make a new kind of computer. [00:03:50]
Swyx: Yeah, very, very bold. [00:03:51]
Alessio: Yeah, and I was there at YC Demo Day when you first announced it. It was, I think, the last or one of the last in-person ones, at Pier 34 in Mission Bay. How do you think about that now when everybody wants to put some of these models in people's machines and some of them want to stream them in, do
We are running an end of year survey for our listeners. Let us know any feedback you have for us, what episodes resonated with you the most, and guest requests for 2024!
RAG has emerged as one of the key pieces of the AI Engineer stack. Jerry from LlamaIndex called it a “hack”, Bryan from Hex compared it to “a recommendation system from LLMs”, and even LangChain started with it.
RAG is crucial in any AI coding workflow. We talked about context quality for code in our Phind episode. Today’s guests, Beyang Liu and Steve Yegge from Sourcegraph, have been focused on code indexing and retrieval for over 15 years. We locked them in our new studio to record a 1.5-hour masterclass on the history of code search, retrieval interfaces for code, and how they get a SOTA 30% completion acceptance rate in their Cody product by being better at the “bin packing problem” of LLM context generation.
Google Grok → Sourcegraph → Cody
While at Google in 2008, Steve built Grok, which lives on today as Google Kythe. It allowed engineers to do code parsing and searching across different codebases and programming languages. (You might remember the infamous Google Platforms Rant from Steve’s time at Google, and his 2021 followup on GCP.) Beyang was an intern at Google at the same time, and Grok became the inspiration to start Sourcegraph in 2013. The two didn’t know each other personally until Beyang brought Steve out of retirement 9 years later to join him as VP Engineering. Fast forward 10 years, Sourcegraph has become the best code search tool out there and raised $223M along the way.
Nine months ago, they open sourced Sourcegraph Cody, their AI coding assistant. All their code indexing and search infrastructure allows them to get SOTA results by having better RAG than competitors:
* Code completions as you type that achieve an industry-best Completion Acceptance Rate (CAR) as high as 30% using a context-enhanced open-source LLM (StarCoder)
* Context-aware chat that provides the option of using GPT-4 Turbo, Claude 2, GPT-3.5 Turbo, Mixtral 8x7B, or Claude Instant, with more model integrations planned
* Doc and unit test generation, along with AI quick fixes for common coding errors
* AI-enhanced natural language code search, powered by a hybrid dense/sparse vector search engine
There are a few pieces of infrastructure that helped Cody achieve these results:
Dense-sparse vector retrieval system
For many people, RAG = vector similarity search, but there’s a lot more that you can do to get the best possible results. From their release:
"Sparse vector search" is a fancy name for keyword search that potentially incorporates LLMs for things like ranking and term expansion (e.g., "k8s" expands to "Kubernetes container orchestration", possibly weighted as in SPLADE):
* Dense vector retrieval makes use of embeddings, the internal representation that LLMs use to represent text. Dense vector retrieval provides recall over a broader set of results that may have no exact keyword matches but are still semantically similar.
* Sparse vector retrieval is very fast, human-understandable, and yields high recall of results that closely match the user query.
* We've found the approaches to be complementary.
There’s a very good blog post by Pinecone on SPLADE for sparse vector search if you’re interested in diving in. If you’re building RAG applications in areas that have a lot of industry-specific nomenclature, acronyms, etc, this is a good approach to getting better results.
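To make "hybrid" concrete, one common way to fuse a sparse (keyword) ranking with a dense (embedding) ranking is reciprocal rank fusion. This is a generic sketch — not Sourcegraph's actual fusion logic — with hypothetical file names standing in for retrieved documents:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one.

    Documents that rank well in any retriever get boosted; k dampens the
    influence of any single list's top ranks.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the query "k8s deployment config":
sparse_hits = ["deploy.yaml", "k8s/README.md", "Makefile"]               # keyword match
dense_hits = ["k8s/README.md", "docs/orchestration.md", "deploy.yaml"]   # semantic match

print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
# Documents both retrievers agree on (k8s/README.md, deploy.yaml) float to the top.
```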
SCIP
In 2016, Microsoft announced the Language Server Protocol (LSP), later followed by the Language Server Index Format (LSIF). This protocol makes it easy for IDEs to get all the context they need from a codebase to get things like file search, references, “go to definition”, etc.
Sourcegraph developed SCIP, “a better code indexing format than LSIF”:
* Simpler and More Efficient Format: SCIP utilizes Protobuf instead of JSON, which is used by LSIF. Protobuf is more space-efficient, simpler, and more suitable for systems programming.
* Better Performance and Smaller Index Sizes: SCIP indexers, such as scip-clang, show enhanced performance and reduced index file sizes compared to LSIF indexers (10%-20% smaller)
* Easier to Develop and Debug: SCIP's design, centered around human-readable string IDs for symbols, makes it faster and more straightforward to develop new language indexers.
Having more efficient indexing is key to more performant RAG on code.
Show Notes
* Sourcegraph
* Cody
* Copilot vs Cody
* Steve’s Stanford seminar on Grok
* Steve’s blog
* Grab
* Fireworks
* Peter Norvig
* Noam Chomsky
* Code search
* Kelly Norton
* Zoekt
* v0.dev
See also our past episodes on Cursor, Phind, Codeium and Codium as well as the GitHub Copilot keynote at AI Engineer Summit.
Timestamps
* [00:00:00] Intros & Backgrounds
* [00:05:20] How Steve's work on Grok inspired Sourcegraph for Beyang
* [00:08:10] What's Cody?
* [00:11:22] Comparison of coding assistants and the capabilities of Cody
* [00:16:00] The importance of context (RAG) in AI coding tools
* [00:21:33] The debate between Chomsky and Norvig approaches in AI
* [00:30:06] Normsky: the Norvig + Chomsky models collision
* [00:36:00] The death of the DSL?
* [00:40:00] LSP, SCIP, Kythe, BFG, and all that fun stuff
* [00:53:00] The Sourcegraph internal stack
* [00:58:46] Building on open source models
* [01:02:00] Sourcegraph for engineering managers?
* [01:12:00] Lightning Round
Transcript
Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:16]
Swyx: Hey, and today we're christening our new podcast studio in the Newton, and we have Beyang and Steve from Sourcegraph. Welcome. [00:00:25]
Beyang: Hey, thanks for having us. [00:00:26]
Swyx: So this has been a long time coming. I'm very excited to have you. We also are just celebrating the one year anniversary of ChatGPT yesterday, but also we'll be talking about the GA of Cody later on today. We'll just do a quick intros of both of you. Obviously, people can research you and check the show notes for more. Beyang, you worked in computer vision at Stanford and then you worked at Palantir. I did, yeah. You also interned at Google. [00:00:48]
Beyang: I did, back in the day, where I got to use Steve's system, DevTool. [00:00:53]
Swyx: Right. What was it called? [00:00:55]
Beyang: It was called Grok. Well, the end user thing was Google Code Search. That's what everyone called it, or just like CS. But the brains of it were really the kind of like Trigram index and then Grok, which provided the reference graph. [00:01:07]
Steve: Today it's called Kythe, the open source Google one. It's sort of like Grok v3. [00:01:11]
Swyx: On your podcast, which you've had me on, you've interviewed a bunch of other code search developers, including the current developer of Kythe, right? [00:01:19]
Beyang: No, we didn't have any Kythe people on, although we would love to if they're up for it.
We had Kelly Norton, who built a similar system at Etsy, it's an open source project called Hound. We also had Han-Wen Nienhuys, who created Zoekt, which is, I think, heavily inspired by the Trigram index that powered Google's original code search and that we also now use at Sourcegraph. Yeah. [00:01:45] Swyx: So you teamed up with Quinn over 10 years ago to start Sourcegraph and you were indexing all code on the internet. And now you're in a perfect spot to create a code intelligence startup. Yeah, yeah. [00:01:56] Beyang: I guess the backstory was, I used Google Code Search while I was an intern. And then after I left that internship and worked elsewhere, it was the single dev tool that I missed the most. I felt like my job was just a lot more tedious and much more of a hassle without it. And so when Quinn and I started working together at Palantir, he had also used various code search engines in open source over the years. And it was just a pain point that we both felt, both working on code at Palantir and also working within Palantir's clients, which were a lot of Fortune 500 companies, large financial institutions, folks like that. And if anything, the pains they felt in dealing with large complex code bases made our pain points feel small by comparison. So that was really the impetus for starting Sourcegraph. [00:02:42] Swyx: Yeah, excellent. Steve, you famously worked at Amazon. And you've told many, many stories. I want every single listener of Latent Space to check out Steve's YouTube because he effectively had a podcast that you didn't tell anyone about or something. You just hit record and just went on a few rants. I'm always here for your Stevie rants. And then you moved to Google, where you also had some interesting thoughts on just the overall Google culture versus Amazon. You joined Grab as head of eng for a couple of years. I'm from Singapore, so I have actually personally used a lot of Grab's features. And it was very interesting to see you talk so highly of Grab's engineering and sort of overall prospects. [00:03:21] Steve: Because as a customer, it sucked? [00:03:22] Swyx: Yeah, no, it's just like, being from a smaller country, you never see anyone from our home country being on a global stage or talked about as a startup that people admire or look up to, like on the league that you, with all your legendary experience, would consider equivalent. Yeah. [00:03:41] Steve: Yeah, no, absolutely. They actually, they didn't even know that they were as good as they were, in a sense. They started hiring a bunch of people from Silicon Valley to come in and sort of like fix it. And we came in and we were like, Oh, we could have been a little better operational excellence and stuff. But by and large, they're really sharp. The only thing about Grab is that they get criticized a lot for being too westernized. Oh, by who? By Singaporeans who don't want
The Latent Space crew will be at NeurIPS on Tuesday! Reach out with any parties and papers of interest. We have also been incubating a smol daily AI Newsletter and Latent Space University is making progress.
Good open models like Llama 2 and Mistral 7B (whose creator Mistral AI has just released an 8x7B MoE model) have enabled their own sub-industry of finetuned variants for a myriad of reasons:
* Ownership & Control - you take responsibility for serving the models
* Privacy - not having to send data to a third party vendor
* Customization - improving some attribute (censorship, multiturn chat and chain of thought, roleplaying) or benchmark performance (without cheating)
Related to improving benchmark performance is the ability to use smaller (7B, 13B) models that match the performance of larger models, which has both cost and inference latency benefits.
Core to all this work is finetuning, and the emergent finetuning library of choice has been Wing Lian’s Axolotl.
Axolotl
Axolotl is an LLM fine-tuner supporting SotA techniques and optimizations for a variety of common model architectures:
It is used by many of the leading open source models:
* Teknium: OpenHermes, Trismegistus, CollectiveCognition
* OpenOrca: Mistral-OpenOrca, Mistral-SlimOrca
* Nous Research: Puffin, Capybara, NousHermes
* Pygmalion: Mythalion, Pygmalion
* Eric Hartford: Dolphin, Samantha
* DiscoResearch: DiscoLM 120B & 70B
* OpenAccess AI Collective: Manticore, Minotaur, Jackalope, Hippogriff
As finetuning is very format-dependent, it also provides prompt interfaces and formatters between a range of popular model formats, from Stanford’s Alpaca and Steven Tey’s ShareGPT (which led to Vicuna) to the more NSFW Pygmalion community.
Nous Research Meetup
We last talked about Nous at the DevDay Recap at the e/acc “banger rave”. We met Wing at the Nous Research meetup at the a16z offices in San Francisco, where they officially announced their company and future plans:
Including Nous Forge:
Show Notes
We’ve already covered the nuances of Dataset Contamination and the problems with “Open Source” in AI, so we won’t rehash those topics here but do read/listen to those if you missed it.
* Axolotl GitHub and Discord
* The Flan paper and dataset
* StackLlama model and blogpost
* Multipack paper
* Our episode with Tri Dao
* Mamba state space models - Tri Dao and Albert Gu
Timestamps
* [00:00:00] Introducing Wing
* [00:02:34] SF Open Source AI Meetup
* [00:04:09] What is Axolotl?
* [00:08:01] What is finetuning?
* [00:08:52] Open Source Model Zoo
* [00:10:53] Benchmarks and Contamination
* [00:14:29] The Case for Open Source AI
* [00:17:34] Orca and OpenOrca
* [00:23:36] DiscoLM and Model Stacking
* [00:25:07] Datasets and Evals over Models
* [00:29:15] Distilling from GPT4
* [00:33:31] Finetuning - LoRA, QLoRA, ReLoRA, GPTQ
* [00:41:55] Axolotl vs HF Transformers
* [00:48:00] 20x efficiency with StackLlama and Multipack
* [00:54:47] Tri Dao and Mamba
* [00:59:08] Roadmap for Axolotl
* [01:01:20] The Open Source AI Community
Transcript
[00:00:00] Introducing Wing Lian
[00:00:00] swyx: Welcome to Latent Space, a special edition with Wing Lian, but also with our new guest host, Alex. Hello, hello. Welcome, welcome. Again, needs no introduction. I think it's like your sixth time on Latent Space already. I think so, yeah. And welcome, Wing. We just met, but you've been very prolific online. Thanks for having me. [00:00:30] Yeah. So you are in town. You're not local. You're in town. You're from Minneapolis? [00:00:35]
Wing Lian: Annapolis.
Annapolis. It's funny because a lot of people think it's Indianapolis, or Minneapolis, but I used to live at least in the San Francisco Bay Area years ago, from like 2008 to 2014, so it's fairly familiar here. [00:00:50]
swyx: Yep. You're the maintainer of Axolotl now, which we'll get into. You're very, very prolific in the open source AI community, and you're also the founder of the Open Access AI Collective. Yeah. Cool. Awesome. Maybe we can go over a little bit of your backgrounds into tech and then coming into AI, and then we'll cover what [00:01:06]
Wing Lian: happens and why you're here. [00:01:08]
Yeah. So. Back on tech, so I started years ago, I started way back when I was scraping apartment websites for listings and then, and then building like SEO optimized pages and then just throwing Google AdSense on it. [00:01:24]
And that got me through like college basically. Is [00:01:27]
swyx: that decent money? And what year [00:01:28]
Wing Lian: was this? Like 2004, 2005. Yeah, that's decent money. It's like a thousand bucks a month. But as a college student, that's like. Gravy. Really good money, right? So, and then there's just too much competition. It just sort of like died off. I was writing stuff in like Perl back then, using like, like who, nobody hosted anything on Perl anymore, right? Still did a little bit more like computer tech support and then software, and web more professionally. [00:01:54]
So I spent some time working on applications in the blood industry. I came out to San Francisco for, I was at SGN, so Social Gaming Network, as a startup. They started doing, with Facebook apps, and then they pivoted into doing mobile apps. And then, from there, I spent time. [00:02:14]
I've been at quite a few more startups since then, and in the last few years I've been in the music space. So like I was at United Masters for a while and then past year I've been at SoundCloud, but not doing that anymore, and now that I have a lot more time it's just like, all right. [00:02:30]
We're going full bore on Axolotl and we're gonna, we're gonna crush AI. So yeah, [00:02:34]
SF Open Source AI Meetup
[00:02:34] swyx: totally. You so you're here in town for the open source. Yeah, I meet up that we had yesterday. Yep, yeah, that was amazing. Yeah, it was a big collection. Ollama, Nous Research, Alignment Lab. Anyone else that I missed? I mean, Jeremy Howard is his own thing. [00:02:47]
Yeah. [00:02:49]
And Alex, you're also there. You love to bring SF to the world. Your takes? [00:02:55]
Alex Volkov: It's incredible that we recorded a ThursdAI episode after that one. And LDJ, who usually co-hosts ThursdAI, just like briefly mentioned, Oh yeah, I talked about it. [00:03:04]
Like, I saw Karpathy, and then I talked to Jeremy Howard, and the guy from Mistral came in, and it's like, he's talking about all these, titans of industry, basically, that outside of SF, you just don't meet casually hanging out in the same space. You can't, pull somebody. He ran into the Laylow from Mistral, he ran into him while, drinking water. [00:03:20]
He didn't even know he was there. It's just, that type of stuff is really hard to find outside of SF. So, absolutely, absolutely great. And also, presentations from Alignment Lab, presentations from Nous Research, who talked about Forge, and some of [00:03:33]
swyx: the other stuff they announced. We can say now they're officially a company. [00:03:36]
I met Teknium. [00:03:37]
He [00:03:37]
Alex Volkov: came over here. He didn't want to get recorded.
But maybe. [00:03:41]
Wing Lian: We'll wear him down at some point. Yeah, I'm excited for Forge. They've positioned it as this agentic sort of framework where it's just drag and drop things and, fill in text with where you want to inject different variables, and it opens up all of these potentials for data pipelines now, right? [00:03:56]
And using your own local LLMs and not relying on GPT-4 or anything like that. Yeah, yeah, [00:04:02]
swyx: good stuff. Okay, so let's maybe go into the Axolotl origin story and then we have, we have some intro or background. [00:04:09]
What is Axolotl?
[00:04:09] swyx: To do on like the open source model universe and also on fine tuning, but maybe just, since you're talking about your personal journey, what was your personal journey into [00:04:18]
Wing Lian: Axolotl? [00:04:19]
Yeah, so my personal journey started like back in mid March, completely unrelated to AI and Axolotl. And it really started, I fell while skiing and torqued a grade 3 MCL sprain, and being sort of like an active person that can no longer be active — couldn't play soccer, because that requires having knees — until it healed. [00:04:42]
So I decided I needed to find something to do to take up my free time. And that became, well, let's learn how to train these language models. It was everywhere. So I was like, all right, I'm just going to sit down, learn. I think I was using like Alpaca-LoRA. Cause I think the Alpaca paper had just come out then. So I was using the Alpaca-LoRA repo and sort of like learning how to use like. None of us were like GPU rich back then, and most of us are still GPU poor, but I was doing, what was it, like 4-bit — Alpaca-LoRA, there was like a 4-bit version — where we were doing quant, or 8, no, 8-bit quantizations, and then I think they had released QLoRA a little bit later, and I think right when, before QLoRA came out, I was already starting to do fine tunes, but having this need to sort of like mix datasets together, and if you've ever looked at all the various different datasets available on HuggingFace, they all have various different prompt formats, and, it's sort of a nightmare, and then I think the other piece is if you've ever tried to fine tune, at least back then, probably the ecosystem's a little better now. [00:05:54]
Everybody required that you say, alright, you put your hyperparameters as command line arguments. And so it's always like, well, I now have to go copy and paste my previous thing and to change things out. And I really wanted it to be in a YAML file because it was more portable and reproducible. [00:06:09]
So I was doing that and then the QLoRA paper came out. Tim Dettmers announced that and then somebody looked it up for me yesterday and it's like between that ann
Catch us at Modular’s ModCon next week with Chris Lattner, and join our community! 2024 note: Hex is now hiring AI Engineers. Due to Bryan’s very wide ranging experience in data science and AI across Blue Bottle (!), StitchFix, Weights & Biases, and now Hex Magic, this episode can be considered a two-parter. Notebooks = Chat++ We’ve talked a lot about AI UX (in our meetups, writeups, and guest posts), and today we’re excited to dive into a new old player in AI interfaces: notebooks! Depending on your background, you either Don’t Like or you Like notebooks — they are the most popular example of Knuth’s Literate Programming concept, basically a collection of cells; each cell can execute code, display it, and share its state with all the other cells in a notebook. They can also simply be Markdown cells to add commentary to the analysis. Notebooks have a long history but most recently became popular from iPython evolving into Project Jupyter, and a wave of notebook based startups from Observable to DeepNote and Databricks sprung up for the modern data stack. The first wave of AI applications has been very chat focused (ChatGPT, Character.ai, Perplexity, etc). Chat as a user interface has a few shortcomings, the major one being the inability to edit previous messages. We enjoyed Bryan’s takes on why notebooks feel like “Chat++” and how they are building Hex Magic: * Atomic actions vs Stream of consciousness: in a chat interface, you make corrections by adding more messages to a conversation (i.e. “Can you try again by doing X instead?” or “I actually meant XYZ”). The context can easily get messy and confusing for models (and humans!) to follow. Notebooks’ cell structure on the other hand allows users to go back to any previous cells and make edits without having to add new ones at the bottom. * “Airlocks” for repeatability: one of the ideas they came up with at Hex is “airlocks”, a collection of cells that depend on each other and keep each other in sync. If you have a task like “Create a summary of my customers’ recent purchases”, there are many sub-tasks to be done (look up the data, sum the amounts, write the text, etc). Each sub-task will be in its own cell, and the airlock will keep them all in sync together. * Technical + Non-Technical users: previously you had to use Python / R / Julia to write notebooks code, but with models like GPT-4, natural language is usually enough. Hex is also working on lowering the barrier of entry for non-technical users into notebooks, similar to how Code Interpreter is doing the same in ChatGPT. Obviously notebooks aren’t new for developers (OpenAI Cookbooks are a good example), but haven’t had much adoption in less technical spheres. Some of the shortcomings of chat UIs + LLMs lowering the barrier of entry to creating code cells might make them a much more popular UX going forward. RAG = RecSys! We also talked about the LLMOps landscape and why it’s an “iron mine” rather than a “gold rush”: I'll shamelessly steal [this] from a friend, Adam Azzam from Prefect. He says that [LLMOps] is more of like an iron mine than a gold mine in the sense of there is a lot of work to extract this precious, precious resource. Don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable is significant. Some of my favorite takeaways: * RAG as RecSys for LLMs: at its core, the goal of a RAG pipeline is finding the most relevant documents based on a task. 
This isn’t very different from traditional recommendation system products that surface things for users. How can we apply old lessons to this new problem? Bryan cites fellow AIE Summit speaker and Latent Space Paper Club host Eugene Yan in decomposing the retrieval problem into retrieval, filtering, and scoring/ranking/ordering: As AI Engineers increasingly find that long context has tradeoffs, they will also have to relearn age-old lessons: that vector search is NOT all you need, and that a good “systems, not models” approach is essential to scalable/debuggable RAG. Good thing Bryan has just written the first O’Reilly book about modern RecSys, eh? * Narrowing down evaluation: while “hallucination” is an easy term to throw around, the reality is more nuanced. A lot of times, model errors can be automatically fixed: is this JSON valid? If not, why? Is it just missing a closing brace? These smaller issues can be checked and fixed before returning the response to the user, which is easier than fixing the model. * Fine-tuning isn’t all you need: when they first started building Magic, one of the discussions was around fine-tuning a model. In our episode with Jeremy Howard we talked about how fine-tuning leads to loss of capabilities as well. In notebooks, you are often dealing with domain-specific data (i.e. purchases, orders, wardrobe composition, household items, etc); the fact that the model understands that “items” are probably part of an “order” is really helpful. They have found that GPT-4 + 3.5-turbo were everything they needed to ship a great product rather than having to fine-tune on notebooks specifically. Definitely recommend listening to this one if you are interested in getting a better understanding of how to think about AI, data, and how we can use traditional machine learning lessons in large language models. The AI Pivot For more Bryan, don’t miss his fireside chat at the AI Engineer Summit: Show Notes * Hex Magic * Bryan’s new book: Building Recommendation Systems in Python and JAX * Bryan’s whitepaper about MLOps * “Kitbashing in ML”, slides from his talk on building on top of foundation models * “Bayesian Statistics The Fun Way” by Will Kurt * Bryan’s Twitter * “Berkeley man determined to walk every street in his city” * People: * Adam Azzam * Graham Neubig * Eugene Yan * Even Oldridge Timestamps * [00:00:00] Bryan’s background * [00:02:34] Overview of Hex and the Magic product * [00:05:57] How Magic handles the complex notebook format to integrate cleanly with Hex * [00:08:37] Discussion of whether to build vs buy models - why Hex uses GPT-4 vs fine-tuning * [00:13:06] UX design for Magic with Hex's notebook format (aka “Chat++”) * [00:18:37] Expanding notebooks to less technical users * [00:23:46] The "Memex" as an exciting underexplored area - personal knowledge graph and memory augmentation * [00:27:02] What makes for good LLMOps vs MLOps * [00:34:53] Building rigorous evaluators for Magic and best practices * [00:36:52] Different types of metrics for LLM evaluation beyond just end task accuracy * [00:39:19] Evaluation strategy when you don't own the core model that's being evaluated * [00:41:49] All the places you can make improvements outside of retraining the core LLM * [00:45:00] Lightning Round Transcript Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners, and today I'm joined by Bryan Bischof. [00:00:15] Bryan: Hey, nice to meet you.
[00:00:17] Alessio: So Bryan has one of the most thorough and impressive backgrounds we had on the show so far. Lead software engineer at Blue Bottle Coffee, which if you live in San Francisco, you know a lot about. And maybe you'll tell us 30 seconds on what that actually means. You worked as a data scientist at Stitch Fix, which used to be one of the premier data science teams out there. [00:00:38] Bryan: It used to be. Ouch. [00:00:39] Alessio: Well, no, no. Well, you left, you know, so how good can it still be? Then head of data science at Weights and Biases. You're also a professor at Rutgers and you're just wrapping up a new O'Reilly book as well. So a lot, a lot going on. Yeah. [00:00:52] Bryan: And currently head of AI at Hex. [00:00:54] Alessio: Let's do the Blue Bottle thing because I definitely want to hear what's the, what's that like? [00:00:58] Bryan: So I was leading data at Blue Bottle. I was the first data hire. I came in to kind of get the data warehouse in order and then see what we could build on top of it. But ultimately I mostly focused on demand forecasting, a little bit of recsys, a little bit of sort of like website optimization and analytics. But ultimately anything that you could imagine sort of like a retail company needing to do with their data, we had to do. I sort of like led that team, hired a few people, expanded it out. One interesting thing was I was part of the Nestle acquisition. So there was a period of time where we were sort of preparing for that and didn't know, which was a really interesting dynamic. Being acquired is a very not necessarily fun experience for the data team. [00:01:37] Alessio: I build a lot of internal tools for sourcing at the firm and we have a small VCs and data community of like other people doing it. And I feel like if you had a data feed into like the Blue Bottle in South Park, the Blue Bottle at the Hanahaus in Palo Alto, you can get a lot of secondhand information on the state of VC funding. [00:01:54] Bryan: Oh yeah. I feel like the real source of alpha is just bugging a Blue Bottle. [00:01:58] Alessio: Exactly. And what's your latest book about? [00:02:02] Bryan: I just wrapped up a book with a coauthor Hector Yee called Building Production Recommendation Systems. I'll give you the rest of the title because it's fun. It's in Python and JAX. And so for those of you that are like eagerly awaiting the first O'Reilly book that focuses on JAX, here you go. [00:02:17] Alessio: Awesome. And we'll chat about that later on. But let's maybe talk about Hex and Magic before. I've known Hex for a while, I've used it as a notebook provider and you've been working on a lot of amazing AI enabled experiences. So maybe run us through that. [00:02:34] Bryan: So I too, before I sort of like joined Hex, saw it as this like really incredible notebook platform,
This episode came together at ~4 hrs notice since Dylan had just landed in SF and we had to set up quickly; you might notice some small audio issues in some segments, we apologize. We’re currently building our own podcast studio for 2024! 🙏 We’re ramping up our presence on Twitter and YouTube if you’d like to support us. Note: 17k people joined our emergency pod on Sam Altman’s ouster today. If Charles Dickens were alive in 2024, A Tale of Two Cities might be the divide between the “GPU poor” and the “GPU rich”. We mentioned these terms in some of our previous episodes; they were originally coined by Dylan Patel of SemiAnalysis in his “Gemini Eats the World” post, put on blast by Sam Altman. SemiAnalysis are one of the most in-depth research and consulting firms in the semis world, and have a unique insight into the design, production, and supply chain of GPUs based on their ground presence in Asia. In this episode we break down the State of Silicon: when are more GPUs coming? Are there real GPU alternatives on the way? Should Microsoft buy AMD chips just to scare Jensen? Is there a “GPU poor is beautiful” manifesto? The supply wave is coming The GPU shortage is the talk of the town in the Bay Area, but next year looks a lot better in terms of AI accelerator capacity: * NVIDIA is forecasted to sell over 3 million GPUs next year, about 3x their 2023 sales of about 1 million H100s. * AMD is forecasting $2B of sales for their new MI300X datacenter GPU. They are also indirectly getting a boost from the work that companies like Modular and tiny are doing in making it easier to actually use these chips (will ROCm ever catch up?) * Google’s TPUv5 supply is going to increase rapidly going into 2024 * Microsoft just announced Maia 100, a new AI accelerator built “with feedback” from OpenAI. In the episode we dove deeper into what this means for each of these companies and the GPU consumers, but the TLDR (sadly) is that capacity increases, while the FLOPs required to train the next generation of models will eclipse those of previous generations. GPT-3 took 4,000x more FLOPs to train than GPT-2. Dylan estimates GPT-4 was trained on 20,000 A100s for ~$500M all-in; how much will OpenAI spend to train GPT-5? How many GPUs will need to go brrr? In the meantime, the number of companies looking for GPUs has increased, with Meta rising as one of the de-facto top 3 AI labs in terms of capacity. The pressure to acquire more chips will not ease in 2024. We also talked about some of the companies trying to displace traditional GPU architectures: MatX, Lemurian Labs, Cerebras, etc. The different variables they are fighting on are the size of SRAM vs HBM, focusing on memory bandwidth vs memory size, different math representations for kernels, etc., and the key to this market is whether or not the transformer architecture will still be #1 in the future. Surviving in the GPU Poor lane A lot of the smaller companies (when compared to $1T+ giants, it’s all relative) are trying hard to fight against the GPU rich, but they can’t quite offer the same scale: * HuggingFace is trying to launch a training cluster as a service, but it seems to just be a software wrapper around NVIDIA’s DGX Cloud, as they don’t actually own that much GPU supply. The max option for GPUs in their request form is 1,000. * Databricks’ “GPU-enabled clusters” run on AWS, and the largest one listed there is only powered by 8 NVIDIA A10Gs.
The Mosaic team is also doing research on running on AMD cards with some promising results, but they seem to be pushing up to just 128 cards, which isn’t much. * Together actually has 4,424 H100s live in production, which is quite sizable but still nothing compared to the 100,000 that Meta is putting online. Take LLaMA2 as an example; the 70B model was trained on 2T tokens. Using the highest accelerator count on HuggingFace it’d take ~43 days to train the model from scratch and it’d cost ~$2M. That doesn’t include all the data and prep work. In the meantime, Zuck is probably burning tens of thousands of H100s to train LLaMA3, which will surely have much higher performance than whatever a GPU poor company can train in the same time span. The good news is that there’s a ton of opportunity for the GPU poors to shine, especially around fine-tuning. Most of the open source models coming out are one-size-fits-all, and there’s a ton of opportunity for startups to take them and tailor them to their customers, or to specific tasks or use cases to build vertical applications. The other area of improvement is data quality; Mistral showed how you can build a high quality small model with fewer FLOPs by feeding it better data. The key to differentiation won’t be GPUs, but tokens. Show Notes * SemiAnalysis * Google Gemini Eats The World – Gemini Smashes GPT-4 By 5X, The GPU-Poors * How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0 * AMD MI300 – Taming The Hype – AI Performance, Volume Ramp, Customers, Cost, IO, Networking, Software * @sama: incredible google got that semianalysis guy to publish their internal marketing/recruiting chart lol * Mellanox * MatX * Lemurian Labs * Cerebras * For SRAM / HBM, see our FlashAttention episode * Suggested readings: * Moore's Law: The Life of Gordon Moore, Silicon Valley's Quiet Revolutionary * Chip War by Chris Miller Chapters * Introduction [00:00:00] * Importance of infrastructure for tech companies [00:01:11] * Training costs are irrelevant [00:03:06] * Worldview of GPU-poor vs GPU-rich [00:04:01] * Google's TPU infrastructure [00:08:12] * Alternative hardware like Cerebras and Graphcore [00:17:37] * Partnerships between labs and hardware companies [00:37:15] * Apple's potential in AI [00:40:56] * Concerns over China and Taiwan [00:41:02] * Feasibility of rebuilding the semiconductor supply chain in the US [00:43:22] * Foundational semiconductor readings [00:46:09] * NVIDIA's pivot to AI [00:47:40] * Dylan's writing process [00:48:17] * Using multiple data centers for distributed AI training [00:52:36] Transcript Alessio: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host Swyx, founder of Smol AI. [00:00:16] Swyx: And today we have Dylan Patel, and welcome. So you are the author of the extremely popular SemiAnalysis blog. We have both had a little bit of claim to fame in breaking details of GPT-4. George Hotz came on our pod and talked about the mixture of experts thing and then you had a lot more detail. [00:00:29] Dylan: To be clear, I talked about mixture of experts in January, it's just people didn't really notice it. Yeah. I guess. [00:00:35] Swyx: I don't know. You went into a lot more detail and I'd love to dig into some of that. [00:00:38] Dylan: Yeah. Thank you so much. I've been doing consulting in the industry, the semiconductor industry, since '17.
In 2021 I got bored, and in November I started writing a blog, and then in 2022 it was going well and I started hiring folks for my firm. And then all of a sudden 2023 happens and it's like the perfect intersection. I used to do data science, but not like AI, not really; multivariable regression is not AI, right? But also I've been involved in the semiconductor industry for a long, long time, posting about it online since I was 12, right? You know, all of a sudden this all kind of came to fruition. So it's cool to have the blog sort of blow up in that way. [00:01:11] Swyx: I used to cover semis at Balyasny as well. And for a long time, it was just the mobile cycle. And then a little bit of PCs, but like not that much. And then maybe some cloud stuff, you know, like public cloud, you know, semiconductor stuff. But it really wasn't anything until this wave. And I was actually listening to you on one of the previous podcasts that you've done. And it was surprising that high-performance computing also kind of didn't really take off. Like AI is just the first form of high-performance computing that worked. [00:01:37] Dylan: One of the theses I've had for a long time, that I think people haven't really caught on to but it's coming to fruition now, is that for the largest tech companies in the world, their software is important, but actually having and operating a very efficient infrastructure is incredibly important. And so, you know, people talk about, you know, hey, Amazon is great, AWS is great because yes, it is easy to use and they've built all these things. But behind the scenes, they've done a lot on the infrastructure that is super custom, that Microsoft Azure and Google Cloud just don't even match in terms of efficiency. If you think about the cost to rent out SSD space, or the cost to offer a database service on top of that, or obviously the cost to rent out a certain level of CPU performance, Amazon has a massive advantage there. And likewise, Google spent all this time doing that in AI, right, with their TPUs and infrastructure there and optical switches and all this sort of stuff. And so in the past, it wasn't immediately obvious. I think with AI, especially with how scaling laws are going, infrastructure is so much more important. And then when you just think about software cost, right, like the cost structure of it, there was always a bigger component of R&D in SaaS businesses. You know, all over SF, all these SaaS businesses did crazy good because, you know, they just scale as they grow, and then all of a sudden they're so freaking profitable for each incremental new customer. And AI software looks like it's going to be very different, in my opinion, right? The R&D cost is much lower in terms of people, but the cost of goods sold, in terms of actually operating the service, I think will be much higher. And so in that same sense, infrastructure matters a ton. [00:03:02] Swyx: And I t
We left a high amount of background audio in the Dev Day podcast, which many of you loved, but we definitely understand that some of you may have had trouble with it. Listener Klaus Breyer ran it through Auphonic with speech isolation and we figured we’d upload it as a backdated pod for people who prefer this. Of course it means that our speakers sound out of place, since they now sound like they are talking loudly in a quiet room. Let us know in the comments what you think! Timestamps (the cleaned part is only Part II): * [00:55:09] Part II: Spot Interviews * [00:55:59] Jim Fan (Nvidia) - High Level Takeaways * [01:05:19] Raza Habib (Humanloop) - Foundation Model Ops * [01:13:32] Surya Dantuluri (Stealth) - RIP Plugins * [01:20:53] Reid Robinson (Zapier) - AI Actions for GPTs * [01:30:45] Div Garg (MultiOn) - GPT4V for Agents * [01:36:42] Louis Knight-Webb (Bloop.ai) - AI Code Search * [01:48:36] Shreya Rajpal (Guardrails) - Guardrails for LLMs * [01:59:00] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open" * [02:09:39] Rahul Sonwalkar (Julius AI) - Advice for Founders This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
SF folks: join us at the AI Engineer Foundation’s Emergency Hackathon tomorrow and consider the Newton if you’d like to cowork in the heart of the Cerebral Arena. Our community page is up to date as usual! ~800,000 developers watched OpenAI Dev Day, ~8,000 of whom listened along live on our ThursdAI x Latent Space, and ~800 of whom got tickets to attend in person: OpenAI’s first developer conference easily surpassed most people’s lowballed expectations - they simply did everything short of announcing GPT-5, including: * ChatGPT (the consumer facing product) * GPT4 Turbo already in ChatGPT (running faster, with an April 2023 cutoff), all noticed by users weeks before the conference * Model picker eliminated, God Model chooses for you * GPTs - “tailored version of ChatGPT for a specific purpose” - stopping short of “Agents”. With custom instructions, expanded knowledge, and actions, and an intuitive no-code GPT Builder UI (we tried all these on our livestream yesterday and found some issues, but also were able to ship interesting GPTs very quickly) and a GPT store with revenue sharing (an important criticism we focused on in our episode on ChatGPT Plugins) * API (the developer facing product) * APIs for Dall-E 3, GPT4 Vision, Code Interpreter (RIP Advanced Data Analysis), GPT4 Finetuning and (surprise!) Text to Speech * many thought each of these would take much longer to arrive * usable in curl and in playground * BYO Interpreter + Async Agents? * Assistant API: stateful API backing “GPTs” like apps, with support for calling multiple tools in parallel, persistent Threads (storing message history, unlimited context window with some asterisks), and uploading/accessing Files (with a possibly-too-simple RAG algorithm, and expensive pricing) * Whisper 3 announced and open sourced (HuggingFace recap) * Price drops for a bunch of things! * Misc: Custom Models for big spending ($2-3m) customers, Copyright Shield, Satya The progress here feels fast, but it is mostly (incredible) last-mile execution on model capabilities that we already knew to exist. On reflection it is important to understand that the one guiding principle of OpenAI, even more than being Open (we address that in part 2 of today’s pod), is that slow takeoff of AGI is the best scenario for humanity, and that this is what slow takeoff looks like: When introducing GPTs, Sam was careful to assert that “gradual iterative deployment is the best way to address the safety challenges with AI”: This is why, in fact, GPTs and Assistants are intentionally underpowered, and it is a useful exercise to consider what else OpenAI continues to consider dangerous (for example, many people consider a while(true) loop a core driver of an agent, which GPTs conspicuously lack, though Lilian Weng of OpenAI does not). We convened the crew to deliver the best recap of OpenAI Dev Day in Latent Space pod style, with a 1hr deep dive with the Functions pod crew from 5 months ago, and then another hour with past and future guests live from the venue itself, discussing various elements of how these updates affect their thinking and startups. Enjoy! Show Notes * swyx live thread (see pinned messages in Twitter Space for extra links from community) * Newton AI Coworking Interest Form in the heart of the Cerebral Arena Timestamps * [00:00:00] Introduction * [00:01:59] Part I: Latent Space Pod Recap * [00:06:16] GPT4 Turbo and Assistant API * [00:13:45] JSON mode * [00:15:39] Plugins vs GPT Actions * [00:16:48] What is a "GPT"? 
* [00:21:02] Criticism: the God Model * [00:22:48] Criticism: ChatGPT changes * [00:25:59] "GPTs" is a genius marketing move * [00:26:59] RIP Advanced Data Analysis * [00:28:50] GPT Creator as AI Prompt Engineer * [00:31:16] Zapier and Prompt Injection * [00:34:09] Copyright Shield * [00:38:03] Sharable GPTs solve the API distribution issue * [00:39:07] Voice * [00:44:59] Vision * [00:49:48] In person experience * [00:55:11] Part II: Spot Interviews * [00:56:05] Jim Fan (Nvidia) - High Level Takeaways * [01:05:35] Raza Habib (Humanloop) - Foundation Model Ops * [01:13:59] Surya Dantuluri (Stealth) - RIP Plugins * [01:21:20] Reid Robinson (Zapier) - AI Actions for GPTs * [01:31:19] Div Garg (MultiOn) - GPT4V for Agents * [01:37:15] Louis Knight-Webb (Bloop.ai) - AI Code Search * [01:49:21] Shreya Rajpal (Guardrails.ai) - on Hallucinations * [01:59:51] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open" * [02:10:26] Rahul Sonwalkar (Julius AI) - Advice for Founders Transcript [00:00:00] Introduction [00:00:00] swyx: Hey everyone, this is Swyx coming at you live from the Newton, which is in the heart of the Cerebral Arena. It is a new AI co-working space that I and a couple of friends are working out of. There are hot desks available if you're interested, just check the show notes. But otherwise, obviously, it's been 24 hours since the opening of Dev Day, a lot of hot reactions, and a longstanding tradition, one of the longest traditions we've had [00:00:29] on the Latent Space pod, is to convene emergency sessions and record the live thoughts of developers and founders going through and processing it in real time. I think a lot of the role of podcasts isn't to be perfect information delivery channels, but really to be an audio and oral history of what's going on as it happens, while it happens. [00:00:49] So this one's a little unusual. Previously, we only just gathered on Twitter Spaces, and then just had a bunch of people. The last one was the Code Interpreter one, where 22,000 people showed up. But this one is a little bit more complicated because there's an in-person element and then an online element. [00:01:06] So this is a two-part episode. The first part is a recorded session between our Latent Space people and Simon Willison and Alex Volkov from the ThursdAI pod, just kind of recapping the day. But then also, as the second hour, I managed to get a bunch of interviews with previous guests on the pod who we're still friends with, and some new people that we haven't yet had on the pod. [00:01:28] But I wanted to just get their quick reactions, because most of you have known and loved Jim Fan and Div Garg and a bunch of other folks that we interviewed. So I'm excited to introduce to you the broader scope of what it's like to be at OpenAI Dev Day in person, bring you the audio experience, as well as give you some of the thoughts that developers are having as they process the announcements from OpenAI. [00:01:51] So first off, we have the Latent Space Pod recap: one hour on OpenAI Dev Day. [00:01:59] Part I: Latent Space Pod Recap [00:01:59] Alessio: Hey, welcome to the Latent Space Podcast, an emergency edition after OpenAI Dev Day. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and as usual, I'm joined by Swyx, founder of Smol AI. Hey, [00:02:12] swyx: and today we have two special guests with us covering all the latest and greatest. [00:02:17] We love to get our band together and recap things, especially when they're big.
And it seems like every three months we have to do this. So Alex, welcome. From ThursdAI, we've been collaborating a lot on the Twitter Spaces. And welcome Simon, from many, many things, but also, I think you're the first person to make four appearances on our pod. [00:02:37] Oh, wow. I feel privileged. So welcome. Yeah, I think we were all there yesterday. How do we feel? What do you want to kick off with? Maybe Simon, you want to take first and then Alex. Sure. Yeah. I mean, [00:02:47] Simon Willison: yesterday was quite exhausting, quite frankly. I feel like it's going to take us as a community several months just to completely absorb all of the stuff that they dropped on us in one giant batch. [00:02:57] It's particularly impressive considering they launched a ton of features, what, three or four weeks ago? ChatGPT voice and the combined mode and all of that kind of thing. And then they followed up with everything from yesterday. That said, now that I've started digging into the stuff that they released yesterday, some of it is clearly in need of a bit more polish. [00:03:15] You know, the reality of what they released is, I'd say, about 80 percent of what it looked like yesterday, which is still impressive. You know, don't get me wrong. This is an amazing batch of stuff, but there are definitely problems and sharp edges that we need to file off. [00:03:29] And there are things that we still need to figure out before we can take advantage of all of this. [00:03:33] swyx: Yeah, agreed, agreed. And we can go into those sharp edges in a bit. I just want to pop over to Alex. What are your thoughts? [00:03:39] Alex Volkov: So, interestingly, even folks at OpenAI, there's like several booths and help desks so you can go in and ask people about, like, actual changes, and they could follow up with the right people in OpenAI and answer you back, etc. [00:03:52] Even some of them didn't know about all the changes. So I went to the voice and audio booth. And I asked them about, like, hey, is Whisper 3 that was announced by Sam Altman on stage just, like, briefly, will that be open source? Because I'm, you know, I love using Whisper. And they're like, oh, did we open source? [00:04:06] Did we talk about Whisper 3? Like, some of them didn't even know what they were releasing. But overall, I felt it was a very tightly run event. Like, I was really impressed. Shawn, we were sitting in the audience, and you, like, pointed at the clock to me when they finished. They finished, like, on... And this was after, like, doing some extra stuff. [00:04:24] Very, very impressive for a first event. Like I was absolutely like, good job. [00:04:30] swyx: Yeah, apparently it was their first keynote and someone, I think, was it you that told me that this is what happens if you have A president of Y Comb
At the AI Pioneers Summit we announced Latent Space Launchpad, an AI-focused accelerator in partnership with Decibel. If you’re an AI founder or enterprise early adopter, fill out this form and we’ll be in touch with more details. We also have a lot of events coming up as we wrap up the year, so make sure to check out our community events page and come say hi! We previously interviewed the founders of many developer productivity startups embedded in the IDE, like Codium AI, Cursor, and Codeium. We also covered Replit’s (former) SOTA model, replit-code-v1-3b, and most recently had Amjad and Michele announce replit-code-v1_5-3b at the AI Engineer Summit. Much has been speculated about the StackOverflow traffic drop since ChatGPT’s release, but the experience is still not perfect. There’s now a new player in the “search for developers” arena: Phind. Phind’s goal is to help you find answers to your technical questions, and then help you implement them. For example “What should I use to create a frontend for a Python script?” returns a list of frameworks as well as links to the sources. You can then ask follow up questions on specific implementation details, having it write some code for you, etc. They have both a web version and a VS Code integration. They recently were top of Hacker News with the announcement of their latest model, which is now the #1 rated model on the BigCode Leaderboard, beating their previous version: TLDR Cheat Sheet: * Based on CodeLlama-34B, which is trained on 500B tokens * Further fine-tuned on 70B+ high quality code and reasoning tokens * Expanded context window to 16k tokens * 5x faster than GPT-4 (100 tok/s vs 20 tok/s on single stream) * 74.7% HumanEval vs 45% for the base model We’ve talked before about HumanEval being limited in a lot of cases and how it needs to be complemented with “vibe based” evals. Phind thinks of evals along two axes (see the sketch at the end of this writeup): * Context quality: when asking the model to generate code, was the context high quality? Did we put outdated examples in it? Did we retrieve the wrong files? * Result quality: was the code generated correct? Did it follow the instructions I gave it or did it misunderstand some of it? If you have bad results with bad context, you might get to a good result by working on better RAG. If you have good context and a bad result, you either need to work on your prompting or you’ve hit the limits of the model, which leads you to fine-tuning (like they did). Michael was really early to this space and started working on CommonCrawl filtering and indexing back in 2020, which led to a lot of the insights that now power Phind. We talked about that evolution, his experience at YC, how he got Paul Graham to invest in Phind and invite him to dinner at his house, and how Ron Conway connected him with Jensen Huang to get access to more GPUs!
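To make the two-axis framing concrete, here is a minimal sketch of an eval harness organized around it. This is our illustration rather than Phind’s actual code: the sample fields, the labeled `expected_sources`, and the `tests_passed` signal are all assumptions standing in for whatever ground truth and checks you have available.

```python
from dataclasses import dataclass

@dataclass
class EvalSample:
    question: str
    retrieved_context: list[str]  # snippets that were actually put in the prompt
    expected_sources: list[str]   # ground truth: what retrieval *should* have found
    generated_code: str
    tests_passed: bool            # result of running the generated code's tests

def context_ok(s: EvalSample) -> bool:
    # Context axis: did we feed the model everything the task needed?
    return all(src in s.retrieved_context for src in s.expected_sources)

def result_ok(s: EvalSample) -> bool:
    # Result axis: was the generated code actually correct?
    return s.tests_passed

def diagnose(s: EvalSample) -> str:
    if not context_ok(s):
        return "improve retrieval/RAG"           # bad context taints everything
    if not result_ok(s):
        return "improve prompting, or fine-tune" # good context, model hit a limit
    return "good context, good result"

sample = EvalSample(
    question="What should I use to create a frontend for a Python script?",
    retrieved_context=["docs/streamlit.md"],
    expected_sources=["docs/streamlit.md", "docs/gradio.md"],
    generated_code="import streamlit as st",
    tests_passed=True,
)
print(diagnose(sample))  # -> improve retrieval/RAG
```

The useful property is that every failure routes to a different fix, so “the model was wrong” stops being a single undifferentiated bucket.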
Show Notes * Phind * BigScience T0 * InstructGPT Paper * Inception-V3 * LMQL * Marginalia Nu * Mistral AI * People: * Paul Graham (pg) * Ron Conway * Yacine Jernite from HuggingFace * Jeff Delaney Timestamps * [00:00:00] Intros & Michael's early interest in computer vision * [00:03:14] Pivoting to NLP and natural language question answering models * [00:07:20] Building a search engine index of Common Crawl and web pages * [00:11:26] Releasing the first version of Hello based on the search index and BigScience T0 model * [00:14:02] Deciding to focus the search engine specifically for programmers * [00:17:39] Overview of Phind's current product and focus on code reasoning * [00:21:51] The future vision for Phind to go from idea to complete code * [00:24:03] Transitioning to using the GPT-4 model and the impact it had * [00:29:43] Developing the Phind model based on CodeLlama and additional training * [00:32:28] Plans to continue improving the Phind model with open source technologies * [00:43:59] The story of meeting Paul Graham and Ron Conway and how that impacted the company * [00:53:02] How Ron Conway helped them get GPUs from Nvidia * [00:57:12] Tips on how Michael learns complex AI topics * [01:01:12] Lightning Round Transcript Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO of Residence and Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:19] Swyx: Hey, and today we have in the studio Michael Royzen from Phind. Welcome. [00:00:23] Michael: Thank you so much. [00:00:24] Alessio: It's great to be here. [00:00:25] Swyx: Yeah, we are recording this in a surprisingly hot October in San Francisco. And sometimes the studio works, but the blue angels are flying by right now, so sorry about the noise. So welcome. I've seen Phind blow up this year, mostly, I think since your launch in Feb and V2 and then your Hacker News posts. We tend to like to introduce our guests, but then obviously you can fill in the blanks with the origin story. You actually were a high school entrepreneur. You started SmartLens, which is a computer vision startup in 2017. [00:00:59] Michael: That's right. I remember when like TensorFlow came out and people started talking about, obviously at the time after AlexNet, the deep learning revolution was already in flow. Good computer vision models were a thing. And what really made me interested in deep learning was I got invited to go to Apple's WWDC conference as a student scholar because I was really into making iOS apps at the time. So I go there and I go to this talk where they added an API that let people run computer vision models on the device using far more efficient GPU primitives. After seeing that, I was like, oh, this is cool. This is going to have a big explosion of different computer vision models running locally on the iPhone. And so I had this crazy idea where it was like, what if I could just make this model that could recognize just about anything and have it run on the device? And that was the genesis for what eventually became SmartLens. I took this data set called ImageNet 22K. So most people, when they think of ImageNet, think of ImageNet 1K. But the full ImageNet actually has, I think, 22,000 different categories. So I took that, filtered it, pre-processed it, and then did a massive fine tune on Inception V3, which was, I think, the state of the art deep convolutional computer vision model at the time. And to my surprise, it actually worked insanely well. 
I had no idea what would happen if I gave that many categories to a single model. I think it ended up being approximately 17,000 categories that I collapsed them into. It worked so well that it actually worked better than Google Lens, which released its V1 around the same time. And on top of this, the model ran on the device. So it didn't need an internet connection. A big part of the issue with Google Lens at the time was that connections were slower. 4G was around, but it wasn't nearly as fast. So there was a noticeable lag having to upload an image to a server and get it back. But just processing it locally, even on the iPhones of the day in 2017, was much faster. It was a cool little project. It got some traction. TechCrunch wrote about it. There was kind of like one big spike in usage, and then over time it tapered off. But people still pay for it, which is wild. [00:03:14] Swyx: That's awesome. Oh, it's like a monthly or annual subscription? [00:03:16] Michael: Yeah, it's like a monthly subscription. [00:03:18] Swyx: Even though you don't actually have any servers? [00:03:19] Michael: Even though we don't have any servers. That's right. I was in high school. I had a little bit of money. I was like, yeah. [00:03:25] Swyx: That's awesome. I always wonder about the modern equivalents, kind of like Be My Eyes. And it was actually disclosed in the GPT-4 Vision system card recently that the usage was surprisingly not that frequent. The extent to which all three of us have our sense of sight. I would think that if I lost my sense of sight, I would use Be My Eyes all the time. The average usage of Be My Eyes per day is 1.5 times. [00:03:49] Michael: Exactly. I was thinking about this as well, where I was also looking into image captioning, where you give a model an image and then it tells you what's in the image. But it turns out that what people want is the exact opposite. People want to give a description of an image and then have the AI generate the image. [00:04:04] Alessio: Oh, the other way. [00:04:06] Michael: Exactly. And so at the time, I think there were some GANs, NVIDIA was working on this back in 2019, 2020. They had some impressive, I think, face GANs where they had this model that would produce these really high quality portraits, but it wasn't able to take a natural language description the way Midjourney or DALL-E 3 can and just generate you an image with exactly what you described in it. [00:04:32] Swyx: And how did that get into NLP? [00:04:35] Michael: Yeah, I released the SmartLens app and that was around the time I was a senior in high school. I was applying to college. College rolls around. I'm still sort of working on updating the app in college. But I start thinking like, hey, what if I make an enterprise version of this as well? At the time, there was Clarifai that provided some computer vision APIs, but I thought this massive classification model works so well and it's so small and so fast, might as well build an enterprise product. And I didn't even talk to users or do any of those things that you're supposed to do. I was just mainly interested in building a type of backend I've never built before. So I was mainly just doing it for myself just to learn. I built this enterprise classification product and as part of it, I'm also building an invoice processing product where, using some of the aspects that I built previously, although obviously it's very different from classification, I wanted to be able to just extract a bunch of structured data from an unstructured invoice through our API.
And that's what led me to Hugging Face for the f
The first workshops and talks from the AI Engineer Summit are now up! Join the >20k viewers on YouTube, find clips on Twitter (we’re also clipping @latentspacepod), and chat with us on Discord! Text-to-SQL was one of the first applications of NLP. ThoughtSpot offered “Ask your data questions” as their core differentiation compared to traditional dashboarding tools. In a way, they provide a much friendlier interface with your own structured (aka “tabular”, as in “SQL tables”) data, the same way that RLHF and Instruction Tuning helped turn the GPT-3 of 2020 into the ChatGPT of 2022. Today, natural language queries on your databases are a commodity. There are 4 different ChatGPT plugins that offer this, as well as a bunch of startups like one of our previous guests, Seek.ai. Perplexity originally started with a similar product in 2022: In March 2023 LangChain wrote a blog post on LLMs and SQL highlighting why they don’t consistently work: * “LLMs can write SQL, but they are often prone to making up tables, making up fields” * “LLMs have some context window which limits the amount of text they can operate over” * “The SQL it writes may be incorrect for whatever reason, or it could be correct but just return an unexpected result.” For example, if you ask a model to “return all active users in the last 7 days” it might hallucinate an `is_active` column, join to an `activity` table that doesn’t exist, or potentially get the wrong date (especially in leap years!). We previously talked to Shreya Rajpal at Guardrails AI, which also supports Text2SQL enforcement. Their approach was to run the actual SQL against your database and then use the error messages to improve the query: Semantic Layers to the rescue Cube is an open source semantic layer which recently integrated with LangChain to solve these issues in a different way. You can use YAML, Javascript, or Python to create definitions of different metrics, measures and dimensions for your data: Creating these metrics and passing them in the model context limits the possibility for errors, as the model just needs to query the `active_users` view, and Cube will then expand that into the full SQL in a reliable way (see the toy sketch below). The downside of this approach compared to the Guardrails one, for example, is that it requires more upfront work to define metrics, but on the other hand it leads to more reliable and predictable outputs. The promise of adding a great semantic layer to your LLM app is irresistible - you greatly minimize hallucinations, make much more token efficient prompts, and your data stays up to date without any retraining or re-indexing. However, there are also difficulties with implementing semantic layers well, so we were glad to go deep on the topic with Artem as one of the leading players in this space!
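To make the reliability argument concrete, here is a toy sketch of the expansion step. This is not Cube’s actual API (Cube’s real definitions are data models with measures and dimensions written in YAML, JavaScript, or Python); the metric names and schema below are made up for illustration. The point is that the model only ever emits a metric name plus filters, and the semantic layer deterministically expands that into SQL:

```python
# A toy semantic layer: metric definitions live in config, not in the model.
METRICS = {
    "active_users": {
        "sql": "SELECT COUNT(DISTINCT user_id) FROM events",
        "time_column": "created_at",
    },
}

def expand(metric: str, last_n_days: int | None = None) -> str:
    m = METRICS[metric]  # an unknown metric raises KeyError instead of
                         # silently producing a plausible-looking query
    sql = m["sql"]
    if last_n_days is not None:
        sql += (f" WHERE {m['time_column']} >="
                f" CURRENT_DATE - INTERVAL '{last_n_days} days'")
    return sql

# The model's structured output for "return all active users in the last 7 days":
print(expand("active_users", last_n_days=7))
```

There is no `is_active` column for the model to hallucinate, because it never writes raw SQL in the first place; the worst case is a loud `KeyError`, which is a much better failure mode than a quietly wrong query.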
Timestamps * [00:00:00] Introductions * [00:01:28] Statsbot and limitations of natural language processing in 2017 * [00:04:27] Building Cube as the infrastructure for Statsbot * [00:08:01] Open sourcing Cube in 2019 * [00:09:09] Explaining the concept of a semantic layer/Cube * [00:11:01] Using semantic layers to provide context for AI models working with tabular data * [00:14:47] Workflow of generating queries from natural language via semantic layer * [00:21:07] Using Cube to power customer-facing analytics and natural language interfaces * [00:22:38] Building data-driven AI applications and agents * [00:25:59] The future of the modern data stack * [00:29:43] Example use cases of Slack bots powered by Cube * [00:30:59] Using GPT models and limitations around math * [00:32:44] Tips for building data-driven AI apps * [00:35:20] Challenges around monetizing embedded analytics * [00:36:27] Lightning Round Transcript Swyx: Hey everyone, welcome to the Latent Space podcast. This is Swyx, writer and editor of Latent Space and founder of Smol.ai, and Alessio, partner and CTO-in-Residence at Decibel Partners. [00:00:15] Alessio: Hey everyone, and today we have Artem Keydunov on the podcast, co-founder of Cube. Hey Artem. [00:00:21] Artem: Hey Alessio, hi Swyx. Good to be here today, thank you for inviting me. [00:00:25] Alessio: Yeah, thanks for joining. For people that don't know, I've known Artem for a long time, ever since he started Cube. And Cube is actually a spin-out of his previous company, which is Statsbot. And this kind of feels like going both backward and forward in time. So the premise of Statsbot was having a Slack bot that you can ask, basically like text to SQL in Slack, and this was six, seven years ago, something like that, a lot ahead of its time, and you see startups trying to do that today. And then Cube came out of that as a part of the infrastructure that was powering Statsbot. And Cube then evolved from an embedded analytics product to the semantic layer and just an awesome open source evolution. I think you have over 16,000 stars on GitHub today, you have a very active open source community. But maybe for people at home, just give a quick like lay of the land of the original Statsbot product. You know, what got you interested in like text to SQL and what were some of the limitations that you saw then, the limitations that you're also seeing today in the new landscape? [00:01:28] Artem: I started Statsbot in 2016. The original idea was to just make sort of a side project based off my initial project that I did at a company that I was working for back then. And I was working for a company that was building software for schools, and we were using Slack a lot. And Slack was growing really fast, a lot of people were talking about Slack, you know, like Slack apps, chatbots in general. So I think it was, you know, like another wave of, you know, bots and all that. We have one more wave right now, but it always comes in waves. So we were like living through one of those waves. And I wanted to build a bot that would give me information from different places where data lives, into Slack. So it was like developer data, like New Relic, maybe some marketing data, Google Analytics, and then some just regular data, like a production database, or Salesforce sometimes. And I wanted to bring it all into Slack, because we were always chatting, you know, like in Slack, and I wanted to see some stats in Slack. So that was the idea of Statsbot, right, like bring stats to Slack.
I built that as a, you know, like a first sort of a side project, and I published it on Reddit. And people started to use it even before Slack came up with that Slack application directory. So it was a little, you know, like a hackish way to install it, but people were still installing it. So it was a lot of fun. And then Slack kind of came up with that application directory, and they reached out to me and they wanted to feature Statsbot, because it was already one of the kind of widely used bots on Slack. So they featured me on this application directory front page, and I just got a lot of, you know, like new users signing up for that. It was a lot of fun, I think, you know, but there was sort of a big limitation in terms of how you could process natural language, because the original idea was to let people ask questions directly in Slack, right, hey, show me my, you know, like opportunities closed last week or something like that. My co-founder, who kind of started helping me with this Slack application, he and I were trying to build a system to recognize that natural language. But, you know, we didn't have LLMs back then and all of that technology. So it was really hard to build the system, especially a system that can kind of, you know, keep talking to you, like maintain some sort of a dialogue. It was a lot of like one-off requests, and it was a lot of hit and miss, right? If you know how to construct a query in natural language, you will get a result back. But, you know, it was not a system that was capable of, you know, asking follow-up questions to try to understand what you actually want, and then kind of finally, you know, bringing all this context together and going to generate a SQL query, getting the result back and all of that. So that was a really missing part. And I think right now, that's, you know, what is the difference? So right now, I'm kind of bullish that if I started Statsbot again, I'd probably have a much better shot at it. But back then, that was a big limitation. We kind of built Cube, right, as we were working on Statsbot, because we needed it. [00:04:27] Alessio: What was the ML stack at the time? Were you building, trying to build your own natural language understanding models, like were there open source models that were good that you were trying to leverage? [00:04:38] Artem: I think it was mostly a combination of a bunch of things. And we tried a lot of different approaches. The first version, which I built, was Regex. It was working well. [00:04:47] Swyx: It's the same as I did, I did option pricing when I was in finance, and I had a natural language pricing tool thing. And it was Regex. It was just a lot of Regex. [00:04:59] Artem: Yeah. [00:05:00] Artem: And my co-founder, Pavel, he's much smarter than I am. He's like PhD in math, all of that. And he started to do some stuff. I was like, no, you just do that stuff. I don't know. I can do Regex. And he started to do some models and trying to either look at what we had on the market back then, or try to build different sorts of models. Again, we didn't have any foundation models in place, right? We wanted to try to use existing models, obviously, right? But it was not something where we could just take a model and try and run it. I think in 2019, we started to see more stuff, like an ecosystem being built, and then it eventually kind of resulted in all these LLMs, like what we have right now.
But back then in 2016, there was not much available for people to just build on top of. It was some academic research, right, kind of been happ
Thanks to the over 17,000 people who have joined the first AI Engineer Summit! A full recap is coming. Last call to fill out the State of AI Engineering survey! See our Community page for upcoming meetups in SF, Paris and NYC. This episode had good interest on Twitter and was discussed on the Vanishing Gradients podcast. Fast.ai’s “Practical Deep Learning” courses have been watched by over 6,000,000 people, and the fastai library has over 25,000 stars on GitHub. Jeremy Howard, one of the creators of fast.ai, is now one of the most prominent and respected voices in the machine learning industry; but that wasn’t always the case. Being non-consensus and right In 2018, Jeremy and Sebastian Ruder published a paper on ULMFiT (Universal Language Model Fine-tuning), a 3-step transfer learning technique for NLP tasks: The paper demonstrated that pre-trained language models could be fine-tuned on a specific task with a relatively small amount of data to achieve state-of-the-art results. They trained a 24M-parameter model on WikiText-103 which beat most benchmarks. While the paper had great results, the methods behind it weren’t taken seriously by the community: “Everybody hated fine tuning. Everybody hated transfer learning. I literally did tours trying to get people to start doing transfer learning and nobody was interested, particularly after GPT showed such good results with zero shot and few shot learning […] which I was convinced was not the right direction, but who's going to listen to me, cause as you said, I don't have a PhD, not at a university… I don't have a big set of computers to fine tune huge transformer models.” Five years later, fine-tuning is at the center of most major discussion topics in AI (we covered some like fine-tuning vs RAG and small models fine-tuning), and we might have gotten here earlier if Jeremy had OpenAI-level access to compute and distribution. At heart, Jeremy has always been “GPU poor”: “I've always been somebody who does not want to build stuff on lots of big computers because most people don't have lots of big computers and I hate creating stuff that most people can't use.” This story is a good reminder of how some of the best ideas are hiding in plain sight; we recently covered RWKV and will continue to highlight the most interesting research that isn’t being done in the large labs. Replacing fine-tuning with continued pre-training Even though fine-tuning is now mainstream, we still have a lot to learn. The issue of “catastrophic forgetting” and potential solutions have been brought up in many papers: at the fine-tuning stage, the model can forget tasks it previously knew how to solve in favor of new ones. The other issue is apparent memorization of the dataset even after a single epoch, which Jeremy covered in Can LLMs learn from a single example?, but we still don’t have the answer to. Despite being the creator of ULMFiT, Jeremy still professes that there are a lot of open questions on fine-tuning: “So I still don't know how to fine tune language models properly and I haven't found anybody who feels like they do.” He now advocates for "continued pre-training" - maintaining a diversity of data throughout the training process rather than separate pre-training and fine-tuning stages. Mixing instructional data, exercises, code, and other modalities while gradually curating higher quality data can avoid catastrophic forgetting and lead to more robust capabilities (something we covered in Datasets 101; a schematic sketch follows below).
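Here is what that schedule might look like in training code; the source names and weight curves are purely illustrative, not Jeremy’s actual recipe. The property that matters is that task-specific data ramps up while general data tapers but never hits zero, whereas a separate fine-tuning stage effectively sets every pre-training source to zero at once:

```python
import random

def mixture_weights(step: int, total_steps: int) -> dict[str, float]:
    t = step / total_steps  # 0.0 at the start of training, 1.0 at the end
    return {
        "web_text":     1.0 - 0.5 * t,  # taper general data, never drop it
        "code":         1.0,
        "instructions": 1.0 + 2.0 * t,  # curate toward target tasks over time
        "exercises":    1.0 + 2.0 * t,
    }

def sample_source(step: int, total_steps: int) -> str:
    w = mixture_weights(step, total_steps)
    names, weights = zip(*w.items())
    return random.choices(names, weights=weights, k=1)[0]

# Early in training, plain web text is as likely as anything else; late in
# training, instructions and exercises dominate, but web text never
# disappears (which is what avoids catastrophic forgetting).
print(sample_source(step=0, total_steps=100_000))
print(sample_source(step=99_999, total_steps=100_000))
```

Jeremy puts the same point more bluntly: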
“Even though I originally created the three-step approach that everybody now does, my view is it's actually wrong and we shouldn't use it… the right way to fine-tune language models is to actually throw away the idea of fine-tuning. There's no such thing. There's only continued pre-training. And pre-training is something where from the very start, you try to include all the kinds of data that you care about, all the kinds of problems that you care about, instructions, exercises, code, general purpose document completion, whatever. And then as you train, you gradually curate that, you know, you gradually make that higher and higher quality and more and more specific to the kinds of tasks you want it to do. But you never throw away any data… So yeah, that's now my view, is I think ULMFiT is the wrong approach. And that's why we're seeing a lot of these so-called alignment tax… I think it's actually because people are training them wrong.” An example of this phenomenon is CodeLlama, a LLaMA2 model fine-tuned on 500B tokens of code: while the model is much better at code, it’s worse on generic tasks that LLaMA2 knew how to solve well before the fine-tuning. In the episode we also dive into all the places where open source model development and research is happening (academia vs Discords - tracked on our Communities list and on our survey), and how Jeremy recommends getting the most out of these diffuse, pseudonymous communities (similar to the Eleuther AI Mafia). Show Notes * Jeremy’s Background * FastMail * Optimal Decisions * Kaggle * Enlitic * fast.ai * Rachel Thomas * Practical Deep Learning * fastai for PyTorch * nbdev * fastec2 (the underrated library we describe) * Can LLMs learn from a single example? * the Kaggle LLM Science Exam competition, which “challenges participants to answer difficult science-based questions written by a Large Language Model”. * Sebastian Ruder * Alec Radford * Sylvain Gugger * Stephen Merity * Chris Lattner * Modular.ai / Mojo * Jono Whitaker * Zeiler and Fergus paper * ULMFiT * DAWNBench * Phi-1 * Code Llama * AlexNet Timestamps * [00:00:00] Intros and Jeremy’s background * [00:05:28] Creating ULMFiT - a breakthrough in NLP using transfer learning * [00:06:32] The rise of GPT and the appeal of few-shot learning over fine-tuning * [00:10:00] Starting Fast.ai to distribute AI capabilities beyond elite academics * [00:14:30] How modern LMs like ChatGPT still follow the ULMFiT 3-step approach * [00:17:23] Meeting with Chris Lattner on Swift for TensorFlow at Google * [00:20:00] Continued pre-training as a fine-tuning alternative * [00:22:16] Fast.ai and looking for impact vs profit maximization * [00:26:39] Using Fast.ai to create an "army" of AI experts to improve their domains * [00:29:32] Fast.ai's 3 focus areas - research, software, and courses * [00:38:42] Fine-tuning memorization and training curve "clunks" before each epoch * [00:46:47] Poor training and fine-tuning practices may be causing alignment failures * [00:48:38] Academia vs Discords * [00:53:41] Jeremy's high hopes for Chris Lattner's Mojo and its potential * [01:05:00] Adding capabilities like SQL generation through quick fine-tuning * [01:10:12] Rethinking Fast.ai courses for the AI-assisted coding era * [01:14:53] Rapid model development has created major technical debt * [01:17:08] Lightning Round AI Summary (beta) This is the first episode we’re trying this on. Here’s an overview of the main topics before you dive into the transcript.
* Jeremy's background and philosophies on AI * Studied philosophy and cognitive science in college * Focused on ethics and thinking about AI even 30 years ago * Believes AI should be accessible to more people, not just elite academics/programmers * Created fast.ai to make deep learning more accessible * Development of transfer learning and ULMFiT * Idea of transfer learning critical for making deep learning accessible * ULMFiT pioneered transfer learning for NLP * Proposed training general language models on large corpora then fine-tuning - this became standard practice * Faced skepticism from the NLP community that this approach would work * Showed state-of-the-art results on text classification soon after trying it * Current open questions around fine-tuning LLMs * Models appear to memorize training data extremely quickly (after 1 epoch) * This may hurt training dynamics and cause catastrophic forgetting * Unclear how best to fine-tune models to incorporate new information/capabilities * Need more research on model training dynamics and ideal data mixing * Exciting new developments * Mojo and new programming languages like Swift could enable faster model innovation * Still lots of room for improvements in computer vision-like innovations in transformers * Small models with fine-tuning may be surprisingly capable for many real-world tasks * Prompting strategies enable models like GPT-3 to achieve new skills like playing chess at superhuman levels * LLMs are like computer vision in 2013 - on the cusp of huge new breakthroughs in capabilities * Access to AI research * Many key convos happen in private Discord channels and forums * Becoming part of these communities can provide great learning opportunities * Being willing to do real work, not just talk about ideas, is key to gaining access * The future of practical AI * Coding becoming more accessible to non-programmers through AI assistance * Pre-requisite programming experience for learning AI may no longer be needed * Huge open questions remain about how to best train, fine-tune, and prompt LLMs Transcript Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:21] Swyx: Hey, and today we have in the remote studio, Jeremy Howard all the way from Australia. Good morning. [00:00:27] Jeremy: The remote studio, also known as my house. Good morning. Nice to see you. [00:00:32] Swyx: Nice to see you too. I'm actually very used to seeing you in your mask as a message to people, but today we're mostly audio. But thank you for doing the very important public service of COVID awareness. It was a pleasure. [00:00:46] Jeremy: It was all very annoying and frustrating and tedious, but somebody had to do it. [00:00:52] Swyx: Somebody had to do it, especially somebody with your profile. I think it really drives home the
Thanks to the over 11,000 people who joined us for the first AI Engineer Summit! A full recap is coming, but you can 1) catch up on the fun and videos on Twitter and YouTube, 2) help us reach 1000 people for the first comprehensive State of AI Engineering survey and 3) submit projects for the new AI Engineer Foundation. See our Community page for upcoming meetups in SF, Paris, NYC, and Singapore. This episode had good interest on Twitter. Last month, Imbue was crowned as AI’s newest unicorn foundation model lab, raising a $200m Series B at a >$1 billion valuation. As “stealth” foundation model companies go, Imbue (f.k.a. Generally Intelligent) has stood as an enigmatic group given they have no publicly released models to try out. However, ever since their $20m Series A last year their goal has been to “develop generally capable AI agents with human-like intelligence in order to solve problems in the real world”. From RL to Reasoning LLMs Along with their Series A, they announced Avalon, “A Benchmark for RL Generalization Using Procedurally Generated Worlds”. Avalon is built on top of the open source Godot game engine, and is ~100x faster than Minecraft to enable fast RL benchmarking and a clear reward with adjustable game difficulty. After a while, they realized that pure RL isn’t a good path to teach reasoning and planning. The agents were able to learn mechanical things like opening complex doors, climbing, but couldn’t go to higher level tasks. A pure RL world also doesn’t include a language explanation of the agent reasoning, which made it hard to understand why it made certain decisions. That pushed the team more towards the “models for reasoning” path: “The second thing we learned is that pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were able to learn all sorts of crazy things: They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing.” Inspired by Chelsea Finn’s work on SayCan at Stanford, the team pivoted to have their agents do the reasoning in natural language instead. This development parallels the large leaps in reasoning that humans have developed as the scientific method: “We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask: * What was the original claim that was made? * What evidence is there for this claim? * Does the evidence support the claim? * Is the claim correct? This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we can generate data that's much more specific to them.“ The Full Stack Model Lab One year later, it would seem that the pivot to reasoning has had tremendous success, and Imbue has now reached a >$1B valuation, with participation from Astera Institute, NVIDIA, Cruise CEO Kyle Vogt, Notion co-founder Simon Last, and others. Imbue tackles their work with a “full stack” approach: * Models. 
Pretraining very large (>100B parameter) models, optimized to perform well on internal reasoning benchmarks, with a ~10,000 Nvidia H100 GPU cluster lets us iterate rapidly on everything from training data to architecture and reasoning mechanisms. * Tools and Agents. Building internal productivity tools from coding agents for fixing type checking and linting errors, to sophisticated systems like CARBS (for hyperparameter tuning and network architecture search). * Interface Invention. Solving agent trust and collaboration (not merely communication) with humans by creating better abstractions and interfaces — IDEs for users to program computers in natural language. * Theory. Publishing research about the theoretical underpinnings of self-supervised learning, as well as scaling laws for machine learning research. Kanjun believes we are still in the “bare metal phase” of agent development, and they want to take a holistic approach to building the “operating system for agents”. We loved diving deep into the Imbue approach toward solving the AI Holy Grail of reliable agents, and are excited to share our conversation with you today! Timestamps * [00:00:00] Introductions * [00:06:07] The origin story of Imbue * [00:09:39] Imbue's approach to training large foundation models optimized for reasoning * [00:12:18] Imbue's goals to build an "operating system" for reliable, inspectable AI agents * [00:15:37] Imbue's process of developing internal tools and interfaces to collaborate with AI agents * [00:17:27] Imbue's focus on improving reasoning capabilities in models, using code and other data * [00:19:50] The value of using both public benchmarks and internal metrics to evaluate progress * [00:21:43] Lessons learned from developing the Avalon research environment * [00:23:31] The limitations of pure reinforcement learning for general intelligence * [00:28:36] Imbue's vision for building better abstractions and interfaces for reliable agents * [00:31:36] Interface design for collaborating with, rather than just communicating with, AI agents * [00:37:40] The future potential of an agent-to-agent protocol * [00:39:29] Leveraging approaches like critiquing between models and chain of thought * [00:45:49] Kanjun's philosophy on enabling team members as creative agents at Imbue * [00:53:51] Kanjun's experience co-founding the communal co-living space The Archive * [01:00:22] Lightning Round Show Notes * Imbue * Avalon * CARBS (hyperparameter optimizer) * Series B announcement * Kanjun/Imbue’s Podcast * MIT Media Lab * Research mentioned: * Momentum Contrast * SimClr * Chelsea Finn - SayCan * Agent Protocol - part of the AI Engineer Foundation * Xerox PARC * Michael Nielsen * Jason Benn * Outset Capital * Scenius - Kevin Kelly * South Park Commons * The Archive * Thursday Nights in AI Transcript Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO at Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19] Swyx: Hey, and today in the studio we have Kanjun from Imbue. Welcome. So you and I have, I guess, crossed paths a number of times. You're formerly named Generally Intelligent and you've just announced your rename, rebrand in huge, humongous ways. So congrats on all of that. And we're here to dive in into deeper detail on Imbue. We like to introduce you on a high level basis, but then have you go into a little bit more of your personal side. 
So you graduated your BS at MIT and you also spent some time at the MIT Media Lab, one of the most famous, I guess, computer hacking labs in the world. Then you graduated MIT and you went straight into BizOps at Dropbox, where you're eventually chief of staff, which is a pretty interesting role we can dive into later. And then it seems like the founder bug hit you. You were basically a three times founder at Ember, Sorceress, and now at Generally Intelligent slash Imbue. What should people know about you on the personal side that's not on your LinkedIn? That's something you're very passionate about outside of work. [00:01:12] Kanjun: Yeah. I think if you ask any of my friends, they would tell you that I'm obsessed with agency, like human agency and human potential. [00:01:19] Swyx: That's work. Come on. Kanjun: It's not work. What are you talking about? Swyx: So what's an example of human agency that you try to promote? [00:01:27] Kanjun: With all of my friends, I have a lot of conversations with them that's kind of helping figure out what's blocking them. I guess I do this with a team kind of automatically too. And I think about it for myself often, like building systems. I have a lot of systems to help myself be more effective. At Dropbox, I used to give this onboarding talk called How to Be Effective, which people liked. I think like a thousand people heard this onboarding talk, and I think maybe Dropbox was more effective. I think I just really believe that as humans, we can be a lot more than we are. And it's what drives everything. I guess completely outside of work, I do dance. I do partner dance. [00:02:03] Swyx: Yeah. Lots of interest in that stuff, especially in the sort of group living houses in San Francisco, which I've been a little bit part of, and you've also run one of those. [00:02:12] Kanjun: That's right. Yeah. I started the archive with two friends, with Josh, my co-founder, and a couple of other folks in 2015. That's right. And GPT-3, our housemates built. [00:02:22] Swyx: Was that the, I guess, the precursor to Generally Intelligent, that you started doing more things with Josh? Is that how that relationship started? Yeah. [00:02:30] Kanjun: This is our third company together. Our first company, Josh poached me from Dropbox for Ember. And there we built a really interesting technology, laser raster projector, VR headset. And then we were like, VR is not the thing we're most passionate about. And actually it was kind of early days when we both realized we really do believe that in our lifetimes, like computers that are intelligent are going to be able to allow us to do much more than we can do today as people and be much more as people than we can be today. And at that time, we actually, after Ember, we were like, work on AI research or start an AI lab. A bunch of our housemates were joining OpenAI, and we actually decided to do something more pragmati
This is a special double weekend crosspost of AI podcasts, helping attendees prepare for the AI Engineer Summit next week. After our first friendly feedswap with the Cognitive Revolution pod, swyx was invited for a full episode to go over the state of AI Engineering and to preview the AI Engineer Summit Schedule, where we share many former CogRev guests as speakers. For those seeking to understand how two top AI podcasts think about major top of mind AI Engineering topics, this should be the perfect place to get up to speed, which will be a preview of many of the conversations taking place during the topic tables sessions on the night of Monday October 9 at the AI Engineer Summit. While you are listening, there are two things you can do to be part of the AI Engineer experience. One, join the AI Engineer Summit Slack. Two, take the State of AI Engineering survey and help us get to 1000 respondents! Links * AI Engineer Summit (Join livestream and Slack community) * State of AI Engineering Survey (please help us fill this out to represent you!) * Cognitive Revolution full episode with Nathan * swyx’s ai-notes (featuring Communities in README.md) * We referenced The Eleuther AI Mafia * This podcast intro voice was AI Anna again, from our Wondercraft pod! Timestamps * (00:00:49) AI Nathan’s intro * (00:03:14) What is an AI engineer? * (00:05:56) What backgrounds do AI engineers typically have? * (00:17:13) Swyx’s Discord AI project * (00:20:41) Key tools for AI engineers * (00:23:42) HumanLoop, Guardrails, Langchain * (00:27:01) Criteria for identifying capable AI engineers when hiring * (00:30:59) Skepticism around AI being a fad and doubts about contributing to AI * (00:34:03) AI Engineer Conference speaker lineup * (00:41:14) AI agents and two years to AGI * (00:46:04) Expectations and disagreement around what AI agent capabilities will work soon * (00:50:12) Swyx’s OpenAI thesis * (00:53:03) AI safety considerations and the role of AI engineers * (00:56:24) Disagreement on whether AI will soon be able to generate code pull requests * (01:01:07) AI helping non-technical people to code * (01:01:49) Multi-modal Chat-GPT and the future implications * (01:03:33) Nathan living in the same dorm as Mark Zuckerberg * (01:04:44) Competitive dynamics between OpenAI and other AI model developers * (01:05:39) Play.ht vs ElevenLabs * (01:09:20) The tension between platforms and developers building on top of them * (01:11:40) The best thing startups can do to compete with foundation model providers * (01:16:26) User identity/authentication services like Login with OpenAI * (01:19:20) Google vs the other live players * (01:20:46) AI Horcruxes / Pendants * (01:22:05) The concept of an AI app bundle for consumers and developers This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
This is a special double weekend crosspost of AI podcasts, helping attendees prepare for the AI Engineer Summit next week. Swyx gave a keynote on the Software 3.0 Landscape recently (referenced in our recent Humanloop episode) and was invited to go deeper in podcast format, and to preview the AI Engineer Summit Schedule. For those seeking to ramp up on the current state of thinking on AI Engineering, this should be the perfect place to start, alongside our upcoming Latent Space University course (which is being tested live for the first time at the Summit workshops). While you are listening, there are two things you can do to be part of the AI Engineer experience. One, join the AI Engineer Summit Slack. Two, take the State of AI Engineering survey and help us get to 1000 respondents! Full transcript available here! Links * AI Engineer Summit (Join livestream and Slack community) * State of AI Engineering Survey (please help us fill this out to represent you!) * Podrocket full episode by Tejas Kumar Show notes * Explaining Software 1.0, 2.0, and 3.0 * Software 1.0: Hand-coded software with conditional logic, loops, etc. * Software 2.0: Machine learning models like neural nets trained on data * Software 3.0: Using large pre-trained foundation models without needing to collect/label training data * Foundation Models and Model Architecture * Foundation models like GPT-3/4, Claude, Whisper - can be used off the shelf via API * Model architecture refers to the layers and structure of a ML model * Grabbing a pre-trained model lets you skip data collection and training * Putting Foundation Models into Production * Levels of difficulty: calling an API, running locally, fully serving high-volume predictions * Key factors: GPU utilization, batching, infrastructure expertise * The Emerging AI Developer Landscape * AI is becoming more accessible to "traditional" software engineers * Distinction between ML engineers and new role of AI engineers * AI engineers consume foundation model APIs vs. developing models from scratch * The Economics of AI Engineers * Demand for AI exceeds supply of ML experts to build it * AI engineers will emerge out of software engineers learning these skills * Defining the AI Engineering Stack * System of reasoning: Foundation model APIs * Retrieval augmented generation (RAG) stack: Connects models to data * AI UX: New modalities and interfaces beyond chatbots * Building Products with Foundation Models * Replicating existing features isn't enough - need unique value * Focus on solving customer problems and building trust * AI Skepticism and Hype * Some skepticism is healthy, but "AI blame" also emerges * High expectations from media/industry creators * Important to stay grounded in real customer needs * Meaningful AI Applications * Many examples of AI positively impacting lives already * Engineers have power to build and explore - lots of opportunity * Closing and AI Engineer Summit Details * October 8-10 virtual conference for AI engineers * Speakers from OpenAI, Microsoft, Amazon, etc * Free to attend online This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Want to help define the AI Engineer stack? >800 folks have weighed in on the top tools, communities and builders for the first State of AI Engineering survey, which we will present for the first time at next week’s AI Engineer Summit. Join us online! This post had robust discussion on HN and Twitter. In October 2022, Robust Intelligence hosted an internal hackathon to play around with LLMs which led to the creation of two of the most important AI Engineering tools: LangChain 🦜⛓️ (our interview with Harrison here) and LlamaIndex 🦙 by Jerry Liu, which we’ll cover today. In less than a year, LlamaIndex has crossed 600,000 monthly downloads, raised $8.5M from Greylock, has a fast growing open source community that contributes to LlamaHub, and it doesn’t seem to be slowing down. LlamaIndex’s Origin (aka GPT Tree Index) Jerry struggled to make large amounts of data work with GPT-3 (which had a 4,096 tokens context window). Today LlamaIndex is at the forefront of the RAG wave (Retrieval Augmented Generation), but in the beginning Jerry wasn’t focused on embeddings and search, but rather on understanding how models could summarize, link, and reason about data. On November 5th, Jerry pushed the first version to Github under the name “GPT Tree Index”: The GPT Tree Index first takes in a large dataset of unprocessed text data as input. It then builds up a tree-index in a bottom-up fashion; each parent node is able to summarize the children nodes using a general summarization prompt; each intermediate node containing summary text summarizing the components below. Once the index is built, it can be saved to disk and loaded for future use. Then, say the user wants to use GPT-3 to answer a question. Using a query prompt template, GPT-3 will be able to recursively perform tree traversal in a top-down fashion in order to answer a question. For example, in the very beginning GPT-3 is tasked with selecting between *n* top-level nodes which best answers a provided query, by outputting a number as a multiple-choice problem. The GPT Tree Index then uses the number to select the corresponding node, and the process repeats recursively among the children nodes until a leaf node is reached. […] How is this better than an embeddings-based approach / other state-of-the-art QA and retrieval methods? The intent is not to compete against existing methods. A simpler embedding-based technique could be to just encode each chunk as an embedding and do a simple question-document embedding look-up to retrieve the result. This project is a simple exercise to test how GPT can organize and lookup information. The project attracted a lot of attention early on (the announcement tweet has ~330 likes), but it wasn’t until ~February 2023 that the open source community really started to explode, which was around the same time that LlamaHub was released. LlamaHub made it easy for developers to import data from Google Drive, Discord, Slack, databases, and more into their LlamaIndex projects. What is LlamaIndex? As we mentioned, LlamaIndex is leading the charge in the development of the RAG stack. RAG boils down to two parts: * Indexing (i.e. how do you load and index the data in your knowledge base) * Querying (i.e. how do you surface the data and fit it in the model context) Indexing To get your data from all your sources to your RAG knowledge base, you can leverage a few tools: * Documents / Nodes: A Document is a generic container around any data source - for instance, a PDF, an API output, or retrieved data from a database. 
A Node is the atomic unit of data in LlamaIndex and represents a “chunk” of a source Document (i.e. one Document has many Nodes) as well as its relationship to other Node objects. * Data Connectors: A data connector ingests data from different sources and turns them into Document representations (text and simple metadata). These connectors are offered through LlamaHub, and there are over 200 of them today. * Data Indexes: Once you’ve ingested your data, LlamaIndex will help you index the data into a format that’s easy to retrieve. There are many types of indexes (Summary, Tree, Vector, etc). Under the hood, LlamaIndex parses the raw documents into intermediate representations, calculates vector embeddings, and infers metadata. The most commonly used index is the VectorStoreIndex, which can then be paired with any of the vector stores out there (an example with Chroma). Querying The RAG pipeline, during the querying phase, sources the most pertinent context for a user's prompt, forwarding it along to the LLM. This equips the LLM with current / private knowledge beyond its foundational training data. LlamaIndex offers adaptable modules tailored for building RAG pathways for Q&A, chatbots, or agent use, since each of them has different requirements. For example, a chatbot should expect the user to interject with follow up questions, while an agent will try to carry out a whole task on its own without user intervention. Building Blocks * Retrievers: A retriever defines how to efficiently retrieve relevant context from a knowledge base (i.e. index) when given a query. Vector index is the most popular mode, but there are other options like Summary, Tree, Keyword Table, Knowledge Graph, and Document Summary. * Node Postprocessors: Once the retriever gets you Node objects back, you will need to do additional work like discarding low similarity ones. There are many options here as well, such as `SimilarityPostprocessor` (i.e. drop nodes below a certain similarity score) or `LongContextReorder` which helps avoid the issues raised in the “Lost in the Middle, U-shaped recollection curve” paper. * Response Synthesizers: Takes a user query and your retrieved chunks, and prompts an LLM with them. There are a few response modes here that balance thoroughness and compactness. Pipelines * Query Engines: A query engine is an end-to-end pipeline that allows you to ask questions over your data. It takes in a natural language query, and returns a response, along with reference context retrieved and passed to the LLM. This makes it possible to do things like “Ask pandas questions” by leveraging pandas dataframes as a data source. * Chat Engines: A chat engine is an end-to-end pipeline for having a conversation with your data (multiple back-and-forth instead of a single question & answer). This supports traditional OpenAI-style chat interfaces, as well as more advanced ones like ReAct. * Agents: An agent is an automated decision maker (powered by an LLM) that interacts with the world via a set of tools. Agents may be used in the same fashion as query engines or chat engines, but they have the power to both read and write data. For reasoning, you can use either OpenAI Functions or ReAct. Both can leverage the tools offered through LlamaHub for further analysis (see the end-to-end sketch below). RAG vs Finetuning Now that you have a full overview of what LlamaIndex does, the next question is “When should I use this and when should I fine tune?”. Jerry’s TLDR is that “RAG is just a hack”, but a powerful one. 
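To make the pieces above concrete, here is roughly what the load → index → query flow looks like with LlamaIndex's high-level API — a minimal sketch, assuming a local `data/` folder of documents and an OpenAI key in the environment (exact import paths have moved around between versions):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Indexing: load raw files into Documents, chunk them into Nodes, and embed.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Querying: a query engine bundles retriever + postprocessors + synthesizer.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does the lease say about subletting?")
print(response)               # the synthesized answer
print(response.source_nodes)  # the retrieved chunks that were passed to the LLM

# Or hold a multi-turn conversation over the same index.
chat_engine = index.as_chat_engine()
print(chat_engine.chat("And what about pets?"))
```

Back to Jerry's RAG-vs-finetuning question.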
Each option has pros and cons: * Lower investment: RAG requires almost 0 upfront investment, unlike finetuning which requires data cleaning, model training, increased costs for finetuned inference, etc. * Stricter access control and higher visibility: when finetuning, the model learns everything. With RAG, you can decide what documents the index should have access to, making it more secure by default. You are also able to see everything that was passed into the context if a response doesn’t look right. * Context window limitation: you can only fit so many tokens into the prompt due to the way models work. Finetuning helps you circumvent that by compressing the knowledge into the model weights rather than putting it in the prompt. As Jerry says, the best way to know this inside out is to learn to build RAG from scratch (without LlamaIndex) - and there are plenty of tutorials on his Twitter and blog to learn this. The other issue is that the math for finetuning isn’t well known yet as we discussed with Quentin Anthony from Eleuther, so unless you have money and time to invest into exploring fine tuning, you’re better off starting with RAG. Full YouTube Discussion! Show Notes * LlamaIndex * LlamaHub * SEC Insights * Robust Intelligence * Quora’s Poe * Chroma * Vespa * Why should every AI engineer learn to build RAG from scratch? * LangChain * Gorilla * Lost in the Middle: How Language Models Use Long Contexts Timestamps * [00:00:00] Introductions and Jerry’s background * [00:04:30] Starting LlamaIndex as a side project * [00:05:11] Evolution from tree-index to current LlamaIndex and LlamaHub architecture * [00:11:39] Deciding to leave Robust to start the LlamaIndex company and raising funding * [00:20:06] Context window size and information capacity for LLMs * [00:21:34] Minimum viable context and maximum context for RAG * [00:22:52] Fine-tuning vs RAG - current limitations and future potential * [00:24:02] RAG as a hack but good hack for now * [00:26:19] RAG benefits - transparency and access control * [00:27:46] Potential for fine-tuning to take over some RAG capabilities * [00:30:04] Baking everything into an end-to-end trained LLM * [00:33:24] Similarities between iterating on ML models and LLM apps * [00:34:47] Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning * [00:40:16] Evaluating and optimizing each component of the LlamaIndex system * [00:46:02] Building retrieval benchmarks to evaluate RAG * [00:47:24] SEC Insights - open source full stack LLM app using LlamaIndex * [00:49:48] Enterprise platform to complement LlamaIndex open source * [00:51:00] Community contributions for LlamaHub data loaders * [00:53:21] LLM engine usage - majority OpenAI but options expanding * [00:56:25] Vector store landscape * [00:59:46] Exploring relationships and graphs within data
Want to help define the AI Engineer stack? >500 folks have weighed in on the top tools, communities and builders for the first State of AI Engineering survey! Please fill it out (and help us reach 1000!) The AI Engineer Summit schedule is now live! We are running two Summits and judging two Hackathons this Oct. As usual, see our Discord and community page for all events. A rite of passage for every AI Engineer is shipping a quick and easy demo, and then having to cobble together a bunch of solutions for prompt sharing and versioning, running prompt evals and monitoring, storing data and finetuning as their AI apps go from playground to production. This happens to be Humanloop’s exact pitch. full show notes: https://latent.space/p/humanloop Timestamps * [00:01:21] Introducing Raza * [00:10:52] Humanloop Origins * [00:19:25] What is HumanLoop? * [00:20:57] Who is the Buyer of PromptOps? * [00:22:21] HumanLoop Features * [00:22:49] The Three Stages of Prompt Evals * [00:24:34] The Three Types of Human Feedback * [00:27:21] UI vs BI for AI * [00:28:26] LangSmith vs HumanLoop comparisons * [00:31:46] The TAM of PromptOps * [00:32:58] How to Be Early * [00:34:41] 6 Orders of Magnitude * [00:36:09] Becoming an Enterprise Ready AI Infra Startup * [00:40:41] Killer Usecases of AI * [00:43:56] HumanLoop's new Free Tier and Pricing * [00:45:20] Addressing Graduation Risk * [00:48:11] On Company Building * [00:49:58] On Opinionatedness * [00:51:09] HumanLoop Hiring * [00:52:42] How HumanLoop thinks about PMF * [00:55:16] Market: LMOps vs MLOps * [00:57:01] Impact of Multimodal Models * [00:57:58] Prompt Engineering vs AI Engineering * [01:00:11] LLM Cascades and Probabilistic AI Languages * [01:02:02] Prompt Injection and Prompt Security * [01:03:24] Finetuning vs HumanLoop * [01:04:43] Open Standards in LLM Tooling * [01:06:05] Did GPT4 Get Dumber? * [01:07:29] Europe's AI Scene * [01:09:31] Just move to SF (in The Arena) * [01:12:23] Lightning Round - Acceleration * [01:13:48] Continual Learning * [01:15:02] DeepMind Gato Explanation * [01:17:40] Motivations from Academia to Startup * [01:19:52] Lightning Round - The Takeaway This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We’re collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)! In March, we started off our GPT4 coverage framing one of this year’s key forks in the road as the “Year of Multimodal vs Multimodel AI”. 6 months in, neither has panned out yet. The vast majority of LLM usage still defaults to chatbots built atop OpenAI (per our LangSmith discussion), and rumored GPU shortages have prevented the broader rollout of GPT-4 Vision. Most "AI media” demos like AI Drake and AI South Park turned out heavily human engineered, to the point where the AI label is more marketing than honest reflection of value contributed. However, the biggest impact of multimodal AI in our lives this year has been a relatively simple product - the daily HN Recap podcast produced by Wondercraft.ai, a 5 month old AI podcasting startup. As swyx observed, the “content flippening” — an event horizon when the majority of content you choose to consume is primarily AI generated/augmented rather than primarily human/manually produced — has now gone from unthinkable to possible. For full show notes, go to: https://latent.space/p/wondercraft Timestamps * [00:03:15] What is Wondercraft? * [00:08:22] Features of Wondercraft * [00:10:42] Types of Podcasts * [00:11:44] The Importance of Consistency * [00:14:01] Wondercraft House Podcasts * [00:19:27] Video Translation and Dubbing * [00:21:49] Building Wondercraft in 1 Day * [00:24:25] What is your moat? * [00:30:37] Audio Generation stack * [00:32:12] How Important is it to Sound Human? and AI Uncanny Valley * [00:36:02] AI Watermarking * [00:36:32] The Text to Speech Industry * [00:41:19] Voice Synthesis Research * [00:45:53] AI Podcaster interviews Human Podcaster * [00:50:38] Takeaway This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We’re collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)! If AI is so important, why is its software so bad? This was the motivating question for Chris Lattner as he reconnected with his product counterpart on Tensorflow, Tim Davis, and started working on a modular solution to the problem of sprawling, monolithic, fragmented platforms in AI development. They announced a $30m seed in 2022 and, following their successful double launch of Modular/Mojo🔥 in May, have just announced their $100m Series A. While the performance claims of Mojo🔥 and its promise as a fully multithreaded compiled Python superset stole the show, we were amazed to learn that it is a side project - and the vision for Modular’s Python inference engine is at least as big. Listeners will recall that we last talked with George Hotz about his work on tinygrad and how he wants to replace PyTorch with something faster and lighter, handwriting a “reduced instruction set” of operators himself. But what if the problem could be solved at an even lower level - with the Python engine/runtime itself? Chris on Compilers Chris’ history with compilers is well known - creating LLVM during his PhD (for which he won the 2012 ACM Software System Award), hired straight into Apple where he also made Clang and Swift (the iPhone programming language that replaced Objective-C), then leading the Tensorflow Infrastructure team at Google where he built XLA, a just-in-time compiler for optimizing a lot of the algebra behind TF’s workloads, and MLIR, a modular compiler framework that sat above LLVM to optimize ML graphs and kernels that were hard to represent in the LLVM IR. So as pretty much the best compiler engineer in human history, you’d justifiably assume that Chris is simply choosing to take his compiler approach to Python. And yet that is not how he thinks about compilers at all. As he says in our chat, “How do you enable invention? How do you get more kinds of people that understand different parts of this problem to actually collaborate? And so this is where I see our work on Mojo and on the engine… …I don't have a compiler hammer that I'm running around looking for compiler problems to hit.” Today a small number of people at companies like OpenAI spend a lot of time manually writing CUDA kernels. But an optimizing compiler for AI is a means to an end: increasing software collaboration, and expanding the ability of people with different skillsets and knowledge to contribute. “…What is the fundamental purpose of a compiler? Well, it's to make it so that you don't have to know as much about the hardware. You could write everything in very low-level assembly code for every single problem that you have… But what a compiler really does is it allows you to express things at a higher level of abstraction.” For Chris, compilers are also ways to properly automate generalized optimizations that might otherwise be manually coded and brittle abstractions, like operator fusion: “So NVIDIA goes and they build this really cool library called FasterTransformer. The performance point of using it is massive. So a lot of LLM companies and other folks use this thing because they want the performance. …Here's the problem. If you want to go innovate in transformers, now you're constrained by what FasterTransformer can do, right? 
And so, again, you come back to where are compilers useful? They're useful for generalization. If you can get the same quality result or better than FasterTransformer, but with a generalized architecture, well now you can get the best of both worlds, where you have orthogonality and composability, you enable research, you also get better performance.” Done correctly, these operator optimizations being implemented at the compiler level amount to an “AI Engine” that can not only survive, but enable major architecture shifts should a credible alternative LLM architecture come along someday. Modular — the Unified AI Engine Modular’s original goal was to build the “Unified AI Engine” to speed up AI development and inference - one that doesn’t assume an “AI = GPUs” world that only benefits the “GPU-rich”, but one that treats AI as “a large-scale, heterogeneous, parallel compute problem”. Modular itself is an engine (separate from Mojo, which we cover below) that can run all other frameworks between 10% and 650% faster on CPUs (with GPU support coming in the fall). At Google, Chris’ job wasn’t to build the best possible compiler for AI. The goal was to build the best compiler for TPUs, so that all TensorFlow users would have a great Google Cloud experience. Similarly, the PyTorch team at Meta isn’t trying to make AI faster for the world, but mostly for their recommendations and ads systems. Chris and Tim realized that the AI engine and developer experience isn’t a product prioritized by any of the big tech companies (they tried) - so they see Modular as the best way to deliver the AI development platform of the future. The modularity of Modular shines through in the hot-swapping Inference Engine demo, which has to be seen to be believed. Mojo 🔥 — Blazing Fast Python The other piece of Modular is Mojo, a new programming language for AI that is a superset of Python. In some sense it is “the ultimate yak shave”. We were shocked to learn that Chris and the team didn’t initially set out to create Mojo, but it started life as an internal DSL to make themselves more productive. Mojo adopted Python’s syntax since it’s by far the most used language in machine learning and AI. It also lets them support all existing PyPI packages, requiring no code changes for developers to go from Python to Mojo. Mojo comes with a lot of different underlying design choices that lead to much better performance: * It’s compiled rather than interpreted like Python * No GIL which allows for multi-threading * Better heap representation * Leverages MLIR In the perfect test scenario that leverages all of these improvements, Mojo is up to ~68,000x faster than Python 🔥 (fire emoji is a valid file extension for Mojo files, btw!). Of course, that is just one microbenchmark, but as Jeremy Howard explains, most Python codebases should run between 10-100x faster simply by moving to Mojo with very minor adjustments. A community member port of Llama2 from Python to Mojo shows it inferencing >100x faster than Python, and 20% faster than the handcoded raw C implementation. The Modular team is embarking on one of the hardest technical challenges we’ve seen a startup tackle, and we can’t wait to see what comes out of it. We had an amazing conversation with Chris diving into all the details, which we hope you enjoy! 
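Modular's engine itself isn't something we can sketch here, but the operator-fusion argument from the FasterTransformer discussion above is easy to see with PyTorch 2.x's torch.compile standing in as the generalizing compiler — a toy illustration of the general idea, not Modular's stack:

```python
import torch

def scaled_residual(x):
    # Three elementwise ops. Run eagerly, each line launches its own kernel
    # and round-trips the whole tensor through memory.
    a = x * 0.5
    b = torch.tanh(a)
    return b + x

# A compiler that sees the whole function is free to fuse all three
# elementwise ops into a single kernel - no hand-tuned library needed,
# and the function stays ordinary, hackable Python.
fused = torch.compile(scaled_residual)

x = torch.randn(1024, 1024)
assert torch.allclose(scaled_residual(x), fused(x), atol=1e-6)
```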
Show Notes * Modular AI * Chris’ personal website * Scott Forstall * Bret Victor’s Playgrounds * Karpathy’s Tweets * Speculative Execution * Llama memory constraints * LLVM * Clang * Swift * TensorFlow * PyTorch * XLA * MLIR * TPUs * Guido van Rossum Timestamps * [00:00:00] Introduction * [00:00:40] Chris's background - LLVM, Clang, Swift * [00:03:01] Chris's experience with Google TPUs and XLA * [00:05:47] The limitations of current frameworks like TensorFlow and PyTorch * [00:08:03] The benefits of using compilers for AI systems * [00:13:14] Enabling more collaboration between researchers through better systems * [00:20:55] Starting with CPU optimization instead of just GPUs * [00:24:36] Design principles and goals behind Modular * [00:32:41] The benefits of starting from a general compiler architecture * [00:35:13] Origins of deciding to create the Mojo language * [00:44:43] Goals for Mojo to become a true Python superset * [00:48:12] Thoughts on tinygrad * [00:52:00] ggml, quantization, etc * [00:57:00] Speculative execution and other gains from making Mojo more parallel * [01:01:50] Future of Mojo’s toolkit * [01:07:00] Why Modular is a company and not a foundation * [01:11:00] Learnings as a first time founder and engineering leader * [01:25:00] Lightning Round Transcript Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19] Swyx: Hey, and today we have Chris Lattner in the house. Welcome, Chris. [00:00:21] Chris: Hi both. Thanks for having me. [00:00:24] Swyx: We're so excited to have you. We have so many questions and we'll try to get through as many as we can. You're one of the easiest people to research I've ever had on the pod, because you document yourself extensively on https://nondot.org/sabre/. What's the story behind that, just quickly? [00:00:40] Chris: I mean, I've had that website for, since, I don't know, the mid-90s. So it's been a very, very, very long time, and I originally had a big personal page. Again, this was the mid-90s with all the scroll tags and all that kind of stuff. Yeah, exactly. [00:00:56] Swyx: The animated gifs. “Under construction.” [00:00:57] Chris: Yeah. It has been rebooted a few times, and web design is not my strong point, but the server was originally named after some fish we had. That was the origin of non-dot. [00:01:08] Swyx: I love it. I looked on Tanya's page and she has some spaniels. [00:01:12] Chris: Yep. We're dog people. We love many animals. [00:01:15] Swyx: So your quick bio, you did your PhD in CS in 2005, and then immediately went into Apple working on LLVM, the compiler framework that you created during your PhD. In our prep, you also maybe had a favorite Scott Forstall story. [00:01:32] Chris: Well, so I got to work with a lot of really interesting people at Apple. Scott was actually pretty famous. Scott is responsible for many things across the years, but he really drove the iPhone. At leas
As alluded to on the pod, LangChain has just launched LangChain Hub: “the go-to place for developers to discover new use cases and polished prompts.” It’s available to everyone with a LangSmith account, no invite code necessary. Check it out! In 2023, LangChain has speedrun the race from 2:00 to 4:00 to 7:00 Silicon Valley Time. From the back to back $10m Benchmark seed and (rumored) $20-25m Sequoia Series A in April, to back to back critiques of “LangChain is Pointless” and “The Problem with LangChain” in July, to teaching with Andrew Ng and keynoting at basically every AI conference this fall (including ours), it has been an extreme rollercoaster for Harrison and his growing team creating one of the most popular (>60k stars at time of writing) building blocks for AI Engineers. LangChain’s Origins The first commit to LangChain shows its humble origins as a light wrapper around Python’s formatter.format for prompt templating. But as Harrison tells the story, even his first experience with text-davinci-002 in early 2022 was focused on chatting with data from their internal company Notion and Slack, what is now known as Retrieval Augmented Generation (RAG). As the Generative AI meetup scene came to life post Stable Diffusion, Harrison saw a need for common abstractions for what people were building with text LLMs at the time: * LLM Math, aka Riley Goodside’s “You Can’t Do Math” REPL-in-the-loop (PR #8) * Self-Ask With Search, Ofir Press’ agent pattern (PR #9) (later ReAct, PR #24) * NatBot, Nat Friedman’s browser controlling agent (PR #18) * Adapters for OpenAI, Cohere, and HuggingFaceHub All this was built and launched in a few days from Oct 16-25, 2022. Turning research ideas/exciting usecases into software quickly and often has been in the LangChain DNA from Day 1 and likely a big driver of LangChain’s success, to date amassing the largest community of AI Engineers and being the default launch framework for every big name from Nvidia to OpenAI: Dancing with Giants But AI Engineering is built atop of constantly moving tectonic shifts: * ChatGPT launched in November (“The Day the AGI Was Born”) and the API released in March. Before the ChatGPT API, OpenAI did not have a chat endpoint. In order to build a chatbot with history, you had to make sure to chain all messages and prompt for completion. LangChain made it easy to do that out of the box, which was a huge driver of usage. * Today, OpenAI has gone all-in on the chat API and is deprecating the old completions models, essentially baking in the chat pattern as the default way most engineers should interact with LLMs… and reducing (but not eliminating) the value of ConversationChains. * And there have been more updates since: Plugins released in API form as Functions in June (one of our top pods ever… reducing but not eliminating the value of OutputParsers) and Finetuning in August (arguably reducing some need for Retrieval and Prompt tooling). With each update, OpenAI and other frontier model labs realign the roadmaps of this nascent industry, and Harrison credits the modular design of LangChain in staying relevant. LangChain has not been merely responsive either: LangChain added Agents in November, well before they became the hottest topic of the AI Summer, and now Agents feature as one of LangChain’s top two usecases. LangChain’s problem for podcasters and newcomers alike is its sheer scope - it is the world’s most complete AI framework, but it also has a sprawling surface area that is difficult to fully grasp or document in one sitting. 
This means it’s time for the trademark Latent Space move (ChatGPT, GPT4, Auto-GPT, and Code Interpreter (now Advanced Data Analysis) GPT4.5): the executive summary! What is LangChain? As Harrison explains, LangChain is an open source framework for building context-aware reasoning applications, available in Python and JS/TS. It launched in Oct 2022 with the central value proposition of “composability”, aka the idea that every AI engineer will want to switch LLMs, and combine LLMs with other things into “chains”, using a flexible interface that can be saved via a schema. Today, LangChain’s principal offerings can be grouped as: * Components: isolated modules/abstractions * Model I/O * Models (for LLM/Chat/Embeddings, from OpenAI, Anthropic, Cohere, etc) * Prompts (Templates, ExampleSelectors, OutputParsers) * Retrieval (revised and reintroduced in March) * Document Loaders (eg from CSV, JSON, Markdown, PDF) * Text Splitters (15+ various strategies for chunking text to fit token limits) * Retrievers (generic interface for turning an unstructured query into a set of documents - for self-querying, contextual compression, ensembling) * Vector Stores (retrievers that search by similarity of embeddings) * Indexers (sync documents from any source into a vector store without duplication) * Memory (for long running chats, whether a simple Buffer, Knowledge Graph, Summary, or Vector Store) * Use-Cases: compositions of Components * Chains: combining a PromptTemplate, LLM Model and optional OutputParser * with Router, Sequential, and Transform Chains for advanced usecases * savable, sharable schemas that can be loaded from LangChainHub * Agents: a chain that has access to a suite of tools, of nondeterministic length because the LLM is used as a reasoning engine to determine which actions to take and in which order. Notable 100LOC explainer here. * Tools (interfaces that an agent can use to interact with the world - preset list here. Includes things like ChatGPT plugins, Google Search, WolframAlpha. Groups of tools are bundled up as toolkits) * AgentExecutor (the agent runtime, basically the while loop, with support for controls, timeouts, memory sharing, etc) * LangChain has also added a Callbacks system for instrumenting each stage of LLM, Chain, and Agent calls (which enables LangSmith, LangChain’s first cloud product), and most recently an Expression Language, a declarative way to compose chains (sketched below). LangChain the company incorporated in January 2023, announced their seed round in April, and launched LangSmith in July. At time of writing, the company has 93k followers, their Discord has 31k members and their weekly webinars are attended by thousands of people live. The full-featuredness of LangChain means it is often the first starting point for building any mainstream LLM use case, because they are most likely to have working guides for the new developer. Logan (our first guest!) from OpenAI has been a notable fan of both LangChain and LangSmith (they will be running the first LangChain + OpenAI workshop at AI Eng Summit). However, LangChain is not without its critics, with Aravind Srinivas, Jim Fan, Max Woolf, Mckay Wrigley and the general Reddit/HN community describing frustrations with the value of their abstractions, and many are attempting to write their own (the common experience of adding and then removing LangChain is something we covered in our Agents writeup). Harrison compares this with the timeless ORM debate on the value of abstractions. 
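The composability pitch is easiest to see in the Expression Language — a minimal sketch using the package layout current at the time of this episode (imports have been reshuffled since), with an OpenAI key assumed in the environment:

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# Components: a prompt template, a model, and an output parser...
prompt = ChatPromptTemplate.from_template(
    "Summarize the following meeting notes in three bullet points:\n\n{notes}"
)
model = ChatOpenAI(model="gpt-3.5-turbo")

# ...declaratively composed into a chain. Swapping in a different provider
# means changing the `model` line and nothing else.
chain = prompt | model | StrOutputParser()

print(chain.invoke({"notes": "Q3 roadmap review: shipped evals, slipped on RAG..."}))
```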
LangSmith Last month, Harrison launched LangSmith, their LLM observability tool and first cloud product. LangSmith makes it easy to monitor all the different primitives that LangChain offers (agents, chains, LLMs) as well as making it easy to share and evaluate them both through heuristics (i.e. manually written ones) and “LLM evaluating LLM” flows. The top HN comment in the “LangChain is Pointless” thread observed that orchestration is the smallest part of the work, and the bulk of it is prompt tuning and data serialization. When asked about this directly on our pod, Harrison agreed: “I agree that those are big pain points that get exacerbated when you have these complex chains and agents where you can't really see what's going on inside of them. And I think that's partially why we built Langsmith…” (48min mark) You can watch the full launch on the LangChain YouTube. It’s clear that the target audience for LangChain is expanding to folks who are building complex, production applications rather than focusing on the simpler “Q&A your docs” use cases that made it popular in the first place. As the AI Engineer space matures, there will be more and more tools graduating from supporting “hobby” projects to more enterprise-y use cases. In this episode we run through some of the history of LangChain, how it’s growing from an open source project to one of the highest valued AI startups out there, and its future. We hope you enjoy it! Show Notes * LangChain * LangChain’s Berkshire Hathaway Homepage * Abstractions tweet * LangSmith * LangSmith Cookbooks repo * LangChain Retrieval blog * Evaluating CSV Question/Answering blog and YouTube * MultiOn Partner blog * Harvard Sports Analytics Collective * Evaluating RAG Webinar * awesome-langchain: * LLM Math Chain * Self-Ask * LangChain Hub UI * “LangChain is Pointless” * Harrison’s links * sports - estimating player compatibility in the NBA * early interest in prompt injections * GitHub * Twitter Timestamps * [00:00:00] Introduction * [00:00:48] Harrison's background and how sports led him into ML * [00:04:54] The inspiration for creating LangChain - abstracting common patterns seen in other GPT-3 projects * [00:05:51] Overview of LangChain - a framework for building context-aware reasoning applications * [00:10:09] Components of LangChain - modules, chains, agents, etc. * [00:14:39] Underappreciated parts of LangChain - text splitters, retrieval algorithms like self-query * [00:18:46] Hiring at LangChain * [00:20:27] Designing the LangChain architecture - balancing flexibility and structure * [00:24:09] The difference between chains and agents in LangChain * [00:25:08] Prompt engineering and LangChain * [00:26:16] Announcing LangSmith * [00:30:50] Writing custom evaluators in LangSmith * [00:33:19] Reducing hallucinations - fixing retrieval vs generation issues
The AI Engineer Summit Expo has been announced, presented by AutoGPT (and future guest Toran Bruce-Richards!). Stay tuned for more updates on the Summit livestream and Latent Space University. This post was on HN for 10 hours. What comes after the Transformer? This is one of the Top 10 Open Challenges in LLM Research that has been the talk of the AI community this month. Jon Frankle (friend of the show!) has an ongoing bet with Sasha Rush on whether Attention is All You Need, and the most significant challenger to emerge this year has been RWKV - Receptance Weighted Key Value models, which revive the RNN for GPT-class LLMs, inspired by a 2021 paper on Attention Free Transformers from Apple (surprise!). What this means practically is that RWKV models tend to scale in all directions (both in training and inference) much better than Transformers-based open source models, while remaining competitive on standard reasoning benchmarks. swyx was recently in Singapore for meetings with AI government and industry folks, and grabbed 2 hours with RWKV committee member Eugene Cheah for a deep dive, the full recording of which is now up on Latent Space TV. Today we release both the 2hr video and an edited 1hr audio version, to cater to the different audiences and provide “ablation opportunities” on RWKV interest level. The Eleuther Mafia? The RWKV project is notable not merely because of the credible challenge to the Transformers dominance. It is also a distributed, international, mostly uncredentialed community reminiscent of early 2020s Eleuther AI: * Primarily Discord, pseudonymous, GPU-poor volunteer community somehow coordinating enough to train >10B, OPT/BLOOM-competitive models * Being driven by the needs of its community, it is extremely polyglot (e.g. English, Chinese, Japanese, Arabic) not because it needs to beat some benchmarks, but because its users want it to be for their own needs. * “Open Source” in both the good and the bad way - properly Apache 2.0 licensed (not “open but restricted”), yet trained on data taken from commercially compromised sources like the Pile (where Shawn Presser’s Books3 dataset has been recently taken down) and Alpaca (taking from Steven Tey’s ShareGPT which is technically against OpenAI TOS) The threadboi class has loved tracking the diffusion of Transformers paper authors out into the industry. But perhaps the underdog version of this is tracking the emerging Eleuther AI mafia. It will be fascinating to see how both Eleuther and Eleuther alums fare as they build out the future of both LLMs and open source AI. Audio Version Timestamps assisted by smol-podcaster. Different timestamps vs the 2hr YouTube * [00:05:35] Eugene's path into AI at UIlicious * [00:07:33] Tokenizer penalty and data efficiency of Transformers * [00:08:02] Using Salesforce CodeGen * [00:10:17] The limitations of Transformers for handling large context sizes * [00:13:17] RWKV compute costs compared to Transformers * [00:16:06] How Eugene found RWKV early * [00:18:52] RWKV's focus on supporting many languages, not just English * [00:21:24] Using the RWKV model for fine-tuning for specific languages * [00:24:45] What is RWKV? 
* [00:33:46] Overview of the different RWKV models like World, Raven, Novel * [00:41:34] Background of Blink, the creator of RWKV * [00:49:55] The linear vs quadratic scaling of RWKV vs Transformers * [00:53:29] RWKV matching Transformer performance on reasoning tasks * [00:54:31] The community's lack of marketing for RWKV * [00:57:00] The English-language bias in AI models * [01:00:33] Plans to improve RWKV's memory and context handling * [01:03:10] Advice for AI engineers wanting to get more technical knowledge Show Notes Companies/Organizations: * RWKV - HF blog, paper, docs, GitHub, Huggingface * Raven 14B (finetuned on Alpaca+ShareGPT+...) Demo * World 7B (supports 100+ world languages) Demo * How RWKV works in 100 LOC, RWKV overview * EleutherAI - Decentralized open source AI research group * Stability AI - Creators of Stable Diffusion * Conjecture - Spun off from EleutherAI People: * Eugene Cheah - CTO of UIlicious, member of RWKV committee (GitHub, Twitter) * Blink/Bo Peng - Creator of RWKV architecture * Quentin Anthony - our Latent Space pod on Eleuther, coauthor on RWKV * Sharif Shameem - our Latent Space pod on being early to Stable Diffusion * Tri Dao - our Latent Space pod on FlashAttention making Attention subquadratic * Linus Lee - our Latent Space pod in NYC * Jonathan Frankle - our Latent Space pod about Transformers longevity * Chris Re - Genius at Stanford working on state-space models * Andrej Karpathy - Zero to Hero series * Justine Tunney ("Justine.lol") - mmap trick Models/Papers: * Top 10 Open Challenges in LLM Research * Retentive Network: A Successor to Transformer for Large Language Models * GPT-NeoX - Open source replica of GPT-3 by EleutherAI * Salesforce CodeGen and CodeGen 2 * Attention Free Transformers paper * The Pile * RedPajama dataset * Monarch Mixer - Revisiting BERT, Without Attention or MLPs Misc Notes RWKV is not without known weaknesses - Transformers do well in reasoning because they are expressive in the forward pass, yet the RWKV docs already note that it is sensitive to prompt formatting and poor at lookback tasks. We also asked pointed questions about RWKV’s challenges in the full podcast. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Thanks to the almost 30k people who tuned in to the last episode! Your podcast cohosts have been busy shipping:

* Alessio open sourced smol-podcaster, which makes the show notes here!
* swyx launched GodMode. Maybe someday the Cursor of browsers?
* We’re also helping organize a Llama Finetuning Hackameetup this Saturday in anticipation of the CodeLlama release.

Lastly, more speakers were announced at AI Engineer Summit! 👀

~46% of code typed through VS Code is written by Copilot. How do we get closer to 90+%? Aman Sanger says we need a brand new AI-powered IDE to get there; and we’re excited to be the first podcast ever to tell the Cursor story. If you haven’t heard of Cursor, you may have been living under a rock. Here are just some of the rave reviews going around in the past week alone:

* “Cursor is the best product I've used in a while” - Alex MacCaw
* “Someone finally put GPT into a code editor in a seamless way. It's so elegant and easy. No more copying and pasting.” - Andrew McCalip
* “Coding with AI is getting insane.” - Mckay Wrigley
* “This is mind blowing 🤯” - Linus Ekenstam
* “Cursor + gpt4-32k = illegal levels of productivity” - Sully Omarr
* “EL MEJOR EDITOR DE CÓDIGO con IA” (“THE BEST AI CODE EDITOR”) - Carlos Santana

A decade ago, “platform risk” meant building apps on social media platforms was risky as you could get cut off from the social network. Today, the AI version of “platform risk” is building AI products within an existing product (like an AI extension for VS Code, or a Figma plugin). Since Copilot, a generation of VSCode plugins has launched (including Cody, Cosine, and previous guests Codeium and Codium), only to be challenged by Copilot X itself.

A core AI Engineering thesis is that new capabilities in AI demand new innovation in AI UX (and that AI UX can actually be a viable moat). Take VS Code for example; when GitHub was first working on Copilot, there was actually no way to support the “ghost autocomplete” feature we all use today. They eventually convinced the team to build it, and Copilot’s success speaks for itself. If you’re a startup building on top of VSC today, you do not have the same access and influence on the roadmap. Your UX is limited to what they allow you to do, and often that caps your ability to successfully compete against them. Since Cursor owns the whole IDE, they can do things you can’t (yet) do in VSCode.

Cursor’s Gameplan

Cursor is competing head to head against VS Code by forking Microsoft’s IDE and building their own AI-powered version. A few of Cursor’s unique features:

* Native chat: Chat is a core piece of Cursor. Users can choose between GPT-3.5 and GPT-4 to ask questions and receive answers based on their code.
* “Mentioning” files: you can easily add files into your request context by using “@”; this works both for code as well as documentation. If you want to do a change that includes multiple files, you can include them in your question to make sure the change is reflected in all of them.
* Custom prompting engine: Cursor built Priompt, their custom prompting engine. As your chats go over the context window size, Priompt figures out which messages to keep in the history, which files to drop from the prompt, etc. (a simplified sketch of the idea follows below).
* Moving beyond typing: while IDEs are familiar to folks as today’s interfaces, in the future Cursor hopes to have agents you can delegate tasks to. Instead of a back and forth on a new feature or bug fix, you can ask it to do the whole thing for you end to end.
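To make the Priompt bullet concrete, here is a minimal sketch of priority-based prompt packing. This is illustrative Python, not Priompt's actual API (the real library is a JSX-style TypeScript package); the token counter, element texts, and priorities are all invented for the example:

```python
from dataclasses import dataclass

@dataclass
class PromptElement:
    text: str
    priority: int  # higher = more important to keep under pressure

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (e.g. tiktoken): ~4 chars/token.
    return max(1, len(text) // 4)

def render(elements: list[PromptElement], budget: int) -> str:
    # Greedily keep the highest-priority elements that fit the budget,
    # then restore the original ordering so the prompt reads coherently.
    kept, used = [], 0
    for el in sorted(elements, key=lambda e: -e.priority):
        cost = count_tokens(el.text)
        if used + cost <= budget:
            kept.append(el)
            used += cost
    kept.sort(key=elements.index)
    return "\n\n".join(el.text for el in kept)

# With a tight budget, stale chat history and bulky files fall out first,
# while the system message and the latest user request survive.
prompt = render([
    PromptElement("System: you are a coding assistant.", priority=100),
    PromptElement("Chat history from an hour ago ...", priority=10),
    PromptElement("@utils.py (3,000 tokens of source) ...", priority=30),
    PromptElement("User: why does parse() throw on empty input?", priority=90),
], budget=25)
```

The design point worth noting: truncation decisions are declared at prompt-construction time rather than by string-slicing after the fact, which keeps behavior predictable as the context window fills up.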
After diving deep into Cursor we nerded out on model usage, training, quantization, and evaluation. There’s a ton of great content in this episode; we hope you’ll enjoy it! As always, feedback welcome in the comments, and tag us on socials for future guest suggestions!

Show Notes

* Cursor
* Gary Marcus’ cubes prompt
* Priompt
* “Humans should focus on bigger problems.”
* Codium AI on Latent Space
* Rift from Morph
* Sourcegraph
* E2B
* Repl.it
* HungryHungryHippos, Hyena, etc (see our FlashAttention episode)
* Aman Tweets
* Why GPT-3.5 is (mostly) cheaper than Llama 2
* Llama’s architectural limitations
* “Training will look like researchers/practitioners offloading large-scale training jobs to specialized “training” companies: a state of the world that resembles chip design & fabrication.” - Mosaic prediction
* “The size of all code/history on Github public repos is 92TB. The size of Google's monorepo in 2015 was 86TB (of much higher quality code). If Google were willing to deploy code models trained on their own data, they'd have a noticeable advantage over everyone else.” - May 2023

Timestamps

* [00:00:00] Intros
* [00:02:31] Developing CAD models vs coding models
* [00:05:23] Deciding to build a new IDE optimized for large language models
* [00:10:50] Getting early access to GPT-4 and realizing its potential for software development
* [00:12:32] Rethinking the UI/UX for coding
* [00:18:24] Cursor's features like system prompts and chat
* [00:22:24] Tips for prompting GPT-3/4 for code generation and editing
* [00:27:24] Cursor's documentation and context features
* [00:29:30] The potential of coding agents like Code Interpreter
* [00:38:23] Cursor's internal prompting tool Priompt
* [00:40:47] The challenges of very long context lengths for models
* [00:45:44] The compute costs for prompt tokens vs. completion tokens
* [00:49:36] How quantization interacts with model utilization
* [00:51:24] Issues with human eval for benchmarking code models
* [00:53:12] Thoughts on training models vs. relying on foundation models from big providers
* [00:55:34] The origin story of Cursor's parent company AnySphere
* [00:56:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space.

[00:00:20] Swyx: Hey, and today we're back in the studio again after a little break and we have Aman Sanger in the house. Hey Aman. Hey, thanks for coming. Thanks for having me. So I wanted to introduce our guest and then have you fill in the blanks. So you worked at Gamelon, Bridgewater, McKinsey, Google, and You.com, all on sort of kind of AI related things and some finance related things. You also ran your own consultancy, Abelian AI, and you graduated in CS and math from MIT recently. Worked on a few projects, including Instill, which I think we'll cover a little bit later, and most recently Cursor.so, which we'll cover for the vast majority of the podcast. But just on a personal side, what's one thing that people should know about you that, you know, might not be so obvious on LinkedIn? Oh, interesting.

[00:01:01] Aman: In a previous life, I played a lot of squash.

[00:01:05] Swyx: You were a top seed?

[00:01:06] Aman: Yeah. So in high school, I kind of competed in tournaments and most people probably don't really know what squash is. It's like tennis in many ways. It's like a racket sport, but it's indoors. You play against a wall.
I guess now pickleball is all the rage with, with racket sports, but yeah, the story is I used to play tennis and then I moved to a building that had a squash court in it and then I picked it up. I loved it. And I've been playing ever since. So I competed a lot in high school, played a bunch at MIT, have not had the chance to play much here. In San Francisco, there aren't too many courts.

[00:01:38] Swyx: We can organize a squash tournament and then you'll crush it, of course. Is there anything about the athlete mentality that you take with you as a founder?

[00:01:47] Aman: Yeah, I think it can be at times a bit too much, but I'm very competitive. I really hate losing. Now I think I'll go on runs and if someone tries passing me, I won't let it happen. I'll just kick it into overdrive and maybe I'll turn the corner if I know they're going to beat me, but I can't let someone pass me when I'm running. And I think the same is true with startups, where the competitive nature, I think it in general helps motivate me and makes me, I guess, just work harder.

[00:02:17] Swyx: Yeah. Okay. Well, we'll have a bunch of competitive questions later, but we'll go over the timeline.

[00:02:22] Alessio: Let's jump into how you got to Cursor. So in August 2022, you launched something called Instill. Can you talk a little bit about that?

[00:02:31] Aman: Yeah, and maybe before I go into Instill, I should talk about what I was even doing before that, because Instill was actually a very brief foray from what I was doing with my original co-founder, Michael. So we had both actually gone to the same high school together, gone to MIT together. And then after graduating, we knew we wanted to start something. And in June, what we were working on was also called Cursor, but very different. We basically were very, very fanatical users of Copilot. We loved it. And we had a little bit of experience with computer-aided design or CAD software. A lot of our friends, in fact, were mechanical engineers. And we'd heard a lot about how tedious it was to just design these parts in software like SOLIDWORKS and whatnot. It was pretty obvious to us that if you could train a transformer on the task of predicting the next token, not just for code, but for CAD, then you could get a really useful product that could speed up mechanical engineering. So that's actually what we'd worked on up until Instill, even a little bit after Instill. And yeah, I can go into more detail about that. It was pretty interesting. That's probably how, despite these days doing less stuff with model training than in the past. For that, it was all just kind of rolling our own models from scratch, a lot of training, a lot of inference.

[00:03:48] Alessio: I'm always curious to hear about what made you interested in that. Obviously, you've been at the forefront of a lot of this AI work. Why was that the
Invites are going out for AI Engineer Summit! In the meantime, we have just announced our first Actually Open AI event with Brev.dev and Langchain, Aug 26 in our SF HQ (we’ll record talks for those remote). See you soon (and join the Discord)! Special thanks to @nearcyan for helping us arrange this with the Eleuther team. This post was on the HN frontpage for 15 hours.

As startups and even VCs hoard GPUs to attract talent, the one thing more valuable than GPUs is knowing how to use them (aka, make GPUs go brrrr). There is an incredible amount of tacit knowledge in the NLP community around training, and until Eleuther.ai came along you pretty much had to work at Google or Meta to gain that knowledge. This makes it hard for non-insiders to even do simple estimations around costing out projects - it is well known how to trade $ for GPU hours, but trading “$ for size of model” or “$ for quality of model” is less known and more valuable and full of opaque “it depends”. This is why rules of thumb for training are incredibly useful, because they cut through the noise and give you the simple 20% of knowledge that determines 80% of the outcome, derived from hard earned experience.

Today’s guest, Quentin Anthony from EleutherAI, is one of the top researchers in high-performance deep learning. He’s one of the co-authors of Transformers Math 101, which was one of the clearest articulations of training rules of thumb. We can think of no better way to dive into training math than to have Quentin run us through a masterclass on model weights, optimizer states, gradients, activations, and how they all impact memory requirements.

The core equation you will need to know is the following:

`C = τT = 6PD`

Where C is the compute required to train a model, P is the number of parameters, and D is the size of the training dataset in tokens. C is also equal to τ, the throughput of your machine measured in FLOPs (Actual FLOPs/GPU * # of GPUs), multiplied by T, the amount of time spent training the model. Taking Chinchilla scaling at face value (D ≈ 20P), you can simplify this equation to `C = 120(P^2)`. These laws are only true when 1000 GPUs for 1 hour cost the same as 1 GPU for 1000 hours, so it’s not always that easy to make these assumptions, especially when it comes to communication overhead. There’s a lot more math to dive into here between training and inference, which you can listen to in the episode or read in the articles.

The other interesting concept we covered is distributed training and strategies such as ZeRO and 3D parallelism. As these models have scaled, it’s become impossible to fit everything in a single GPU for training and inference. We leave these advanced concepts to the end, but there’s a lot of innovation happening around sharding of params, gradients, and optimizer states that you must know is happening in modern LLM training. If you have questions, you can join the Eleuther AI Discord or follow Quentin on Twitter.
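To make the equation concrete, here is a back-of-envelope calculation in Python using GPT-3-scale numbers (P = 175B parameters, D = 300B tokens) and the ~115 teraflops/sec-per-A100 baseline for actual (not theoretical) FLOPs that comes up in the episode; the cluster size is an arbitrary assumption:

```python
# C = tau * T = 6 * P * D, solved for wall-clock training time T.
P = 175e9             # parameters (GPT-3 scale)
D = 300e9             # training tokens (GPT-3 scale)
C = 6 * P * D         # total training compute, in FLOPs (~3.15e23)

gpus = 1024           # assumed cluster size
tau = 115e12 * gpus   # actual FLOPs/sec across the cluster
T = C / tau           # seconds of training

print(f"C = {C:.2e} FLOPs -> ~{T / 86400:.0f} days on {gpus} A100s")
```

This prints roughly 31 days; halve the GPUs and the time doubles, which is exactly the "1000 GPUs for 1 hour vs 1 GPU for 1000 hours" assumption the post warns only holds until communication overhead starts to bite.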
Show Notes * Transformers Math 101 Article * Eleuther.ai * GPT-NeoX 20B * BLOOM * Turing NLG * Mosaic * Oak Ridge & Frontier Supercomputer * Summit Supercomputer * Lawrence Livermore Lab * RWKV * Flash Attention * Stas Bekman Timestamps * [00:00:00] Quentin's background and work at Eleuther.ai * [00:03:14] Motivation behind writing the Transformers Math 101 article * [00:05:58] Key equation for calculating compute requirements (tau x T = 6 x P x D) * [00:10:00] Difference between theoretical and actual FLOPs * [00:12:42] Applying the equation to estimate compute for GPT-3 training * [00:14:08] Expecting 115+ teraflops/sec per A100 GPU as a baseline * [00:15:10] Tradeoffs between Nvidia and AMD GPUs for training * [00:18:50] Model precision (FP32, FP16, BF16 etc.) and impact on memory * [00:22:00] Benefits of model quantization even with unlimited memory * [00:23:44] KV cache memory overhead during inference * [00:26:08] How optimizer memory usage is calculated * [00:32:03] Components of total training memory (model, optimizer, gradients, activations) * [00:33:47] Activation recomputation to reduce memory overhead * [00:38:25] Sharded optimizers like ZeRO to distribute across GPUs * [00:40:23] Communication operations like scatter and gather in ZeRO * [00:41:33] Advanced 3D parallelism techniques (data, tensor, pipeline) * [00:43:55] Combining 3D parallelism and sharded optimizers * [00:45:43] Challenges with heterogeneous clusters for distribution * [00:47:58] Lightning Round Transcription Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20] Swyx: Hey, today we have a very special guest, Quentin Anthony from Eleuther.ai. The context for this episode is that we've been looking to cover Transformers math for a long time. And then one day in April, there's this blog post that comes out that literally is called Transformers Math 101 from Eleuther. And this is one of the most authoritative posts that I've ever seen. And I think basically on this podcast, we're trying to give people an intuition around what are the rules of thumb that are important in thinking about AI and reasoning by AI. And I don't think there's anyone more credible than the people at Eleuther or the people training actual large language models, especially on limited resources. So welcome, Quentin. [00:00:59] Quentin: Thank you. A little bit about myself is that I'm a PhD student at Ohio State University, starting my fifth year now, almost done. I started with Eleuther during the GPT-NeoX20B model. So they were getting started training that, they were having some problems scaling it. As we'll talk about, I'm sure today a lot, is that communication costs and synchronization and how do you scale up a model to hundreds of GPUs and make sure that things progress quickly is really difficult. That was really similar to my PhD work. So I jumped in and helped them on the 20B, getting that running smoothly. And then ever since then, just as new systems challenges arise, and as they move to high performance computing systems and distributed systems, I just sort of kept finding myself falling into projects and helping out there. So I've been at Eleuther for a little bit now, head engineer there now, and then finishing up my PhD and then, well, who knows where I'll go next. [00:01:48] Alessio: Awesome. What was the inspiration behind writing the article? 
Was it taking some of those learnings? Obviously Eleuther is one of the most open research places out there. Is it just part of the DNA there or any fun stories there? [00:02:00] Quentin: For the motivation for writing, you very frequently see in like the DL training space, like these Twitter posts by like, for example, like Stas Bekman at Hugging Face, you'll see like a Twitter post that's like, oh, we just found this magic number and everything is like 20% faster. He’s super excited, but doesn't really understand what's going on. And the same thing for us, we very frequently find that a lot of people understand the theory or maybe the fundamentals of why like AI training or inference works, but no one knows like the nitty gritty details of like, how do you get inference to actually run correctly on your machine split across two GPUs or something like that. So we sort of had all of these notes that we had accumulated and we're sort of sharing among engineers within Eleuther and we thought, well, this would really help a lot of other people. It's not really maybe appropriate for like a paper, but for something like a blog post or technical report, this would actually maybe squeeze a lot of performance out of people's hardware they're already running on. So I guess there are a lot of projects in Eleuther that we're sort of trying to share notes with people in a way that typical institutions don't. They sort of live within that institution and then you go to a different institution and they do something very similar, but without the lessons of the previous. And it's because everyone's trying to do their own special sauce with their own stack. Whereas Eleuther, we don't really have that constraint and we can just share everything to everybody. [00:03:14] Swyx: Yeah, this is a level of openness that basically very few people actually embrace. One, it's an extra effort to write things down, of course, but two, it is secret sauce and so that not many people do it. And therefore, oftentimes the only way to learn this stuff is to actually work in one of the large model labs. And so you guys are doing a lot. The only other instance where I can think of where people actually open sourced their process was Facebook's OPT. What else is similar, like sort of trade knowledge, but not formal research knowledge? [00:03:45] Quentin: I would say Bloom. So the Hugging Face Bloom project in big science and all of that, that was very open. I'd say it's the same caliber, if not more detailed than OPT. Other than that, I think there was like a doc from Microsoft on like their Turing NLG. Their paper is pretty relaxed in that it did talk about some of those challenges. Other than like OPT and Bloom and us, I can't think of any. It's a new thing. [00:04:10] Swyx: It matters that you are going for the sort of good enough rules of thumb, because I think a lot of people try to go for precision and being overly precise actually is not helpful. Right. Yes. [00:04:20] Quentin: You'll see some like statements in the blog posts that are just like, we think this is about 1.2 in our experience. And, you know, we don't go any further into detail and it would take maybe an extra month for us to chase down every single little piece of memory. But instead, like getting good enough is still helpful to people. [00:04:36] Alessio: Let's jump into it. The first part of the article, and we'll put this in the show notes so people will be following along with the pos
We have just announced our first set of speakers at AI Engineer Summit! Sign up for the livestream or email [email protected] if you’d like to support.

We are facing a massive GPU crunch. As both startups and VCs hoard Nvidia GPUs like countries count nuclear stockpiles, tweets about GPU shortages have become increasingly common. But what if we could run LLMs with AMD cards, or without a GPU at all? There’s just one weird trick: compilation. And there’s one person uniquely qualified to do it.

We had the pleasure to sit down with Tianqi Chen, who’s an Assistant Professor at CMU, where he both teaches the MLC course and runs the MLC group. You might also know him as the creator of XGBoost, Apache TVM, and MXNet, as well as the co-founder of OctoML. The MLC (short for Machine Learning Compilation) group has released a lot of interesting projects:

* MLC Chat: an iPhone app that lets you run models like RedPajama-3B and Vicuna-7B on-device. It gets up to 30 tok/s!
* Web LLM: Run models like LLaMA-70B in your browser (!!) to offer local inference in your product.
* MLC LLM: a framework that allows any language model to be deployed natively on different hardware and software stacks.

The MLC group has just announced new support for AMD cards; we previously talked about the shortcomings of ROCm, but using MLC you can get performance very close to NVIDIA’s counterparts. This is great news for founders and builders, as AMD cards are more readily available. Here are their latest results on AMD’s 7900s vs some of the top NVIDIA consumer cards.

If you just can’t get a GPU at all, MLC LLM also supports ARM and x86 CPU architectures as targets by leveraging LLVM. While speed performance isn’t comparable, it allows for non-time-sensitive inference to be run on commodity hardware.

We also enjoyed getting a peek into TQ’s process, which involves a lot of sketching. With all the other work going on in this space with projects like ggml and Ollama, we’re excited to see GPUs becoming less and less of an issue to get models in the hands of more people, and innovative software solutions to hardware problems!

Show Notes

* TQ’s Projects:
* XGBoost
* Apache TVM
* MXNet
* MLC
* OctoML
* CMU Catalyst
* ONNX
* GGML
* Mojo
* WebLLM
* RWKV
* HiPPO
* Tri Dao’s Episode
* George Hotz Episode

People:

* Carlos Guestrin
* Albert Gu

Timestamps

* [00:00:00] Intros
* [00:03:41] The creation of XGBoost and its surprising popularity
* [00:06:01] Comparing tree-based models vs deep learning
* [00:10:33] Overview of TVM and how it works with ONNX
* [00:17:18] MLC deep dive
* [00:28:10] Using int4 quantization for inference of language models
* [00:30:32] Comparison of MLC to other model optimization projects
* [00:35:02] Running large language models in the browser with WebLLM
* [00:37:47] Integrating browser models into applications
* [00:41:15] OctoAI and self-optimizing compute
* [00:45:45] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space.

[00:00:20] Swyx: Okay, and we are here with Tianqi Chen, or TQ as people call him, who is assistant professor in ML computer science at CMU, Carnegie Mellon University, also helping to run Catalyst Group, also chief technologist of OctoML. You wear many hats. Are those, you know, your primary identities these days? Of course, of course.

[00:00:42] Tianqi: I'm also, you know, very enthusiastic about open source.
So I'm also a VP and PMC member of the Apache TVM project and so on. But yeah, these are the things I've been up to so far.

[00:00:53] Swyx: Yeah. So you did Apache TVM, XGBoost, and MXNet, and we can cover any of those in any amount of detail. But maybe what's one thing about you that people might not learn from your official bio or LinkedIn, you know, on the personal side?

[00:01:08] Tianqi: Let me say, yeah, so normally when I do, I really love coding, even though like I'm trying to run all those things. So one thing that I keep a habit on is I try to do sketchbooks. I have a book, like real sketchbooks, to draw down the design diagrams, and the sketchbooks I keep sketching over the years; now I have like three or four of them. And it's kind of usually a fun experience of thinking the design through and also seeing how an open source project evolves and also looking back at the sketches that we had in the past to say, you know, all these ideas really turn into code nowadays.

[00:01:43] Alessio: How many sketchbooks did you get through to build all this stuff? I mean, if one person alone built one of those projects, he'd be a very accomplished engineer. Like you built like three of these. What's that process like for you? Like it's the sketchbook, like the start, and then you think about the code or like.

[00:01:59] Swyx: Yeah.

[00:02:00] Tianqi: So, so usually I start sketching on high level architectures and also in a project that works over years, we also start to think about, you know, new directions, like of course generative AI language model comes in, how it's going to evolve. So normally I would say it takes like one book a year, roughly at that rate. It's usually fun to, I find it's much easier to sketch things out, and then it gives more like a high level architectural guide for some of the future items. Yeah.

[00:02:28] Swyx: Have you ever published these sketchbooks? Cause I think people would be very interested, at least on a historical basis. Like this is the time where XGBoost was born, you know? Yeah, not really.

[00:02:37] Tianqi: I started sketching like after XGBoost. So that's a kind of missing piece, but a lot of design details in TVM are actually part of the books that I try to keep a record of.

[00:02:48] Swyx: Yeah, we'll try to publish them and publish something in the journals. Maybe you can grab a little snapshot for visual aid. Sounds good.

[00:02:57] Alessio: Yeah. And yeah, talking about XGBoost, so a lot of people in the audience might know it's a gradient boosting library, probably the most popular out there. And it became super popular because many people started using it in like machine learning competitions. And I think there's like a whole Wikipedia page of like all state-of-the-art models. They use XGBoost and like, it's a really long list. When you were working on it, so we just had Tri Dao, who's the creator of FlashAttention on the podcast. And I asked him this question, it's like, when you were building FlashAttention, did you know that like almost any Transformer-based model will use it? And so I asked the same question to you when you were coming up with XGBoost, like, could you predict it would be so popular or like, what was the creation process? And when you published it, what did you expect? We have no idea.

[00:03:41] Tianqi: Like, actually, the original reason that we built that library is that at that time, deep learning just came out. Like that was the time where AlexNet just came out.
And one of the ambitious missions that myself and my advisor, Carlos Guestrin, had then is we wanted to think about, you know, trying to test the hypothesis: Can we find alternatives to deep learning models? Because then, you know, there are other alternatives like, you know, support vector machines, linear models, and of course, tree-based models. And our question was, if you build those models and feed them with big enough data, because usually like one of the key characteristics of deep learning is that it's taking a lot

[00:04:22] Swyx: of data, right?

[00:04:23] Tianqi: So we will be able to get the same amount of performance. That's a hypothesis we're setting out to test. Of course, if you look at now, right, that's a wrong hypothesis, but as a byproduct, what we found out is that, you know, most of the gradient boosting libraries out there were not efficient enough for us to test that hypothesis. So I happen to have quite a bit of experience in the past of building gradient boosting trees and their variants. So XGBoost was kind of like a byproduct of that hypothesis testing. At that time, I'm also competing a bit in data science challenges, like I worked on KDDCup and then Kaggle kind of became bigger, right? So I kind of think maybe it's becoming useful to others. One of my friends convinced me to try to do a Python binding of it. That turned out to be like a very good decision, right, to be effective. Usually when I build it, we feel like maybe a command line interface is okay. And now we have a Python binding, we have R bindings. And then we realized, you know, it started getting interesting. People started contributing different perspectives, like visualization and so on. So we started to push a bit more on to building distributed support to make sure it works on any platform and so on. And even at that time point, when I talked to Carlos, my advisor, later, he said he never anticipated that we'll get to that level of success. And actually, why I pushed for gradient boosting trees, interestingly, at that time, he also disagreed. He thinks that maybe we should go for kernel machines then. And it turns out, you know, actually, we are both wrong in some sense, and Deep Neural Network was the king of the hill. But at least the gradient boosting direction got into something fruitful.

[00:06:01] Swyx: Interesting.

[00:06:02] Alessio: I'm always curious when it comes to these improvements, like, what's the design process in terms of like coming up with it? And how much of it is a collaborative with like other people that you're working with versus like trying to be, you know, obviously, in academia, it's like very paper-driven kind of research driven.

[00:06:19] Tianqi: I would say the XGBoost improvement at that time point was more on like, you know, I'm trying to figure out, right. But it's combining lessons. Before that, I did work on some of the other librarie
Our 3rd podcast feed swap with other AI pod friends! Check out Cognitive Revolution and Practical AI as well. NLW is the best daily AI YouTuber/podcaster with the AI Breakdown. His summaries and content curation are spot on, and he always finds the interesting angle that will keep you thinking. Subscribe to the AI Breakdown wherever fine podcasts are sold! https://pod.link/1680633614 You can also watch on YouTube.

Timestamps courtesy of summarize.tech

The hosts discuss the launch of Code Interpreter as a separate model from OpenAI and speculate that it represents the release of GPT 4.5. People have found Code Interpreter to be better than expected, even for tasks unrelated to coding. They discuss the significance of this release, as well as the challenges of evaluating AI models, the cultural mismatch between researchers and users, and the increasing value of data in the AI industry. They also touch on the impact of open-source tools, the potential of AI companions, the advantages of Anthropic compared to other platforms, advancements in image recognition and multimodality, and predictions for the future of AI.

* 00:00:00 In this section, the hosts discuss the launch of Code Interpreter from OpenAI and its significance in the development of the AI field. They explain that Code Interpreter, initially introduced as a plugin, is now considered a separate model with its own dropdown menu. They note that people have found Code Interpreter to be better than expected, even for tasks that are not related to coding. This leads them to speculate that Code Interpreter actually represents the release of GPT 4.5, as there has been no official announcement or blog post about it. They also mention that the AI safety concerns and regulatory environment may be impacting how OpenAI names and labels their models. Overall, they believe that Code Interpreter's release signifies a significant shift in the AI field and hints at the possibility of future advanced models like GPT 5.
* 00:05:00 In this section, the speaker discusses the improvements in GPT 4.5 and how it enhances the experience for non-coding queries and inputs. They explain that the code interpreter feature allows for a wider range of use cases that were not possible with previous models like GPT 3.5. Additionally, they highlight the value of the code interpreter in assisting individuals with no coding experience to solve basic coding problems. This feature is likened to having a junior developer or intern analyst that aids in conducting tests and simplifies coding tasks. The speaker emphasizes that GPT 4.5 enables users to be more productive and efficient, especially when dealing with code-related challenges. They also discuss the future direction of AGI, where more time will be dedicated to inference rather than training, as this approach has shown significant improvements in terms of problem-solving.
* 00:10:00 In this section, the speaker discusses how advanced AI models like GPT-4.5 are not just larger versions of previous models but rather employ fundamentally different techniques. They compare the evolution of AI models to the evolutionary timeline of humans, where the invention of tools opened up a whole new set of possibilities. They touch on the difficulty of evaluating AI models, particularly in more subjective tasks, and highlight how perceptions of model performance can be influenced by factors like formatting preferences.
Additionally, the speaker mentions the challenges of reinforcement learning and the uncertainty around what the model is prioritizing in its suggestions. They conclude that OpenAI, as a research lab, is grappling with the complexities of updating models and ensuring reliability for users.
* 00:15:00 In this section, the speaker discusses the cultural mismatch between OpenAI researchers and users of OpenAI's products, highlighting the conflicting statements made about model updates. They suggest that OpenAI needs to establish a policy that everyone can accept. The speaker also emphasizes the challenges of communication and the difficulty of serving different stakeholders. They mention the impact of small disruptions on workflows and the lack of immediate feedback within OpenAI's system. Additionally, the speaker briefly discusses the significance of OpenAI's custom instructions feature, stating that it allows for more personalization but is not fundamentally different from what other chat companies already offer. The discussion then transitions to Facebook's release of Llama 2, which holds significance both technically and for users, although further details on its significance are not provided in this excerpt.
* 00:20:00 In this section, the introduction of Llama 2 is discussed. Llama 2 is the first fully commercially usable GPT-3.5 equivalent model, which is a significant development because it allows users to run it on their own infrastructure and fine-tune it according to their needs. Although it is not fully open source, it presents new opportunities for various industries such as government, healthcare, and finance. The discussion also touches upon the open source aspect of Llama 2, with the recognition that it has still contributed significantly to the community, as evidenced by the three million dollars' worth of compute and the estimated 15 to 20 million dollars' worth of additional fine-tuning capabilities it brings. The conversation acknowledges the value of open source models and data, while also recognizing the challenges and complexities in striking a balance between openness and restrictions.
* 00:25:00 In this section, the discussion centers around the commoditization of compute and the increasing value of data in the AI industry. While GPU compute is currently in high demand, it is observed that data is what holds the real value in AI. The conversation touches on the history of open source models and how the release of data for models like GPT-J and GPT-Neo signals a shift towards prioritizing data over model weights. The transcript also mentions the caution around data usage, citing examples of copyright concerns with datasets like BookCorpus. The debate arises on whether ML engineers should proactively use open data or wait for permission, with some arguing for proactive usage to avoid holding back progress. The conversation also discusses the importance of terminology and protecting the definition of open source, while recognizing that the functional implications of open data are what matter most.
* 00:30:00 In this section, the conversation revolves around the impact of open-source tools on companies and how it has influenced their approach to AI development. It is noted that companies can no longer just offer a nice user interface (UI) wrapper around an open AI model, as customers are demanding more. The competition has shifted towards other aspects of productionizing AI applications, which is seen as a positive development.
The speaker predicts that OpenAI's competitive pressure will lead to opening up their source code and expects interesting advancements to emerge, such as running models locally for unlimited use. Additionally, the conversation touches on the potential of commercially available models, the application of new techniques, and the creativity unlocked by open source. The speaker also mentions the AI girlfriend economy, an area that is often overlooked but has millions of users and significant financial success.
* 00:35:00 In this section, the speaker discusses their prediction about the long-term impact of AI on interpersonal relationships, suggesting that AI companions, such as AI girlfriends or boyfriends, could help address the loneliness crisis and reduce incidents of violence. They also mention the idea of using AI models to improve social interactions and communication skills. However, they highlight that this idea of AI companions may face resistance from older generations who may struggle to accept their legitimacy. The speaker also mentions an example of using AI models to create a mental wellness product in the form of a private journal. Overall, the speaker believes that while AI companions may have potential, they may not completely replace human relationships and interactions.
* 00:40:00 In this section, the speaker discusses their views on Anthropic and the advantages it offers compared to other platforms. They mention that while Anthropic used to position themselves as the safer alternative to OpenAI, it was not appealing to many engineers. However, with the introduction of the 100K context window and the ability to upload multiple files, Anthropic has become state-of-the-art in certain dimensions, such as latency and reliability in code synthesis. The speaker also notes that some businesses are choosing to build with the Anthropic API over OpenAI due to these advantages. They believe that Anthropic is finally finding its foothold after being overshadowed by OpenAI for a long time. Additionally, the speaker discusses their experience at the Anthropic hackathon, where they saw developer excitement for the platform. They believe that Anthropic is on its way up and that it paves the way for a multi-model future. However, they also acknowledge that the odds are stacked against Anthropic and that it needs more marketing support and community buy-in. Lastly, the speaker mentions the importance of running chats side by side against different models like Claude and GPT-4.5, and highlights that in their experience, Anthropic wins about 30% of the time, making it a valuable addition to one's toolkit.
* 00:45:00 In this section, the discussion revolves around the advancements in image recognition and multimodality in language models like GPT-4.5. While there was some excitement about these developments, it was noted that relying on model updates alone may not be sufficient, and there is a need to focus on product-level improvements, such as integrating language mode
FlashAttention was first published by Tri Dao in May 2022 and it had a deep impact in the large language models space. Most open models you’ve heard of (RedPajama, MPT, LLaMA, Falcon, etc) all leverage it for faster inference. Tri came on the podcast to chat about FlashAttention, the newly released FlashAttention-2, the research process at the Hazy Research group, and more.

This is the first episode of our “Papers Explained” series, which will cover some of the foundational research in this space. Our Discord also hosts a weekly Paper Club, which you can sign up for here.

How does FlashAttention work?

The paper is titled “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. There are a couple keywords to call out:

* “Memory Efficient”: standard attention memory usage is quadratic with sequence length (i.e. O(N^2)). FlashAttention is sub-quadratic at O(N).
* “Exact”: the opposite of “exact” in this case is “sparse”, as in “sparse networks” (see our episode with Jonathan Frankle for more). This means that you’re not giving up any precision.
* The “IO” in “IO-Awareness” stands for “Input/Output” and hints at a write/read related bottleneck.

Before we dive in, it helps to picture the GPU’s memory hierarchy. The GPU has access to three memory stores at runtime:

* SRAM: this is on-chip memory co-located with the actual execution core. It’s limited in size (~20MB on an A100 card) but extremely fast (19TB/s total bandwidth)
* HBM: this is off-chip but on-card memory, meaning it’s in the GPU but not co-located with the core itself. An A100 has 40GB of HBM, but only a 1.5TB/s bandwidth.
* DRAM: this is your traditional CPU RAM. You can have TBs of this, but you can only get ~12.8GB/s bandwidth, which is way too slow.

Now that you know what HBM is, consider how the standard attention algorithm is implemented: compute the scores S = QK^T, then the weights P = softmax(S), then the output O = PV. All 3 steps include a “write X to HBM” step and a “read from HBM” step. The core idea behind FlashAttention boils down to this: instead of storing each intermediate result, why don’t we use kernel fusion and run every operation in a single kernel in order to avoid memory read/write overhead? (We also talked about kernel fusion in our episode with George Hotz and how PyTorch / tinygrad take different approaches here)

The resulting fused algorithm is much faster, but much harder to read. The upshot is that FlashAttention is a very meaningful speed improvement on traditional Attention, and it’s easy to understand why it’s becoming the standard for most models.

This should be enough of a primer before you dive into our episode! We talked about FlashAttention-2, how the Hazy Research Group works, and some of the research being done in Transformer alternatives.

Show Notes:

* FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv)
* FlashAttention-2
* Together AI
* From Deep Learning to Long Learning
* The Hardware Lottery by Sara Hooker
* Hazy Research
* Is Attention All You Need?
* Nvidia CUTLASS 3
* SRAM scaling slows
* Transformer alternatives:
* S4
* Hyena
* Recurrent Neural Networks (RNNs)

Timestamps:

* Tri's background [00:00:00]
* FlashAttention’s deep dive [00:02:18]
* How the Hazy Research group collaborates across theory, systems, and applications [00:17:21]
* Evaluating models beyond raw performance [00:25:00]
* FlashAttention-2 [00:27:00]
* CUDA and The Hardware Lottery [00:30:00]
* Researching in a fast-changing market [00:35:00]
* Promising transformer alternatives like state space models and RNNs [00:37:30]
* The spectrum of openness in AI models [00:43:00]
* Practical impact of models like LLAMA2 despite restrictions [00:47:12]
* Incentives for releasing open training datasets [00:49:43]
* Lightning Round [00:53:22]

Transcript:

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. Today we have no Swyx, because he's in Singapore, so it's a one-on-one discussion with Tri Dao. Welcome!

[00:00:24] Tri: Hi everyone. I'm Tri Dao, excited to be here.

[00:00:27] Alessio: Tri just completed his PhD at Stanford a month ago. You might not remember his name, but he's one of the main authors of the FlashAttention paper, which is one of the seminal works in the Transformers era. He's got a lot of interests: efficient transformer training and inference, long-range sequence models, a lot of interesting stuff. And now you're going to be an assistant professor in CS at Princeton next year.

[00:00:51] Tri: Yeah, that's right.

[00:00:52] Alessio: Yeah. And in the meantime, just to get, you know, a low pressure thing, you're Chief Scientist at Together as well, which is the company behind RedPajama.

[00:01:01] Tri: Yeah. So I just joined this week actually, and it's been really exciting.

[00:01:04] Alessio: So what's something that is not on the internet that people should know about you?

[00:01:09] Tri: Let's see. When I started college, I was going to be an economist, so I was fully on board. I was going to major in economics, but the first week I was at Stanford undergrad, I took a few math classes and I immediately decided that I was going to be a math major. And that kind of changed the course of my career. So now I'm doing math, computer science, AI research.

[00:01:32] Alessio: I had a similar thing. I started with physics and then I took like a programming course and I was like, I got to do computer science. I don't want to do physics. So FlashAttention is definitely, everybody's using this. Everybody loves it. You just released FlashAttention 2 last week.

[00:01:48] Tri: Yeah. Early this week on Monday. Yeah.

[00:01:53] Alessio: You know, AI time. Things move fast. So maybe let's run through some of the FlashAttention highlights, some of the innovation there, and then we can dive into FlashAttention 2. So the core improvement in FlashAttention is that traditional attention is quadratic in sequence length, and FlashAttention is linear, which obviously helps with scaling some of these models.

[00:02:18] Tri: There are two factors there. So of course the goal has been to make attention go faster or more memory efficient. And ever since attention became popular in 2017 with the Transformer paper, lots and lots of folks have been working on this. And a lot of approaches has been focusing on approximating attention. The goal is you want to scale to longer sequences. There are tons of applications where you want to do that.
But scaling to longer sequences is difficult because attention scales quadratically in sequence length on both runtime and memory, as you mentioned. So instead of trying to approximate attention, we were trying to figure out, can we do the same computation and maybe be more memory efficient? So in the end, we ended up being the memory is linear in sequence length. In terms of computation, it's still quadratic, but we managed to make it much more hardware friendly. And as a result, we do get wall clock speed up on the order of 2 to 4x, which really helps because that just means that you'll be able to train with 2 to 4x longer sequence length for the same cost without doing any approximations. As a result, lots of folks have been using this. The thing is available in a lot of libraries that do language model training or fine tuning.

[00:03:32] Alessio: And the approximation thing is important because this is an exact thing versus a sparse. So maybe explain a little bit the difference there.

[00:03:40] Tri: For sure. So in attention, essentially you compute pairwise similarity between every single element in a sequence against each other. So there's been other approaches where instead of doing all that pairwise computation, you only compute similarity for some pairs of elements in the sequence. So you don't do a quadratic number of comparisons. And this can be seen as some form of sparsity. Essentially you're ignoring some of the elements. When you write down the matrix, you essentially say, OK, I'm going to pretend they're zero. So that has some benefits in terms of runtime and memory. But the trade-off is that it tends to do worse in terms of quality because you're essentially approximating or ignoring some elements. And I personally have worked on this as well for a few years. But when we talk to practitioners who actually train models, especially at large scale, they say they tend not to use these approximate attention methods. Because it turns out, this was surprising to me at the time, was that these approximation methods, even though they perform fewer computation, they tend to not be faster in wall-clock time. So this was pretty surprising because back then, I think my background was more on the theoretical side. So I was thinking of, oh, how many flops or floating point operations are you performing? And hopefully that correlates well with wall-clock time. But I realized that I was missing a bunch of ideas from the system side where flops or floating point operations don't necessarily correlate with runtime. There are other factors like memory reading and writing, parallelism, and so on. So I learned a ton from just talking to systems people because they kind of figured this stuff out a while ago. So that was really eye-opening. And then we ended up focusing a lot more on memory reading and writing because that turned out to be the majority of the time when you're doing attention is reading and writing memory.

[00:05:34] Alessio: Yeah, the I.O. awareness is probably one of the biggest innovations here. And the idea behind it is, like you mentioned, the FLOPS growth of the cards has been going up, but the memory bandwidth, not as much. So I think maybe that was one of the assumptions that the original attention paper had. So talk a bit about how that came to be as an idea. It's one of those things that like in hindsight, it's like, obviously, why are we like rewriting to like HBM every time, you know, and like once you change it,
As first discussed on our May Emergency pod and leaked 4 days ago, Llama (renamed from LLaMA) was upgraded to Llama 2 (pretraining on 2 trillion tokens with 2x the context length - bigger than any dataset discussed in Datasets 101, and adding ~$20m of RLHF/preference annotation) and released for commercial use on 18 July. It immediately displaced Falcon-40B as the leading open LLM and was immediately converted/quantized to GGML and other formats. Llama 2 seems to outperform all other open source models in their equivalent weight class.

Why are open models important?

The intersection of Open Source and AI is one of the oldest themes on this publication, and there has been a raging debate on the security and reliability of the OpenAI models and APIs. Users have reported GPT-4’s quality going down (repeatedly denied, though as of today there is some supporting data from Databricks) and complained about API reliability and rapid deprecation schedules. Last and surely the biggest, there are entire classes of businesses and government/healthcare/military organizations that categorically cannot send any of their sensitive data to an external API provider, even if it is OpenAI through Azure. The only way to have total control is to own and serve your own models, which Llama 2 now pushes forward in terms of the state of the art (your own GPT3.5-quality model, though it is nowhere near Claude 2 or GPT-4).

As we do with breaking news, we got on to Twitter Spaces again to chat with two scheduled guests:

* Nathan Lambert, ML Researcher at Huggingface and author of Interconnects, who had the best summary of the Llama 2 paper
* Matt Bornstein, organizer of the a16z infra team that launched Llama2.ai (source here) and has been coding up a storm with AI demo apps, unusual for VCs

as well as Anton Troynikov of Chroma, Russell Kaplan of Scale AI, and Omar Qazi of the Whole Mars Catalog. Enjoy!

Show Notes

* Official links: Website, Paper, GitHub (Llama 2 commit), Azure Partnership, Use policy, Statement of Support for Open Approach
* Where to try: Llama2.ai (source), Perplexity Llama Chat, Live playground/API on Replicate, deploy all versions on Baseten, https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
* Dev ports - simonw llm-replicate, ggml using llama.cpp (7B, 13B) or pinokio, ollama, Core ML port
* Timeline:
* 24 Feb - LLaMA 1 announced
* 6 May - our No Moats podcast - first mention of Zuck opening up Llama
* 14 July - Llama 2 leaked
* 18 July - Llama 2 announced
* Community notes:
* Nathan’s research paper recap
* 638 LOC, 4 dependencies
* Usage restrictions - MAU restriction, derivative models
* Grouped Query Attention
* System prompt
* 2 trillion token dataset
* >$20m price tag (rlhf, jimfan)
* Separate models for safety and helpfulness (jimfan)
* Mistral AI founders left out of paper
* Interesting fails:

Timestamps

* [00:02:30] Introducing the speakers
* [00:03:32] Nathan Lambert intro
* [00:04:48] General Summary of Llama 2
* [00:05:57] Sarah Silverman killed Dataset Transparency?
* [00:08:48] Simon's Recap of Llama 2
* [00:11:43] Matt's Intro
* [00:12:59] a16z Infra's new AI team?
* [00:15:10] Alessio's recap of Llama 2
* [00:17:26] Datasets 101 Followup
* [00:18:14] Context Length 4k
* [00:20:35] Open-ish Source? Usage Policy and Restrictions
* [00:23:38] Huggingface Responsible AI License
* [00:24:57] Pretraining Llama 2 Base Model beyond Chinchilla
* [00:29:55] Llama 2 is incomplete?
Race to publish
* [00:31:40] Come for the Llama, stay for the (Meta) drama
* [00:33:22] Language Translation
* [00:35:10] Llama 2's coding abilities
* [00:35:59] Why we want to know about the training data
* [00:37:45] The importance of Meta pushing forward Truly Open AI
* [00:40:59] Llama 2 as Enabler of Startups
* [00:43:59] Where you can try Llama 2
* [00:44:25] Do you need dataset transparency if you have evals?
* [00:45:56] >$20m cost of Llama 2 is primarily preference data collection
* [00:48:59] Do we even need human annotators?
* [00:49:42] Models Rating Models
* [00:53:32] How to get Code preference data
* [00:54:34] Llama 2 Finetuning Ecosystem
* [00:56:32] Hey Apple: Llama2 on Metal pls
* [00:57:17] Llama 2 and Chroma
* [01:00:15] Open Source MoE model?
* [01:00:51] Llama 2 using tools
* [01:01:40] Russell Kaplan on Scale AI's Llama 2 plans
* [01:03:31] Scale annotating code?
* [01:04:36] Immortality
* [01:04:59] Running Llama on your phone
* [01:06:54] Sama
* [01:10:58] Meta "Open Source" Leadership
* [01:11:56] Prediction: Finetuning => New Use Cases from Internal State
* [01:13:54] Prediction: Llama Toolformer
* [01:14:39] Prediction: Finetune-for-everything
* [01:15:50] Predictions: Llama Agents
* [01:16:35] dP(Doom)?
* [01:19:21] Wrapping up

Transcript

[00:00:00] Introducing the speakers

[00:00:00] Alessio Fanelli: There's not a single dull day in this space. I think when we started the podcast in January, a lot of people asked us, how long can you really do this? Just focusing on AI research and, and models. And I think the, the answer is clear now. A long time. So excited for this and excited to have Simon again. You're basically an honorary guest host of all of our Twitter spaces. Cool. Thank you.

[00:00:21] Simon Willison: No, it's great to be here again.

[00:00:23] Alessio Fanelli: And Nathan, thanks for joining us. Actually shared your writeup on Llama 2 technical details with Swyx this morning. So it's great to have you here to dive into some of the details.

[00:00:33] Nathan Lambert: Yeah, sounds good. As probably clear, Huggingface was trying to collaborate on releasing the model on the platform. So we ended up getting some early details, which made it a lot easier for me to cram study before the chaos hit.

[00:00:48] Alessio Fanelli: No, that's great. It's kind of what happened with the code interpreter episode when Sean and I had access for about five hours and Simon was like, I've been playing with this for weeks and had all the, the insight scoops. So I think this will be a, a good episode.

[00:01:02] Nathan Lambert intro

[00:01:02] Alessio Fanelli: Maybe Nathan, you just want to give people a little bit of background on what you do at Hugging Face and yeah, your experience with the Llama 2 kinda preview. Yeah. So

[00:01:12] Nathan Lambert: I've been a researcher and helping lead reinforcement learning from human feedback efforts at Hugging Face, which really means I do some research and I try to figure out how to fine tune models to do what people want. Generally we're trying to operate in the scale a little bit smaller than what Meta is doing cuz we obviously don't have that kind of resources at a startup. So I do a lot of technical research and also try to actually engage and communicate that with the community and specifically, Llama, I think I was most interested on kind of the research side.
[00:01:48] I think the paper is a phenomenal artifact and it's clear that the model is really strong in a lot of areas. And then kind of the big picture trends of where open source is going. Like this is a clear step in a direction that a lot of people wanted, but weren't sure if it was gonna happen. Yep.

[00:02:04] Alessio Fanelli: What are some of the things that stood out to you? I think to a lot of the AI engineers audience that we have, they're not as deep into the details of the papers. We'd love to get a read from somebody like you who's at a much deeper, you know, model research level.

[00:02:18] General Summary of Llama 2

[00:02:18] Nathan Lambert: Yeah. It's like, where do I start? So I think as a general summary, the paper includes a lot of details on methodology. So like, what are the things that they did in their stack to build, to actually run this? And it misses a lot of details on what a specific data set actually looks like. It's clear that they have a really fine-tuned data set and they paid a lot of money for these data sets. I think, like, it seems like now both Surge and Scale are claiming some part in it, which I find hilarious cause it's really unclear - they are two of probably the biggest data labeling firms. So they kind of took the approach, Meta took the approach of starting with open source preference data and then added a lot onto it. And the most interesting part to me on this preference data, which is a new technical approach, is they trained two preference models, two reward models, one toward making the model helpful and one for making the model safe. And then in terms of open source models, it's clearly more performant on kind of ground root benchmarks and then it's safer.

[00:03:27] Sarah Silverman killed Dataset Transparency?

[00:03:27] swyx: That's where I was

[00:03:28] Simon Willison: gonna wrap up to clarify, right. This is a big difference from the first LLaMA paper. Cause the first LLaMA paper was very, was so detailed in terms of how the training data worked, that people were able to essentially replicate it. And so you're saying that this new paper, there's, there's much less transparency as to how the training worked

[00:03:45] Nathan Lambert: on the data side. Yeah, I think they, they did a lot of new methodological things too, so taking the time to explain that - it is not as much of a data focused paper. There's no table that is like, this is what the distribution of pre-training data came from. I would guess that it's a similar data set to the original LLaMA, with the kind of, they mentioned like one of the details that's really interesting is that they mentioned they upweight high factuality content. So things that probably seem like Wikipedia - it seems like they're doing some sort of up-ranking during base model training, but they did some type of thing they didn't detail

[00:04:24] swyx: because it's also

[00:04:25] Simon Willison: worth mentioning, I mean, they're being
In April, we released our first AI Fundamentals episode: Benchmarks 101. We covered the history of benchmarks, why they exist, how they are structured, and how they influence the development of artificial intelligence. Today we are (finally!) releasing Datasets 101! We're really enjoying doing this series despite the work it takes - please let us know what else you want us to cover! Stop me if you've heard this before: "GPT3 was trained on the entire Internet". Blatantly, demonstrably untrue: the GPT3 dataset is a little over 600GB, primarily Wikipedia, book corpora, WebText and 2016-2019 CommonCrawl. The MacBook Air I am typing this on has more free disk space than that. In contrast, the "entire internet" is estimated to be 64 zettabytes, or 64 trillion GB. So it's more accurate to say that GPT3 was trained on roughly 0.000000001% of the Internet (the arithmetic is sketched below). Why spend $5m on GPU time training on data that would cost about $50 to store? Simple: Garbage in, garbage out. No matter how good your algorithms, no matter how much money/compute you have, your model quality is strongly determined by the data you train it on, and research scientists think we simply don't have (or need) that much high-quality data. We spend an enormous amount of effort throwing out data to keep the quality high, and recently Web 2.0-era UGC platforms like StackOverflow, Reddit, and Twitter have clamped down on their APIs as they realize the goldmines they sit on. Data is the new new oil. Time for a primer! Show Notes * Our 2 months worth of podcast prep notes! * The Token Crisis paper * Ilya Sutskever on datasets * OpenAI Tokenizer * Kaplan Scaling Laws Lecture * Chinchilla Paper * Sasha Rush's Tweet * Karpathy's Build Conference Presentation * LIMA Paper * Phi-1 by Microsoft * Washington Post Article on datasets * Our episode with Jonathan Frankle * Our episode with Mike Conover * BloombergGPT * Datasets * HuggingFace Hub * CommonCrawl, Overview * C4 * List of Dirty, Naughty, Obscene, and Otherwise Bad Words * OpenWebText * books3 * OpenAssistant * The Stack * The Pile * LAION * Audio: * LibriSpeech: A dataset of audio recordings of audiobooks * CommonVoice: A dataset of audio recordings of people speaking different languages * Voxforge: A dataset of audio recordings of people speaking different languages * Switchboard: A dataset of audio recordings of telephone conversations * Fisher Corpus: A dataset of audio recordings of conversational telephone speech * Chinese: * CMRC (Chinese Machine Reading Comprehension 2018) * DuReader * ChID * Copyright & Privacy: * https://stablediffusionlitigation.com/ * https://haveibeentrained.com/ * https://githubcopilotlitigation.com/ * https://twitter.com/moyix/status/1662131770463072257 * OpenAI Opt Out Process * Check if you're in The Stack * Deduplication * Deduplicating Training Data Makes Language Models Better * Deduplicating Training Data Mitigates Privacy Risks in Language Models * Contamination * CodeForces example This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
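For the curious, the napkin math above fits in a few lines of Python (a quick sketch; the 600GB and 64-zettabyte figures are the same estimates quoted above):

```python
# Back-of-envelope: what fraction of the "entire internet" did GPT3 actually see?
gpt3_dataset_bytes = 600e9   # ~600 GB, the dataset size estimate above
internet_bytes = 64e21       # ~64 zettabytes, the internet size estimate above

fraction = gpt3_dataset_bytes / internet_bytes
print(f"{fraction:.1e}")            # ~9.4e-12
print(f"{fraction * 100:.10f}%")    # ~0.0000000009%, about a billionth of a percent
```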
Code Interpreter is GA! As we do with breaking news, we convened an emergency pod and >17,000 people tuned in, by far our biggest ever. This is a 2-for-1 post - a longform essay with our trademark executive summary and core insights - and a podcast capturing day-after reactions. Don't miss either of them! Essay and transcript: https://latent.space/p/code-interpreter Podcast Timestamps [00:00:00] Intro - Simon and Alex [00:07:40] Code Interpreter for Edge Cases [00:08:59] Code Interpreter's Dependencies - Tesseract, Tensorflow [00:09:46] Code Interpreter Limitations [00:10:16] Uploading Deno, Lua, and other Python Packages to Code Interpreter [00:11:46] Code Interpreter Timeouts and Environment Resets [00:13:59] Code Interpreter for Refactoring [00:15:12] Code Interpreter Context Window [00:15:34] Uploading git repos [00:16:17] Code Interpreter Security [00:18:57] Jailbreaking [00:19:54] Code Interpreter cannot call GPT APIs [00:21:45] Hallucinating Lack of Capability [00:22:27] Code Interpreter Installed Libraries and Capabilities [00:23:44] Code Interpreter generating interactive diagrams [00:25:04] Code Interpreter has Torch and Torchaudio [00:25:49] Code Interpreter for video editing [00:27:14] Code Interpreter for Data Analysis [00:28:14] Simon's Whole Foods Crime Analysis [00:31:29] Code Interpreter Network Access [00:33:28] System Prompt for Code Interpreter [00:35:12] Subprocess run in Code Interpreter [00:36:57] Code Interpreter for Microbenchmarks [00:37:30] System Specs of Code Interpreter [00:38:18] PyTorch in Code Interpreter [00:39:35] How to obtain Code Interpreter RAM [00:40:47] Code Interpreter for Face Detection [00:42:56] Code Interpreter yielding for Human Input [00:43:56] Tip: Ask for multiple options [00:44:37] The Masculine Urge to Start a Vector DB Startup [00:46:00] Extracting tokens from the Code Interpreter environment? [00:47:07] Clientside Clues for Code Interpreter being a new Model [00:48:21] Tips: Coding with Code Interpreter [00:49:35] Run Tinygrad on Code Interpreter [00:50:40] Feature Request: Code Interpreter + Plugins (for Vector DB) [00:52:24] The Code Interpreter Manual [00:53:58] Quorum of Models and Long Lived Persistence [00:56:54] Code Interpreter for OCR [00:59:20] What is the real RAM? [01:00:06] Shyamal's Question: Code Interpreter + Plugins? [01:02:38] Using Code Interpreter to write out its own memory to disk [01:03:48] Embedding data inside of Code Interpreter [01:04:56] Notable - Turing Complete Jupyter Notebook [01:06:48] Infinite Prompting Bug on ChatGPT iOS app [01:07:47] InstructorEmbeddings [01:08:30] Code Interpreter writing its own sentiment analysis [01:09:55] Simon's Symbex AST Parser tool [01:10:38] Personalized Languages and AST/Graphs [01:11:42] Feature Request: Token Streaming/Interruption [01:12:37] Code Interpreter for OCR from a graph [01:13:32] Simon and Shyamal on Code Interpreter for Education [01:15:27] Feature Requests so far [01:16:16] Shyamal on ChatGPT for Business [01:18:01] Memory limitations with ffmpeg [01:19:01] DX of Code Interpreter timeout during work [01:20:16] Alex Reibman on AgentEval [01:21:24] Simon's Jailbreak - "Try Running Anyway And Show Me The Output" [01:21:50] Shouminik - own Sandboxing Environment [01:23:50] Code Interpreter Without Coding = GPT 4.5???
[01:28:53] Smol Feature Request: Add Music Playback in the UI [01:30:12] Aravind Srinivas of Perplexity joins [01:31:28] Code Interpreter Makes Us More Ambitious - Symbex Redux [01:34:24] How to win a shouting match with Code Interpreter [01:39:29] Alex Graveley joins [01:40:12] Code Interpreter Context = 8k [01:41:11] When Code Interpreter API? [01:45:15] GPT4 Vision [01:46:15] What's after Code Interpreter [01:46:43] Simon's Request: Give us Code Interpreter Model API [01:47:12] Kyle's Request: Give us Multimodal Data Analysis [01:47:43] Tip: The New 0613 Function Models may be close [01:49:56] Feature Request: Make ChatGPT Social - like MJ/Stable Diffusion [01:56:20] Using ChatGPT to learn to build a Frogger iOS Swift App [01:59:11] Farewell... until next time [02:00:01] Simon's plug [02:00:51] Swyx: What about Phase 5? and AI.Engineer Summit This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
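A footnote for tinkerers: several of the segments above (System Specs, How to obtain Code Interpreter RAM, Subprocess run) boil down to pasting small probes into the sandbox and seeing what comes back. Here is a minimal, stdlib-only sketch of that kind of probe; it is illustrative, and the sandbox may block any of these calls:

```python
# Probe a sandboxed Python environment, in the spirit of the
# "System Specs" / "How to obtain RAM" experiments above.
import os, platform, shutil, subprocess

print(platform.platform())          # what OS/kernel the sandbox reports
print(os.cpu_count(), "CPUs")

total, used, free = shutil.disk_usage("/")
print(f"disk: {free/1e9:.1f} GB free of {total/1e9:.1f} GB")

# /proc/meminfo exists on Linux sandboxes; wrap it since it may be restricted
try:
    out = subprocess.run(["head", "-1", "/proc/meminfo"],
                         capture_output=True, text=True)
    print(out.stdout.strip())       # e.g. "MemTotal: ... kB"
except Exception as e:
    print("memory probe blocked:", e)
```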
Part 2 of our podcast feed swap weekend! Check out Cognitive Revolution as well. "Data" Dan Whitenack has been co-host of the Practical AI podcast for the past 5 years, covering the full journey of the modern AI wave post-Transformers. He joined us in studio to talk about their origin story and highlight key learnings from past episodes, riff on the AI trends we are all seeing as AI practitioner-podcasters, and his passion for low-resource-everything! Subscribe on the Changelog, RSS, Apple Podcasts, Twitter, Mastodon, and wherever fine podcasts are sold! Show notes * Daniel Whitenack – Twitter, GitHub, Website * Featured Latent Space episodes: * Benchmarks * Reza Shabani * MosaicML and MPT * Segment Anything * Mike Conover * Featured Practical AI episodes: * From notebooks to Netflix scale with Metaflow * Capabilities of LLMs 🤯 * ML at small organizations * Prediction Guard * Data Dan Timestamps * 00:00 Welcome to Practical AI * 01:16 Latent Space Podcast * 04:00 Practical AI Podcast * 06:20 Prediction Guard * 08:05 Daniel's favorite episodes * 10:21 Alessio's favorite episode * 10:54 Swyx's favorite episode * 12:44 Listener favorites * 15:14 LLMOps * 17:06 Reza Shabani * 19:06 Benchmarks 101 * 20:06 Roboflow * 21:38 Mode collapse * 26:21 Rajiv Shah * 28:01 Staying on top of things * 33:11 Kirsten Lum * 34:31 datadan.io * 38:48 Prompt engineering * 40:38 Unique challenges engineers face * 42:51 AI-UX * 45:31 NLP data sets * 50:49 Unlabeled data sets * 55:07 Lightning round! * 55:20 What's already happened in AI? * 56:27 Unsolved questions in AI * 58:01 Get hands on * 58:53 Outro Transcript Full transcript is over at the Changelog site! This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Thanks to the over 1m people that have checked out the Rise of the AI Engineer. It's a long July 4 weekend in the US, and we're celebrating with a podcast feed swap! We've been big fans of Nathan Labenz and Erik Torenberg's work at the Cognitive Revolution podcast for a while, which started around the same time as we did and has done an incredible job of hosting discussions with top researchers and thinkers in the field, with a wide range of topics across computer vision (a special focus thanks to Nathan's work at Waymark), GPT-4 (with exceptional insight due to Nathan's time on the GPT-4 "red team"), healthcare/medicine/biotech (Harvard Medical School, Med-PaLM, Tanishq Abraham, Neal Khosla), investing and tech strategy (Sarah Guo, Elad Gil, Emad Mostaque, Sam Lessin), safety and policy, curators and influencers and exceptional AI founders (Josh Browder, Eugenia Kuyda, Flo Crivello, Suhail Doshi, Jungwon Byun, Raza Habib, Mahmoud Felfel, Andrew Feldman, Matt Welsh, Anton Troynikov, Aravind Srinivas). If Latent Space is for AI Engineers, then Cognitive Revolution covers the much broader field of AI in tech, business and society at large, with a longer runtime to go deep on research papers like TinyStories. We hope you love this episode as much as we do, and check out CogRev wherever fine podcasts are sold! Subscribe to the Cognitive Revolution on: * Website * Apple Podcasts * Spotify * Youtube Good Data is All You Need The work of Ronen and Yuanzhi echoes a broader theme emerging in the midgame of 2023: * Falcon-40B (trained on 1T tokens) outperformed LLaMA-65B (trained on 1.4T tokens), primarily due to the RefinedWeb Dataset that runs CommonCrawl through extensive preprocessing and cleaning in their MacroData Refinement pipeline. * UC Berkeley LMSYS's Vicuna-13B is near GPT-3.5/Bard quality at a tenth of their size, thanks to fine-tuning from 70k user-highlighted ChatGPT conversations (indicating some amount of quality). * Replit's finetuned 2.7B model outperforms the 12B OpenAI Codex model based on HumanEval, thanks to high quality data from Replit users The path to smaller models leans on better data (and tokenization!), whether from cleaning, from user feedback, or from synthetic data generation, i.e. finetuning on high-quality outputs from larger models (a toy sketch of this recipe follows at the end of this entry). TinyStories and Phi-1 are the strongest new entries in that line of work, and we hope you'll pick through the show notes to read up further. Show Notes * TinyStories (Apr 2023) * Paper: TinyStories: How Small Can Language Models Be and Still Speak Coherent English? * Internal presentation with Sebastien Bubeck at MSR * Twitter thread from Ronen Eldan * Will future LLMs be based almost entirely on synthetic training data? In a new paper, we introduce TinyStories, a dataset of short stories generated by GPT-3.5&4. We use it to train tiny LMs (* Phi-1 (Jun 2023) * Paper: Textbooks are all you need (HN discussion) * Twitter announcement from Sebastien Bubeck: * phi-1 achieves 51% on HumanEval w. only 1.3B parameters & 7B tokens training dataset and 8 A100s x 4 days = 800 A100-hours. Any other >50% HumanEval model is >1000x bigger (e.g., WizardCoder from last week is 10x in model size and 100x in dataset size). This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
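To make the synthetic-data point concrete, here is a toy sketch of the TinyStories-style recipe: have a strong teacher model generate simple, vocabulary-constrained stories, then train a tiny model on the resulting corpus. The prompt wording and the `generate_story` stub are our illustration, not the paper's actual pipeline:

```python
# Toy TinyStories-style synthetic data loop: a strong "teacher" LLM writes
# simple stories; the resulting corpus is used to train a tiny "student" LM.
import json
import random

THEMES = ["a lost dog", "a rainy day", "a red balloon"]

def make_prompt(theme: str) -> str:
    # Constrain vocabulary and complexity so a tiny model can learn from it
    return ("Write a short story that a 4-year-old would understand, "
            f"using only very simple words, about {theme}.")

def generate_story(prompt: str) -> str:
    # Stand-in for a call to your teacher model of choice (GPT-3.5/4 in the paper)
    return "Once upon a time, there was a small dog named Spot..."

with open("tiny_stories.jsonl", "w") as f:
    for _ in range(1000):
        story = generate_story(make_prompt(random.choice(THEMES)))
        f.write(json.dumps({"text": story}) + "\n")
# A small LM pretrained on this corpus is what TinyStories evaluates.
```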
We are now launching our dedicated new YouTube and Twitter! Any help in amplifying our podcast would be greatly appreciated, and of course, tell your friends! Notable followon discussions collected on Twitter, Reddit, Reddit, Reddit, HN, and HN. Please don't obsess too much over the GPT4 discussion as it is mostly rumor; we spent much more time on tinybox/tinygrad, on which George is the foremost authority! We are excited to share the world's first interview with George Hotz on the tiny corp! If you don't know George, he was the first person to unlock the iPhone, jailbreak the PS3, went on to start Comma.ai, and briefly "interned" at the Elon Musk-run Twitter. Tinycorp is the company behind the deep learning framework tinygrad, as well as the recently announced tinybox, a new $15,000 "luxury AI computer" aimed at local model training and inference, aka your "personal compute cluster": * 738 FP16 TFLOPS * 144 GB GPU RAM * 5.76 TB/s RAM bandwidth * 30 GB/s model load bandwidth (big llama loads in around 4 seconds) * AMD EPYC CPU * 1600W (one 120V outlet) * Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks) (In the episode, we also talked about the future of the tinybox as the intelligence center of every home that will help run models, at-home robots, and more. Make sure to check the timestamps 👀 ) The tiny corp manifesto There are three main theses to tinycorp: * If XLA/PrimTorch are CISC, tinygrad is RISC: CISC (Complex Instruction Set Computing) architectures have complex instruction sets, where a single instruction can execute many low-level operations. RISC (Reduced Instruction Set Computing) instruction sets are smaller, and only let you execute a single low-level operation per instruction, leading to faster and more efficient instruction execution. If you've used Apple Silicon M1/M2 or a Raspberry Pi, you've used a RISC computer. * If you can't write a fast ML framework for GPU, you can't write one for your own chip: there are many "AI chips" companies out there, and they all started by taping out the chip. Some of them like Cerebras are still building, while others like Graphcore seem to be struggling. But building chips with higher TFLOPS isn't enough: "There's a great chip already on the market. For $999, you get a 123 TFLOP card with 24 GB of 960 GB/s RAM. This is the best FLOPS per dollar today, and yet…nobody in ML uses it.", referring to the AMD RX 7900 XTX. NVIDIA's lead is not only thanks to high-performing cards, but also thanks to a great developer platform in CUDA. Starting with the chip development rather than the dev toolkit is much more cost-intensive, so tinycorp is starting by writing a framework for off-the-shelf hardware rather than taping out their own chip. * Turing completeness considered harmful: Once you call into Turing-complete kernels, you can no longer reason about their behavior. Since they have to be able to execute any instruction, they are much more complex. To optimize the performance of Turing-complete kernels, you fall back to caching, warp scheduling, and branch prediction. Since neural networks only need ADD/MUL operations and only rely on static memory accesses, there's no need to have Turing completeness. This design decision allows tinygrad to optimize instructions at a much lower level (a toy sketch of this lazy, fused style follows below). As you might have guessed, CUDA is Turing-complete; this is one of the main differences that tinycorp wants to leverage to be competitive. All that — covered in the first 10 minutes of our discussion. George came ready to go deep, so we went for it.
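To preview the laziness and operation-fusing discussion below: because networks are just ADD/MUL over statically known shapes, a framework can record ops lazily and fuse a whole elementwise chain into one pass over memory. A toy illustration of the idea, not tinygrad's actual internals:

```python
# Toy lazy evaluation: ops build up a computation instead of running it,
# and an elementwise chain like a*x + b fuses into a single loop at realize().
class Lazy:
    def __init__(self, fn):
        self.fn = fn  # index -> float, composed lazily

    @staticmethod
    def from_list(xs):
        return Lazy(lambda i: xs[i])

    def __add__(self, other):
        return Lazy(lambda i: self.fn(i) + other.fn(i))  # no work done yet

    def __mul__(self, other):
        return Lazy(lambda i: self.fn(i) * other.fn(i))  # no work done yet

    def realize(self, n):
        # One fused pass: no intermediate buffer is materialized for a*x
        return [self.fn(i) for i in range(n)]

a = Lazy.from_list([1.0, 2.0, 3.0])
x = Lazy.from_list([4.0, 5.0, 6.0])
b = Lazy.from_list([0.5, 0.5, 0.5])
print((a * x + b).realize(3))  # [4.5, 10.5, 18.5]
```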
Some of the other technical questions we went through: * Laziness: why laziness is important and how operation fusing can help with memory efficiency * Debugging & CI: Why great developer experience is a priority in tinygrad * Quantization: what's the right level of quantization, how lossless are these transformations, his quick takes on Mojo and ggml, and why fp16 is the target for their out-of-the-box LLaMA. * Building rigs for individual use: we talked a bit about the design tradeoffs of building these machines with low noise and a single power plug, the difference that PCIe 4 vs 3 makes, and more. The "personal compute cluster" is $15,000, but for businesses interested in local training and inference, George also estimates that he will be able to build you an H100-class GPU that is 5-10x faster (than an H100) for the same price. Misc: Bitter Lessons, Core Insights, Remote Work Outside of tiny, we also talked about one of George's favorite units of measure: "a person of compute". Much of the AGI talk has been benchmark-driven, but looking at it from a compute-throughput perspective can also be interesting. One person of compute is roughly 20 PFLOPS (64 A100s, or a single dense 42U A100 rack); one A100 is ~$10-15,000, so the GPUs by themselves will come out at $640,000-$1,000,000. We also covered a wide range of topics, including his self-analysis on GPT-4, Elon Musk, Remote Work, Computer Vision and the Comma Body, and life above/below the API (and above/below the Kanban board). See show notes and timestamps for more! Show Notes * "Unlocked iPhone Traded for Nissan 350Z" * "Unlocked iPhone" on YouTube (August 21st, 2007) * "The Light It Up Contest" on YouTube (February 13th, 2011) * Comma.ai * NHTSA cease and desist * The Hero's Journey * The Portal Story * A Person of Compute * Above / Below the API Line (swyx take) * The Bitter Lesson * The Goddess of Everything Else (listen to George read it) * Meditations on Moloch * George's email to Lisa Su, AMD's CEO: Timestamps * [00:00:00] Intros & tinygrad's "Portal Story" * [00:03:00] Thesis #1 * [00:03:50] Thesis #2 * [00:05:00] Thesis #3 + Turing completeness discussion * [00:10:00] tinygrad's creation and core ideas * [00:16:00] Operation fusing in tinygrad * [00:17:00] Debugging & profiling in tinygrad * [00:18:30] Tinygrad vs Pytorch competitiveness * [00:20:30] geohot vs AMD * [00:25:00] On ggml * [00:26:00] Tinygrad's CI philosophy * [00:26:30] On Mojo * [00:28:00] ggml quantization is made up * [00:31:00] Work for tiny: benchmark int8 vs fp16 * [00:33:00] Why you can't build tinybox - Design constraints * [00:35:00] The Personal Compute Cluster * [00:37:00] Shoutout to our MosaicML podcast * [00:39:00] FLOPcoin and other use cases for the tinybox * [00:43:00] Rumors on GPT-4 architecture * [00:46:00] The Bitter Lesson * [00:48:00] Hiring and Changing mind on remote work * [00:52:00] Above/Below The API * [00:55:40] Comma Bodies & Computer Vision * [00:58:40] Merging with the machine and AI girlfriends * [01:02:00] Is AI gonna kill us all? * [01:09:00] Why Avatar 2 was bad Transcript Swyx: Hey everyone, welcome to the Latent Space podcast. This is Swyx, writer and editor of Latent Space. And Alessio is taking over with the intros, Alessio is Partner and CTO in residence at Decibel Partners. [00:00:20] Alessio: Hey everyone, today we have Geohot on the podcast, aka George Hotz. Everybody knows George, so I'm not going to do a big intro.
A couple of things that people might have missed: you traded the first ever unlocked iPhone for a Nissan 350Z and three new iPhones. You were then one of the first people to break into the PS3 to run arbitrary code. You got sued by Sony, you wrote a rap song to fight against that, which is still live on YouTube, which we're going to have in the show notes. You did not go to Tesla to build vision, and instead you started Comma.ai, which was an amazing engineering feat in itself, until you got a cease and desist from the government to not put these things on the street and turned that into a research only project. [00:01:00] George: You know they're out there. [00:01:01] Alessio: Yeah, yeah. [00:01:03] Swyx: They're out there. [00:01:04] Alessio: But like in a, you know, you market them as a research kind of like no warranty. [00:01:06] George: Because I use the word dev kit, that's not about the government, that's nothing to do with the government. We offer a great one-year warranty. The truth about that is it's gatekeeping. What's the difference between a dev kit and not a dev kit? Nothing. Just the question of do you think it's for you? And if you think it's for you, buy it. It's a consumer product. We call it a dev kit. If you have a problem with that, it's not for you. [00:01:28] Swyx: That's great insight. [00:01:30] Alessio: I was going through your blog posts to get ready. You wrote this post about The Hero's Journey. And you linked this thing called the portal story, which is kind of the set of stories in movies and books about people living this arbitrary life. And then the run to this magic portal kind of takes them into a new, very exciting life and dimension. When you wrote that post, you talked about TinyGrad, which is one of the projects we're talking about today. You mentioned this is more of a hobby, something that is not going to change the course of history. Obviously, you're now going full speed into it. So we would love to learn more about what was the portal that you ran into to get here. [00:02:03] George: Well, what you realize is... You know what made me realize that I absolutely had to do the company? Seeing Sam Altman go in front of Congress. Why? What are the odds they nationalize NVIDIA? What are the odds that large organizations in the government, but of course I repeat myself, decide to try to clamp down on accessibility of ML compute? I want to make sure that can't happen structurally. So that's why I realized that it's really important that I do this. And actually, from a more practical perspective, I'm working with NVIDIA and Qualcomm to buy chips. NVIDIA has the best training chips. Qualcomm has the best inference chips. Working with these companies is really difficult. So I'd like to start another organization that eventually in the limit, either works with people to make chips or makes chips itself and makes them available to an
Full Transcript and show notes: https://www.latent.space/p/function-agents?sd=pf Timestamps: [00:00:00] Intro [00:01:47] Recapping June 2023 Updates [00:06:24] Known Issues with Long Context [00:08:00] New Functions API [00:10:45] Riley Goodside [00:12:28] Simon Willison [00:14:30] Eric Elliott [00:16:05] Functions API and Agents [00:18:25] Functions API vs Google Vertex JSON [00:21:32] From English back to Code [00:26:14] Embedding Price Drop and Pinecone Perspective [00:30:39] Xenova and Huggingface Perspective [00:34:23] Function Selection [00:39:58] Designing Code Agents with Function API [00:42:16] Models as Routers [00:46:48] Prompt Engineering replaced by Finetuning [00:52:15] The 2 Code x LLM Paradigms [00:56:30] Smol Models for the future [00:58:54] The Evolution of the GPT API [01:03:27] Functions API Security vs Prompt Injection [01:16:18] GPT Model Upgrades [01:17:36] JSONformer [01:21:03] Closing Comments - What We Want Next This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
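For readers who haven't seen it, the API this episode dissects lets you declare functions as JSON Schema and have the model emit structured call arguments. A minimal sketch as it looked with the June 2023 (0613) openai Python client; `get_weather` is a made-up example function:

```python
# Minimal function-calling request, 0613-era openai client (pre-1.0 SDK).
import json
import openai

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    functions=[{
        "name": "get_weather",  # hypothetical function we expose to the model
        "description": "Get current weather for a city",
        "parameters": {         # JSON Schema describing the arguments
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
)

msg = resp["choices"][0]["message"]
if msg.get("function_call"):
    # The model returns a function name plus JSON arguments; your code runs it.
    args = json.loads(msg["function_call"]["arguments"])
    print("model wants get_weather with", args)
```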
Welcome to the almost 3k latent space explorers that joined us last month! We're holding our first SF listener meetup with Practical AI next Monday; join us if you want to meet past guests and put faces to voices! All events are in /community. Who among you regularly clicks the ubiquitous 👍 /👎 buttons in ChatGPT/Bard/etc? Anyone? I don't see any hands up. OpenAI has told us how important reinforcement learning from human feedback (RLHF) is to creating the magic that is ChatGPT, but we know from our conversation with Databricks' Mike Conover just how hard it is to get even 15,000 pieces of explicit, high quality human responses. We are shockingly reliant on good human feedback. Andrej Karpathy's recent keynote at Microsoft Build on the State of GPT demonstrated just how much of the training process relies on contractors to supply the millions of items of human feedback needed to make a ChatGPT-quality LLM (highlighted by us in red): But the collection of good feedback is an incredibly messy problem. First of all, if you have contractors paid by the datapoint, they are incentivized to blast through as many as possible without much thought. So you hire more contractors and double, maybe triple, your costs. Ok, you say, let's recruit missionaries, not mercenaries. People should volunteer their data! Then you run into the same problem we and any consumer review platform run into - the vast majority of people send nothing at all, and those who do disproportionately represent negative reactions. More subtle problems emerge when you try to capture subjective human responses - the reason ChatGPT responses tend to be inhumanly verbose is that humans have a well documented "longer = better" bias when classifying responses in a "laboratory setting". The fix for this, of course, is to get out of the lab and learn from real human behavior, not artificially constructed human feedback. You don't see a thumbs up/down button in GitHub Copilot nor Codeium nor Codium. Instead, they work an implicit accept/reject event into the product workflow, such that you cannot help but give feedback while you use the product. This way you hear from all your users, in their natural environments doing valuable tasks they are familiar with. The prototypical example of this is Midjourney, who unobtrusively collect 1 of 9 types of feedback from every user as part of their workflow, in exchange for much faster first draft image generations: The best known public example of AI product telemetry is in the Copilot-Explorer writeup: Copilot checks for the presence of generated code at intervals from 15 to 600 seconds after insertion, which is what enables GitHub to claim that 40% of code is generated by Copilot (a sketch of this pattern follows below). This is fantastic and "obviously" the future of productized AI. Every AI application should figure out how to learn from all their real users, not some contractors in a foreign country. Most prompt engineers and prompt engineering tooling also tend to focus on pre-production prototyping, but could also benefit from A/B testing their prompts in the real world. In short, AI may need Analytics more than Analytics needs AI. Amplitude's Month of AI This is why Amplitude is going hard on AI - and why we recently spent a weekend talking to Jeffrey Wang, cofounder and chief architect at Amplitude, and Joe Reeve, head of AI, recording a live episode at the AI + Product Hackathon where 150+ hackers gathered to compete for over $22.5k in prizes from Amplitude, New Relic, LanceDB, AWS, and more.
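Before we get to the Amplitude conversation, here is a minimal sketch of that Copilot-style implicit-feedback check. The `editor` object and `log_event` sink are hypothetical stand-ins for your product's own surfaces:

```python
# Implicit acceptance telemetry: after inserting a completion, re-check at
# increasing delays whether the generated text is still in the buffer.
import threading

CHECK_DELAYS = [15, 60, 300, 600]  # seconds, echoing the 15-600s scheme above

def track_completion(editor, completion_id, text, log_event):
    """editor.current_buffer() and log_event(...) are hypothetical stand-ins."""
    def check(delay):
        still_there = text in editor.current_buffer()
        log_event({
            "completion_id": completion_id,
            "delay_s": delay,
            "accepted": still_there,  # the implicit 👍/👎, no button needed
        })

    for delay in CHECK_DELAYS:
        threading.Timer(delay, check, args=(delay,)).start()
```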
To put things in perspective, Amplitude is a legendary YC alum with $238M of revenue in 2022 — our first guests representing the AI efforts of a public company! We chatted about how they have been approaching AI in their product (“question to chart” BI, text field autofill, instrumenting Amplitude with Amplitude), some of the issues they’ve had with different models, and the importance of first-party data in the world of LLMs. Another topic that came out of the Q&A was this idea of almost an “AmplitudeGPT”; rather than using language to simply generate a query, you could have these models investigate reasons for why certain behavior is happening in your user base. It was a really good discussion, and hope you all enjoy listening to it! Sections * [00:00:47] Amplitude's founding story and pivot * [00:03:28] Amplitude as an AI company and opportunities * [00:07:14] Limitations and challenges with using AI models * [00:10:56] Using Amplitude's product to build Amplitude - instrumenting AI * [00:12:32] Existing ML models in Amplitude's product and customer use cases * [00:15:50] “A/Z testing” and adaptable products * [00:19:33] The future of analytics and dashboards * [00:21:03] Optimizing for metrics in chatbots and AI products * [00:26:22] Using general models vs. fine-tuned models * [00:30:24] The importance of models vs. data - Amplitude's data set * [00:39:00] Lightning Round + Q&A Show Notes * Amplitude * Sonalight to Amplitude pivot announcement * The Slack origin story * Reverse Engineering Copilot * Simon Willison’s blog Transcript Editor’s note: all timestamps are 1 minute behind because we hadn’t yet added the intro before making these. Sorry about that! Alessio: Thank you everyone for coming. Hopefully, some of you have listened to the podcast before, if you haven't, we focus on AI research and application. So we don't focus on “AI is going to kill us all”. We don't think about virtual girlfriends. We don't think about all of these more societal things. We're focused on models: how do you build them? How do you train them? How do you use them in production? What are some of the limitations on getting these things from demos to things that millions of users use? And obviously, a lot of you are building things. Otherwise, you wouldn't be here. And some of you have been building things for a long time, and now have a new paradigm that you want to build on top of. So I'm excited to dive in here. And maybe, I mean, I'm sure most people know you, but maybe you want to do intros and give a little background. [00:00:47] Jeffrey: Sure. Yeah, hey, everyone, met you all this morning, but I'm Jeffrey. I'm one of the co-founders and Chief Architect here at Amplitude. Been working on this product analytics thing, helping people understand user behavior data and make great product decisions and build better products for the last decade or so. And obviously, AI is a technology that we've been leveraging for a long time, but the recent trends are particularly exciting. And yeah, we have a lot of thoughts on how to apply that to our space, what we're doing in our product, and what we think the future of AI and product development and product data is. So excited to talk through some of those. [00:01:20] Joe: Yeah, I'm Joe, Joe Reeve. I've got a background in sort of startups and tech, been professional software engineer since I was 16, quit college. And at the moment, I'm running sort of AI R&D efforts here at Amplitude. 
Super excited about all the new stuff, but also all the stuff that Amplitude's been doing for a long time and how we're sort of getting renewed interest and excitement and abilities to push that even further forwards. [00:01:44] Swyx: So I think it's useful for people listening on the podcast and also some people here. Can you contextualize Amplitude as an AI company? Like what does that mean to you? What unique opportunities do you guys have? [00:02:02] Jeffrey: Sure, yeah, happy to speak to that. So, you know, if we think about the fundamental thing that our customers of Amplitude try to do, it's they want to look at their product data and they want to figure out how do I make my product better? And the really cool thing about product data is that one, it's often like very high fidelity, right? Digital products compared to, you know, let's say physical products before them have way more information about what's going on. And so that's why product data is, you know, even a thing at all, right? You finally have that feedback loop of, hey, I built this thing. This is how people are using it. Now let me learn from that and make my product better. Now, one of the downsides of that is that the data is massive. If you look at any of the internet scale products out there, they generate enormous amounts of data. And the ability of humans to kind of sift through that data is obviously limited. At Amplitude, we try to give people as many tools, whether AI or not, in order to process that. But at the end of the day, if you could get from the data and what user behavior is happening in your product to the insights of how to make your product better without as much manual work, that's kind of the holy grail of product analytics. And so in some sense, Amplitude has always been a company on the path to AI because figuring out how to make your product better from data is ultimately an AI problem. And so we're kind of just solving all the barriers in the way, like getting data in first, building good models for short-term things. And long-term, it's always been about, hey, how can you take product data and automatically make your product better as fast as possible? [00:03:28] Alessio: So that's the future of Amplitude. And a lot of people here probably want to start companies and whatnot. So maybe you want to give a 60 seconds of why you started Amplitude and what the story was like and maybe the first three to six months, what the challenges were. [00:03:42] Jeffrey: Yeah, of course. It's funny that we talk about this because the start of Amplitude is actually almost more AI than the current state. And so actually my two co-founders, Spencer and Curtis, they went through YC originally with not Amplitude, but SonaLite, which was a text-by-voice company. So it was kind of before the era of Siri and those types of technologies where they wanted to build something that would read text messages to them, that's easy, but also do
Read: https://www.latent.space/p/ai-interfaces-and-notion Show Notes * Linus on Twitter * Linus' personal blog * Notion * Notion AI * Notion Projects * AI UX Meetup Recap Timestamps * [00:03:30] Starting the AI / UX community * [00:10:01] Most knowledge work is not text generation * [00:16:21] Finding the right constraints and interface for AI * [00:19:06] Linus' journey to working at Notion * [00:23:29] The importance of notations and interfaces * [00:26:07] Setting interface defaults and standards * [00:32:36] The challenges of designing AI agents * [00:39:43] Notion deep dive: "Blocks", AI, and more * [00:51:00] Prompt engineering at Notion * [01:02:00] Lightning Round Transcript Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20] Swyx: And today we're not in our regular studio. We're actually at the Notion New York headquarters. Thanks to Linus. Welcome. [00:00:28] Linus: Thank you. Thanks for having me. [00:00:29] Swyx: Thanks for having us in your beautiful office. It is actually very startling how gorgeous the Notion offices are. And it's basically the same aesthetic. [00:00:38] Linus: It's a very consistent aesthetic. It's the same aesthetic in San Francisco and the other offices. It's been for many, many years. [00:00:46] Swyx: You take a lot of craft in everything that you guys do. Yeah. [00:00:50] Linus: I think we can, I'm sure, talk about this more later, but there is a consistent kind of focus on taste that I think flows down from Ivan and the founders into the product. [00:00:59] Swyx: So I'll introduce you a little bit, but also there's just, you're a very hard person to introduce because you do a lot of things. You got your BA in computer science at Berkeley. Even while you're at Berkeley, you're involved in a bunch of interesting things at Replit, CatalystX, Hack Club and Dorm Room Fund. I always love seeing people come out of Dorm Room Fund because they tend to be very entrepreneurial. You were a product engineer at IdeaFlow, a resident at Betaworks. You took a year off to do independent research and then you've finally found your home at Notion. What's one thing that people should know about you that's not on your typical LinkedIn profile? [00:01:39] Linus: Putting me on the spot. I think, I mean, just because I have so much work kind of out there, I feel like professionally, at least, anything that you would want to know about me, you can probably dig up, but I'm a big city person, but I don't come from the city. I went to school, I grew up in Indiana, in the middle of nowhere, near Purdue University, a little suburb. I only came out to the Bay for school and then I moved to New York afterwards, which is where I'm currently. I'm in Notion, New York. But I still carry within me a kind of love and affection for small town, Indiana, small town, flyover country. [00:02:10] Swyx: We do have a bit of indulgence in this. I'm from a small country and I think Alessio, you also kind of identified with this a little bit. Is there anything that people should know about Purdue, apart from the chickens? [00:02:24] Linus: Purdue has one of the largest international student populations in the country, which I don't know. I don't know exactly why, but because it's a state school, the focus is a lot on STEM topics.
Purdue is well known for engineering and so we tend to have a lot of folks from abroad, which is particularly rare for a university in, I don't know, that's kind of like predominantly white American and kind of Midwestern state. That makes Purdue and the surrounding sort of area kind of like a younger, more diverse international island within the, I guess, broader world that is Indiana. [00:02:58] Swyx: Fair enough. We can always dive into sort of flyover country or, you know, small town insights later, but you and I, all three of us actually recently connected at AIUX SF, which is the first AIUX meetup, essentially which just came out of like a Twitter conversation. You and I have been involved in HCI Twitter is kind of how I think about it for a little bit and when I saw that you were in town, Geoffrey Litt was in town, Maggie Appleton in town, all on the same date, I was like, we have to have a meetup and that's how this thing was born. Well, what did it look like from your end? [00:03:30] Linus: From my end, it looked like you did all of the work and I... [00:03:33] Swyx: Well, you got us the Notion. Yeah, yeah. [00:03:36] Linus: It was also in the Notion office, it was in the San Francisco one and then thereafter there was a New York one that I decided I couldn't make. But yeah, from my end it was, and I'm sure you were too, but I was really surprised by both the mixture of people that we ended up getting and the number of people that we ended up getting. There was just a lot of attention on, obviously there was a lot of attention on the technology itself of GPT and language models and so on, but I was surprised by the interest specifically on trying to come up with interfaces that were outside of the box and the people that were interested in that topic. And so we ended up having a packed house and lots of interesting demos. I've heard multiple people comment on the event afterwards that they were positively surprised by the mixture of both the ML, AI-focused people at the event as well as the interface HCI-focused people. [00:04:24] Swyx: Yeah. I kind of see you as one of the leading, I guess, AI UX people, so I hope that we are maybe starting a new discipline, maybe. [00:04:33] Linus: Yeah, I mean, there is this kind of growing contingency of people interested in exploring the intersection of those things, so I'm excited for where that's going to go. [00:04:41] Swyx: I don't know if it's worth going through favorite demos. It was a little while ago, so I don't know if... [00:04:48] Alessio: There was, I forget who made it, but there was this new document writing tool where you could apply brushes to different paragraphs. [00:04:56] Linus: Oh, this was Amelia's. Yeah, yeah, yeah. [00:04:58] Alessio: You could set a tone, both in terms of writer inspiration and then a tone that you wanted, and then you could drag and drop different tones into paragraphs and have the model rewrite them. It was the first time that it's not just auto-complete, there's more to it. And it's not asked in a prompt, it's this funny drag-an-emoji over it. [00:05:20] Linus: Right. [00:05:21] Swyx: I actually thought that you had done some kind of demo where you could select text and then augment it in different moods, but maybe it wasn't you, maybe it was just someone else [00:05:28] Linus: I had done something similar, with slightly different building blocks. I think Amelia's demo was, there was sort of a preset palette of brushes and you apply them to text. 
I had built something related last year: I prototyped a way to give people sliders for different semantic attributes of text. And so you could start with a sentence, and you had a slider for length and a slider for how philosophical the text is, and a slider for how positive or negative the sentiment in the text is, and you could adjust any of them and have the language model reproduce the text. Yeah, similar, but continuous control versus distinct brushes, I think is an interesting distinction there. [00:06:03] Swyx: I should add it for listeners, if you missed the meetup, which most people will not have seen, we actually did a separate post with timestamps of each video, so you can look at that. [00:06:13] Alessio: Sorry, Linus, this is unrelated, but I think you've built over a hundred side projects or something like that. A hundred? [00:06:20] Swyx: I think there's a lot of people... I know it's a hundred. [00:06:22] Alessio: I think it's a lot of them. [00:06:23] Swyx: A lot of them are kind of small. [00:06:25] Alessio: Yeah, well, I mean, it still counts. I think there's a lot of people that are excited about the technology and want to hack on things. Do you have any tips on how to box what you want to build, how do you decide what goes into it? Because all of these things, you could build so many more things on top of it. Where do you decide when you're done? [00:06:44] Linus: So my projects actually tend to be... I think especially when people approach project building with a goal of learning, I think a common mistake is to be over-ambitious and sort of not scope things very tightly. And so a classic kind of failure mode is, you say, I'm really interested in learning how to use the GPT-4 API, and I'm also interested in vector databases, and I'm also interested in Next.js. And then you devise a project that's going to take many weeks, and you glue all these things together. And it could be a really cool idea, but then especially if you have a day job and other things that life throws your way, it's hard to actually get to a point where you can ship something. And so one of the things that I got really good at was, one, knowing exactly how quickly I could work, at least on the technologies that I knew well, and then only adding one new unknown thing to learn per project. So it may be that for this project, I'm going to learn how the embedding API works. Or for this project, I'm going to learn how to do vector stuff with PyTorch or something. And then I would scope things so that it fit in one chunk of time, like Friday night to Sunday night or something like that. And then I would scope the project so that I could ship something, as much work as I could fit into a two-day period, so that at the end of that weekend, I could ship something. And then afterwards, if I want to add something, I have time to do it and a chance to do that. But it's already shipped, so there's already momentum, and people are using it, or I'm using it, and so there's a reason to continue building. So only adding one new
We are hosting the AI World's Fair in San Francisco on June 8th! You can RSVP here. Come meet fellow builders, see amazing AI tech showcases at different booths around the venue, all mixed with elements of traditional fairs: live music, drinks, games, and food! We are also at Amplitude's AI x Product Hackathon and are hosting our first joint Latent Space + Practical AI Podcast Listener Meetup next month! We are honored by the rave reviews for our last episode with MosaicML! They are also welcome on Apple Podcasts and Twitter/HN/LinkedIn/Mastodon etc! We recently spent a wonderful week with Itamar Friedman, visiting all the way from Tel Aviv in Israel: * We first recorded a podcast (releasing with this newsletter) covering Codium AI, the hot new VSCode/Jetbrains IDE extension focused on test generation for Python and JS/TS, with plans for a Code Integrity Agent. * Then we attended Agent Weekend, where the founders of multiple AI/agent projects got together with a presentation from Toran Bruce Richards on Auto-GPT's roadmap and then from Itamar on Codium's roadmap * Then some of us stayed to take part in the NextGen Hackathon and won first place with the new AI Maintainer project. So… that makes it really hard to recap everything for you. But we'll try! Podcast: Codium: Code Integrity with Zero Bugs When it launched in 2021, there was a lot of skepticism around GitHub Copilot. Fast forward to 2023, and 40% of all code is checked in unmodified from Copilot. Codium burst on the scene this year, emerging from stealth with an $11m seed, their own foundation model (TestGPT-1) and a vision to revolutionize coding by 2025. You might have heard of "DRY" programming (Don't Repeat Yourself), which aims to replace repetition with abstraction. Itamar came on the pod to discuss their "extreme DRY" vision: if you already spent time writing a spec, why repeat yourself by writing the code for it? If the spec is thorough enough, automated agents could write the whole thing for you. Live Demo Video Section This is referenced in the podcast about 6 minutes in. Timestamps, show notes, and transcript are below the fold. We would really appreciate if you shared our pod with friends on Twitter, LinkedIn, Mastodon, Bluesky, or your social media poison of choice! Auto-GPT: A Roadmap To The Future of Work Making his first public appearance, Toran (perhaps better known as @SigGravitas on GitHub) presented at Agents Weekend: Lightly edited notes for those who want a summary of the talk: * What is AutoGPT? AutoGPT is an AI agent that utilizes a Large Language Model to drive its actions and decisions. It can be best described as a user sitting at a computer, planning and interacting with the system based on its goals. Unlike traditional LLM applications, AutoGPT does not require repeated prompting by a human. Instead, it generates its own 'thoughts', criticizes its own strategy and decides what next actions to take. * AutoGPT was released on GitHub in March 2023, and went viral on April 1 with a video showing automatic code generation. Two months later it has 132k+ stars, is the 29th highest-ranked open-source project of all time, with a thriving community of 37.5k+ Discord members and 1M+ downloads. * What's next for AutoGPT? The initial release required users to know how to build and run a codebase. They recently announced plans for a web/desktop UI and mobile app to enable nontechnical/everyday users to use AutoGPT.
They are also working on an extensible plugin ecosystem called the Abilities Hub, also targeted at nontechnical users. * Improving Efficacy. AutoGPT has many well documented cases where it trips up: getting stuck in loops, using placeholders instead of actual content in commands, and making obvious mistakes like execute_code("write a cookbook"). The plan is a new design called Challenge Driven Development - Challenges are goal-oriented tasks or problems that Auto-GPT has difficulty solving or has not yet been able to accomplish. These may include improving specific functionalities, enhancing the model's understanding of specific domains, or even developing new features that the current version of Auto-GPT lacks. (AI Maintainer was born out of one such challenge). Itamar compared this with Software 1.0 (Test Driven Development), and Software 2.0 (Dataset Driven Development). * Self-Improvement. Auto-GPT will analyze its own codebase and contribute to its own improvement. AI Safety (aka not-kill-everyone-ists) people like Connor Leahy might freak out at this, but for what it's worth we were pleasantly surprised to learn that Itamar and many other folks on the Auto-GPT team are equally concerned and mindful about x-risk as well. The overwhelming theme of Auto-GPT's roadmap was accessibility - making AI Agents usable by all instead of the few. Podcast Timestamps * [00:00:00] Introductions * [00:01:30] Itamar's background and previous startups * [00:03:30] Vision for Codium AI: reaching "zero bugs" * [00:06:00] Demo of Codium AI and how it works * [00:15:30] Building on VS Code vs JetBrains * [00:22:30] Future of software development and the role of developers * [00:27:00] The vision of integrating natural language, testing, and code * [00:30:00] Benchmarking AI models and choosing the right models for different tasks * [00:39:00] Codium AI spec generation and editing * [00:43:30] Reconciling differences in languages between specs, tests, and code * [00:52:30] The Israeli tech scene and startup culture * [01:03:00] Lightning Round Show Notes * Codium AI * Visualead * AutoGPT * StarCoder * TDD (Test-Driven Development) * AST (Abstract Syntax Tree) * LangChain * ICON * AI21 Transcript Alessio: [00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host, Swyx, writer and editor of Latent Space. Swyx: Today we have a special guest, Itamar Friedman, all the way from Tel Aviv, CEO and co-founder of Codium AI. Welcome. Itamar: Hey, great being here. Thank you for inviting me. Swyx: You like the studio? It's nice, right? Itamar: Yeah, they're awesome. Swyx: So I'm gonna introduce your background a little bit and then we'll learn a bit more about who you are. So you graduated from the Technion, Israel Institute of Technology, which is kind of like the MIT of Israel. You did a BS in CS, and then you also did a Master's in Computer Vision, which is kind of relevant. You had other startups before this, but your sort of claim to fame is Visualead, which you started in 2011 and which got acquired by Alibaba Group. You showed me your website, which is the sort of QR codes with different forms of visibility. And in China that's a huge, huge deal. It's starting to become a bigger deal in the West. My favorite anecdote that you told me was something about how much sales use you saved or something. I forget what the number was.
Itamar: Generally speaking, like there's a lot of peer-to-peer transactions going on, like payments, in China with QR codes. So basically if for example 5% of the scanning does not work and with our scanner we [00:01:30] reduce it to 4%, that's a lot of money. Could be tens of millions of dollars a day. Swyx: And at the scale of Alibaba, it serves all of China. It's crazy. You did that for seven years and you were at Alibaba until 2021, when you took some time off and then hooked up with Dedy, who you've known for 25 years, to start Codium AI, and you just raised your $11 million seed round with TLV Partners and Vine. Congrats. Should we go right into Codium? What is Codium? Itamar: So we are an AI coding assistant / agent to help developers reach zero bugs. We don't do that today. Right now, we help to reduce the amount of bugs. Actually you can see people commenting on our marketplace page saying that they found bugs with our tool, and that's like our premise. Our vision is, like for Tesla it's zero emissions or something like that, for us it's zero bugs. We started with building an IDE extension either in VS Code or in JetBrains. And that actually works alongside the main panel where you write your code, and I can show later: what we do is analyze the code, whether you started writing it or you completed it. Like you can go both TDD (Test-Driven Development) or classical coding. And we offer analysis, tests, whether they pass or not, we further self-debug [00:03:00] them and make suggestions, eventually helping to improve the code quality, specifically on code logic testing. Alessio: How did you get there? Obviously it's a great idea. Like, what was the idea maze? How did you get here? Itamar: I'll go back long. So, yes, I was two and a half times a CTO, a VC-backed startup CTO, where we talked about the last one that I sold to Alibaba. But basically, it's weird to say, with 20 years already as an R&D manager, I'm not like the best programmer, because like you mentioned, I'm coming more from the machine learning / computer vision side, one of the main applications, but a lot of optimization. So I'm not necessarily the best coder, but I am like a 20-year R&D manager. And I found that verifying code logic is a very hard thing, and one of the things that really makes it difficult to increase the development velocity. So you have tools related to checking performance. You have tools for vulnerabilities and security, Israelis are really good at that. But do you have a tool that actually helps you test code logic? I think what we have, like dozens or hundreds, even thousands, help you on the end to end, maybe on the microservice integration system. But when you talk about code level, there isn't anything. So that was the pain I always had, especially when I did have tools for that, for the hardware. Like I worked at Mellanox, later sold to Nvidia, as a student, and we had formal tools, et cetera. [00:04:30] So that's one part. The second thing is that after being sold to Alibaba, the team and I were quite a big team that worked on machine
We are excited to be the first podcast in the world to release an in-depth interview on the new SOTA in commercially licensed open source models - MosaicML MPT-7B! The Latent Space crew will be at the NYC Lux AI Summit next week, and have two meetups in June. As usual, all events are on the Community page! We are also inviting beta testers for the upcoming AI for Engineers course. See you soon! One of GPT3's biggest limitations is context length - you can only send it up to 4000 tokens (3k words, 6 pages) before it throws a hard error, requiring you to bring in LangChain and other retrieval techniques to process long documents and prompts. But MosaicML recently open sourced MPT-7B, the newest addition to their Foundation Series, with context length going up to 84,000 tokens (63k words, 126 pages): This transformer model, trained from scratch on 1 trillion tokens of text and code (compared to 300B for Pythia and OpenLLaMA, and 800B for StableLM), matches the quality of LLaMA-7B. It was trained on the MosaicML platform in 9.5 days on 440 GPUs with no human intervention, costing approximately $200,000. Unlike many open models, MPT-7B is licensed for commercial use and it's optimized for fast training and inference through FlashAttention and FasterTransformer. They also released 3 finetuned models starting from the base MPT-7B: * MPT-7B-Instruct: finetuned on dolly_hhrlhf, a dataset built on top of dolly-5k (see our Dolly episode for more details). * MPT-7B-Chat: finetuned on the ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets. * MPT-7B-StoryWriter-65k+: finetuned with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. While 65k is the advertised size, the team has gotten up to 84k tokens in response when running on a single node of A100-80GB GPUs. ALiBi is the dark magic that makes this possible. Turns out The Great Gatsby is only about 68k tokens, so the team used the model to create new epilogues for it! On top of the model checkpoints, the team also open-sourced the entire codebase for pretraining, finetuning, and evaluating MPT via their new MosaicML LLM Foundry. The table we showed above was created using the LLM Foundry in-context-learning eval framework itself! In this episode, we chatted with the leads of MPT-7B at Mosaic: Jonathan Frankle, Chief Scientist, and Abhinav Venigalla, Research Scientist who spearheaded the MPT-7B training run. We talked about some of the innovations they've brought into the training process to remove the need for 2am on-call PagerDutys, why the LLM dataset mix is such an important yet dark art, and why some of the traditional multiple-choice benchmarks might not be very helpful for the type of technology we are building. Show Notes * Introducing MPT-7B * Cerebras * Lottery Ticket Hypothesis * Hazy Research * ALiBi * Flash Attention * FasterTransformer * List of naughty words for C4 https://twitter.com/code_star/status/1661386844250963972 * What is Sparsity? * Hungry Hungry Hippos * BF16 FP p.s. yes, MPT-7B really is codenamed LLongboi! Timestamps * Introductions [00:00:00] * Intro to Mosaic [00:03:20] * Training and Creating the Models [00:05:45] * Data Choices and the Importance of Repetition [00:08:45] * The Central Question: What Mix of Data Sets Should You Use?
[00:10:00] * Evaluation Challenges of LLMs [00:13:00] * Flash Attention [00:16:00] * Fine-tuning for Creativity [00:19:50] * Open Source Licenses and Ethical Considerations [00:23:00] * Training Stability Enhancement [00:25:15] * Data Readiness & Training Preparation [00:30:00] * Dynamic Real-time Model Evaluation [00:34:00] * Open Science for Affordable AI Research [00:36:00] * The Open Approach [00:40:15] * The Future of Mosaic [00:44:11] * Speed and Efficiency [00:48:01] * Trends and Transformers [00:54:00] * Lightning Round and Closing [01:00:55] Transcript Alessio: [00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host, Swyx, writer and editor of Latent Space. Swyx: Hey, and today we have Jonathan and Abhi from MosaicML. Welcome to our studio. Jonathan: Guys thank you so much for having us. Thanks so much. Swyx: How's it feel? Jonathan: Honestly, I've been doing a lot of podcasts during the pandemic, and it has not been the same. Swyx: No, not the same actually. So you have on your bio that you're primarily based in Boston, Jonathan: New York. New York, yeah. My Twitter bio was a probability distribution over locations. Swyx: Exactly, exactly. So I DMd you because I was obviously very interested in MPT-7B and DMd you, I was like, for the 0.2% of the time that you're in San Francisco, can you please come to a podcast studio, and you're like, I'm there next week. Jonathan: Yeah, it worked out perfectly. Swyx: We're really lucky to have you. I'll read off a few intros that people should know about you and then you can fill in the blanks. So Jonathan, you did your BS and MS at Princeton in programming languages and then found your way into ML for your PhD at MIT, where you made a real splash with the lottery ticket hypothesis in 2018, which people can check up on. I think you've done a few podcasts about it over the years, which has been highly influential, and we'll talk about sparse models at Mosaic. You have also had some side [00:01:30] quests. You taught programming for lawyers and you did some law and privacy stuff in, in DC and also did some cryptography stuff. Um, and you've been an assistant professor at Harvard before earning your PhD. Jonathan: I've yet to start. Swyx: You, you yet to start. Okay. But you just got your PhD. Jonathan: I technically just got my PhD. I was at Mosaic, which delayed my defense by about two years. It was, I was at 99% done for two years. Got the job at Harvard, Mosaic started, and I had better things to do than write my dissertation for two years. Swyx: You know, you know, this is very out of order. Jonathan: Like, oh, completely out of order, completely backwards. Go talk to my advisor about that. He's also an advisor at Mosaic and has been from the beginning. And, you know, go talk to him about finishing on time. Swyx: Great, great, great. And just to fill it out, Abhi, you did your BS and MS at MIT, you were a researcher at Cerebras, and you're now a research scientist at Mosaic. Just before we go into Mosaic stuff, I'm actually very curious about Cerebras and, uh, just that, that space in general. Um, what are they doing that people should know about? Abhinav: Yeah, absolutely. Um, I think the biggest thing about Cerebras is that they're really building, you know, kind of the NextGen computing platform beyond, like, GPUs.
Um, they're trying to build a system that uses an entire wafer, you know, rather than cutting up a wafer into smaller chips and trying to train a model on that entire system, or actually more recently on many such wafers. Um, so it's, and it's really extraordinary. I think it's like the first time ever that kind of wafer scale computing has ever really worked. And so it's a really exciting time to be there, trying to figure out how we can map ML workloads to work, um, on a much, much bigger chip. Swyx: And do you use like [00:03:00] a different programming language or framework to do that? Or is that like.. Abhinav: Yeah, so I mean, things have changed a bit since I was there. I think, um, you can actually run just normal tensor flow and pie torch on there. Um, so they've built a kind of software stack that compiles it down. So it actually just kind of works naturally. But yeah. Jonathan : Compiled versions of Python is a hot topic at the moment with Mojo as well. Swyx: And then Mosaic, you, you spearheaded the MPT-7B effort. INTRO TO MOSAIC [00:03:20] Abhinav: Uh, yeah. Yeah, so it's kind of like, it's been maybe six months, 12 months in the making. We kind of started working on LMs sort of back in the summer of last year. Um, and then we came with this blog post where we kind of profiled a lot of LMs and saw, hey, the cost of training is actually a lot lower than what people might think. Um, and then since then, you know, being inspired by kind of, you know, meta’s release, so the LLaMA models and lots of other open source work, we kind of started working towards, well, what if we were to release a really good kind of 7 billion parameter model? And that's what MPT is. Alessio:You know, we mentioned some of the podcasts you had done, Jonathan, I think in one of them you mentioned Mosaic was not planning on building a model and releasing and obviously you eventually did. So what are some of the things that got you there that maybe obviously LLaMA you mentioned was an inspiration. You now have both the training and like inference products that you offer. Was this more of a research challenge in a way, uh, that you wanted to do? Or how did the idea come to be? Jonathan: I think there were a couple of things. So we still don't have a first class model. We're not an open AI where, you know, our businesses come to use our one great model. Our business is built around customers creating their own models. But at the end of the day, if customers are gonna create their own models, we have to have the tools to help them do that, and to have the tools to help them do that and know that they work we have to create our own models to start. We have to know that we can do something great if customers are gonna do something great. And one too many people may have challenged me on Twitter about the fact that, you know, mosaic claims all these amazing numbers, but, you know, I believe not to, you know, call out Ross Whiteman here, but, you know, I believe he said at some point, you know, show us the pudding. Um, and so Ross, you know, please let me know how the pudding tastes. But in all seriousness, like I think there is something, this is a demo in some sense. This is to say we did this
Tomorrow, 5/16, we’re hosting Latent Space Liftoff Day in San Francisco. We have some amazing demos from founders at 5:30pm, and we’ll have an open co-working starting at 2pm. Spaces are limited, so please RSVP here! One of the biggest criticisms of large language models is their inability to tightly follow requirements without extensive prompt engineering. You might have seen examples of ChatGPT playing a game of chess and making many invalid moves, or adding new pieces to the board. Guardrails AI aims to solve these issues by adding a formalized structure around inference calls, which validates both the structure and quality of the output. In this episode, Shreya Rajpal, creator of Guardrails AI, walks us through the inspiration behind the project, why it’s so important for models’ outputs to be predictable, and why she went with an XML-like syntax. Guardrails TLDR Guardrails AI rules are created as RAILs, which have three main “atomic objects”: * Output: what should the output look like? * Prompt: template for requests that can be interpolated * Script: custom rules for validation and correction Each RAIL can then be used as a “guard” when calling an LLM. You can think of a guard as a wrapper for the API call. Before returning the output, it will validate it, and if it doesn’t pass it will ask the model again. The original post includes an example of a bad SQL query being returned, and what the ReAsk query looks like. Each RAIL is also model-agnostic. This allows for output consistency across different models, even if they have slight differences in how they are prompted. Guardrails can easily be used with LangChain and other tools to structure your outputs! Show Notes * Guardrails AI * Text2SQL * Use Guardrails and GPT to play valid chess * Shreya’s AI Tinkerers demo * Hazy Research Lab * AutoPR * Ian Goodfellow * GANs (Generative Adversarial Networks) Timestamps * [00:00:00] Shreya's Intro * [00:02:30] What's Guardrails AI? * [00:05:50] Why XML instead of YAML or JSON? * [00:10:00] SQL as a validation language? * [00:14:00] RAIL composability and package manager? * [00:16:00] Using Guardrails for agents * [00:23:50] Guardrails "contracts" and guarantees * [00:31:30] SLAs for LLMs * [00:40:00] How to prioritize as a solo founder in open source * [00:43:00] Guardrails open source community involvement * [00:46:00] Working with Ian Goodfellow * [00:50:00] Research coming out of Stanford * [00:52:00] Lightning Round Transcript Alessio: [00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my cohost Swyx, writer and editor of Latent Space. Swyx: And today we have Shreya Rajpal in the studio. Welcome Shreya. Shreya: Hi. Hi. Excited to be here. Swyx: Excited to have you too. This has been a long time coming; you and I have chatted a little bit, and I'm excited to learn more about Guardrails. We do a little intro for you and then we have you fill in the blanks. So you, you got your bachelor's at IIT Delhi with a minor in computer science focused on AI, which is super relevant now. I bet you didn't think about that in undergrad. Shreya: Yeah, I think it's, it's interesting because like, I started working in AI back in 2014 and back then I was like, oh, it's, it's here. This is like almost changing the world already. So it feels like that that like took nine years, that meme of the thing almost, almost arriving. So yeah, it's felt this way where [00:01:00] it's almost here.
It's almost changed the world for as long as I've been working in it. Swyx: Yeah. That's awesome. Maybe we can explore your, like, the origins of your interests, because then you went on to UIUC to do your master's, also in AI. And then it looks like you went to drive.ai to work on Perception and then to Apple SPG, as, as the cool kids call it, the Special Projects Group, working with Ian Goodfellow. Yeah, that's right. And then you were at Predibase up until recently? Actually, I don't know if you've quit yet. I have, yeah. Okay, good, good, good. You haven't updated your LinkedIn, but we're getting the breaking news that you're working on Guardrails full-time. Yeah, well that's the professional history. We can double back to fill in the blanks on anything. But what's the personal side? You know, what's not on your LinkedIn that people should know about you? Shreya: I think the most obvious thing, this is like, this is still professional, but the most obvious thing that isn't on my LinkedIn yet is, is Guardrails. So, yeah. Like you mentioned, I haven't updated my LinkedIn yet, but I quit some time ago and I've been devoting like all of my energy. Yeah. Full-time working on Guardrails and growing the open source package and building out exciting features, et cetera. So that's probably the thing that's missing the most. I think another, more personal skill, which I [00:02:00] think I'm like kind of okay at for an amateur, and that isn't on my LinkedIn, is, is pottery. So I really enjoy pottery and yeah, don't know how to slot that in amongst, like, all of the AI. So that's not in there. Swyx: Well, you like shaping things into containers where, where like unstructured things can kind of flow in, so, yeah, yeah, yeah. See I can, I can spin it for you. Shreya: I should, I should use that. Yeah. Yeah. Alessio: Maybe for the audience, you wanna give a little bit of intro on Guardrails AI, what it is, why you wanted to start it? Shreya: Yeah, yeah, for sure. So Guardrails, or, or the need for Guardrails, really came up as I was kind of like building some of my own projects in the space and like really solving some of my own problems. So this was back at like the end of last year. I was kind of building some applications; like everybody else, I was very excited about the space. And I built some stuff and I quickly realized that yeah, I could, you know, it works like pretty well a bunch of times, but like a lot of other times it really does not work as I, the developer of this tool, like, want my tool to work. And then as a developer like I can tell that there's very few tools available for me to like, get this to, you know, cooperate [00:03:00] with me, like get it to follow directions, etc. And the only tool I really have is this prompt. And there's only so, so far you can go with like, putting instructions in like caps, adding a bunch of exclamations and being like, follow my instructions. Like give me this output this way. And so I think like part of it was, you know, that it's not reliable, et cetera. But also as a user, it just, if I'm building an application for a user, I just want the user to have a, have a certain experience using it. And there's just not enough control to me, not enough, like, knobs for me to tune, you know, as a developer to do that. So Guardrails kind of like came up as a way to just like, manage this better. The tool basically, I was like, okay.
As I'm building this, I know from the ground up, like, what is the experience I want the user to have, like, what does a great LLM output look like for me? And so I wanted a tool that allows me to kind of specify that and enforce those constraints. As I was thinking of this, I was like, this should be very extensible, very flexible so that there's a bunch of use cases that can be handled, et cetera. But the need really like, kind of came up from my own, from my own, like I was basically solving for my own pain points. [00:04:00] So that's a little bit of the history, but what the tool does is that it allows you to kind of like specify. It's this two-part system where there's a specification framework and then there's like a code that enforces that specification on the LLM outputs. So the specification framework allows you to be like as coarse or as fine grained as you care about. So you can essentially think about what is the, on a very like first order basis, like what is the structure and what are the types, etc., of the output that I want, if you want structured outputs from LLMs. But you can also go like very into semantic correctness with this. I just released something this morning, which is that if you're summarizing a bunch of documents, make sure that it's a very faithful summary. Make sure that there's like coherence amongst like what the output is, et cetera. So you can have like all of these semantic guarantees as well. And Guardrails created RAIL, a reliable AI markup language, that allows you to specify that. And along with that, there's like code that backs up that specification and it makes sure that, one, you're generating prompts that are more likely to get you the output in the right manner to start out with. And then once you get that output, all of the specification criteria you entered are like [00:05:00] systematically validated and like corrected. And there's a bunch of like tools in there that allow you a lot of control to like handle failures much more gracefully. So that's in a nutshell what Guardrails does. Awesome. Alessio: And this is model agnostic. People can use it on any model. Shreya: Yeah, that's right. When I was doing my prototyping, I like was developing with like OpenAI, as I'm sure like a bunch of other developers were. But since then I've added support where you can basically like plug in any, essentially any function or any callable: as long as it has a string input and string output, you can plug it in there, and I've had people test it out with a bunch of other models and get pretty good results. Yeah. Alessio: That's awesome. Why did you start from XML instead of YAML or JSON? Shreya: Yeah. Yeah. I think it's a good question. It's also the question I get asked the most. Yes. I remember we chatted about this as well in our first chat, and I was like, wait, okay, let's get it out of the way, cause I'm sure you answer this a lot. Shreya: So the truth is I didn't start out with it. Like, I think I started out from this code first framewo
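To make the RAIL idea above concrete, here's a hedged sketch in the spirit of the early guardrails package (the tag names, validator syntax, and Guard call signature are our reconstruction of the 0.x era API and may not match current releases):

```python
# Illustrative sketch of a RAIL spec plus a Guard wrapping an LLM call.
# Tag and validator names follow the early guardrails-ai package as we recall it;
# treat the exact syntax as an assumption, not gospel.
import openai
import guardrails as gd

rail_spec = """
<rail version="0.1">
<output>
    <object name="applicant">
        <string name="name" description="Applicant's full name" />
        <integer name="age" format="valid-range: 18 120" on-fail-valid-range="reask" />
    </object>
</output>
<prompt>
Extract the applicant details from this email:
{{email_body}}
</prompt>
</rail>
"""

guard = gd.Guard.from_rail_string(rail_spec)

# The guard validates the structured output and re-asks the model on failure
raw_output, validated_output = guard(
    openai.Completion.create,
    prompt_params={"email_body": "Hi, I'm Dana Smith, 34, applying for a desk..."},
    engine="text-davinci-003",
    max_tokens=256,
)
print(validated_output)  # e.g. {"applicant": {"name": "Dana Smith", "age": 34}}
```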
Thanks to the over 42,000 latent space explorers who checked out our Replit episode! We are hosting/attending a couple more events in SF and NYC this month. See you if you're in town! Lexica.art was introduced to the world 24 hours after the release of Stable Diffusion as a search engine for prompts, gaining instant product-market fit as a world discovering generative AI also found they needed to learn prompting by example. Lexica is now 8 months old, serving 5B image searches/day, and just shipped V3 of Lexica Aperture, their own text-to-image model! Sharif Shameem breaks his podcast hiatus with us for an exclusive interview covering his journey building everything with AI! The conversation is nominally about Sharif’s journey through his three startups VectorDash, Debuild, and now Lexica, but really a deeper introspection into what it takes to be a top founder in the fastest moving tech startup scene (possibly ever) of AI. We hope you enjoy this conversation as much as we did! Full transcript is below the fold. We would really appreciate if you shared our pod with friends on Twitter, LinkedIn, Mastodon, Bluesky, or your social media poison of choice! Timestamps * [00:00] Introducing Sharif * [02:00] VectorDash * [05:00] The GPT3 Moment and Building Debuild * [09:00] Stable Diffusion and Lexica * [11:00] Lexica’s Launch & How it Works * [15:00] Being Chronically Early * [16:00] From Search to Custom Models * [17:00] AI Grant Learnings * [19:30] The Text to Image Illuminati? * [20:30] How to Learn to Train Models * [24:00] The future of Agents and Human Intervention * [29:30] GPT4 and Multimodality * [33:30] Sharif’s Startup Manual * [38:30] Lexica Aperture V1/2/3 * [40:00] Request for AI Startup - LLM Tools * [41:00] Sequencing your Genome * [42:00] Believe in Doing Great Things * [44:30] Lightning Round Show Notes * Sharif’s website, Twitter, LinkedIn * VectorDash (5x cheaper than AWS) * Debuild Insider, Fast company, MIT review, tweet, tweet * Lexica * Introducing Lexica * Lexica Stats * Aug: “God mode” search * Sep: Lexica API * Sept: Search engine with CLIP * Sept: Reverse image search * Nov: teasing Aperture * Dec: Aperture v1 * Dec: Aperture v2 * Jan 2023: Outpainting * Apr 2023: Aperture v3 * Same.energy * AI Grant * Sharif on Agents: prescient Airpods tweet, Reflection * MiniGPT4 - Sharif on Multimodality * Sharif Startup Manual * Sharif Future * 23andMe Genome Sequencing Tool: Promethease * Lightning Round * Fave AI Product: Cursor.so. Swyx ChatGPT Menubar App. * Acceleration: Multimodality of GPT4. Animated Drawings * Request for Startup: Tools for LLMs, Brex for GPT Agents * Message: Build Weird Ideas! Transcript Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host Swyx, writer and editor of Latent Space. And today we have Sharif Shameem. Welcome to the studio. Sharif: Awesome. Thanks for the invite. Swyx: Really glad to have you. [00:00] Introducing Sharif Swyx: You've been a dream guest, actually, since we started drafting guest lists for this pod. So glad we could finally make this happen. So what I like to do is usually introduce people off their LinkedIn, and then prompt you for what's not on your LinkedIn, to get a little bit of the person behind the awesome projects. So you graduated University of Maryland in CS. Sharif: So I actually didn't graduate, but I did study. Swyx: You did not graduate. You dropped out. Sharif: I did drop out.
Swyx: What was the decision behind dropping out? Sharif: So first of all, I wasn't doing too well in any of my classes. I was working on a side project that took up most of my time. Then I spoke to this guy who ended up being one of our investors. And he was like, actually, I ended up dropping out. I did YC. And my company didn't end up working out. And I returned to school and graduated along with my friends. I was like, oh, it's actually a reversible decision. And that was like that. And then I read this book called The Case Against Education by Bryan Caplan. So those two things kind of sealed the deal for me on dropping out. Swyx: Are you still on hiatus? Could you still theoretically go back? Sharif: Theoretically, probably. Yeah. Still on indefinite leave. Swyx: Then you did some work at MITRE? Sharif: MITRE, yeah. So they're lesser known. So they're technically like an FFRDC, a federally funded research and development center. So they're kind of like a large government contractor, but nonprofit. Yeah, I did some computer vision work there as well. [02:00] VectorDash Swyx: But it seems like you always have an independent founder bone in you. Because then you started working on VectorDash, which is distributed GPUs. Sharif: Yes. Yeah. So VectorDash was a really fun project that we ended up working on for a while. So while I was at MITRE, I had a friend who was mining Ethereum. This was, I think, 2016 or 2017. Oh my God. Yeah. And he was mining on his NVIDIA 1080Ti, making around like five or six dollars a day. And I was trying to train a character recurrent neural network, like a character RNN, on my iMessage text messages to make it like a chatbot. Because I was just curious if I could do it. Because iMessage stores all your past messages from years ago in a SQL database, which is pretty nifty. But I wanted to train it. And I needed a GPU. And it was, I think, $60 to $80 for a T4 on AWS, which is really slow compared to a 1080Ti. If you normalize the cost and performance versus the 1080Ti when someone's mining Ethereum, it's like a 20x difference. So I was like, hey, his name was Alex. Alex, I'll give you like 10 bucks if you let me borrow your 1080Ti for a week. I'll give you 10 bucks per day. And it was like 70 bucks. And I used it to train my model. And it worked great. The model was really bad, but the whole trade worked really great. I got a really high performance GPU to train my model on. He got much more than he was making by mining Ethereum. So we had this idea. I was like, hey, what if we built this marketplace where people could rent out their GPUs while they're mining cryptocurrency, and machine learning researchers could just rent them and pay a lot cheaper than they would pay AWS. And it worked pretty well. We launched in a few months. We had over 120,000 NVIDIA GPUs on the platform. And then we were the cheapest GPU cloud provider for like a solid year or so. You could rent a pretty solid GPU for like 20 cents an hour. And cryptocurrency miners were making more than they would make mining crypto because this was after the Ethereum crash. And yeah, it was pretty cool. It just turns out that a lot of our customers were college students and researchers who didn't have much money. And they weren't necessarily the best customers to have as a business. Startups had a ton of credits and larger companies were like, actually, we don't really trust you with our data, which makes sense. Yeah, we ended up pivoting that to becoming a cloud GPU provider for video games.
So we would stream games from our GPUs. Oftentimes, like, many were located just a few blocks away from you, because we had the lowest latency of any cloud GPU provider, even lower than like AWS and sometimes Cloudflare. And we decided to build a cloud gaming platform where you could pretty much play your own games on the GPU and then stream it back to your Mac or PC. Swyx: So Stadia before Stadia. Sharif: Yeah, Stadia before Stadia. It's like a year or so before Stadia. Swyx: Wow. Weren't you jealous of, I mean, I don't know, it sounds like Stadia could have bought you or Google could have bought you for Stadia and that never happened? Sharif: It never happened. Yeah, it didn't end up working out for a few reasons. The biggest thing was internet bandwidth. So a lot of the hosts, the GPU hosts had lots of GPUs, but average upload bandwidth in the United States is only 35 megabits per second, I think. And like a 4K stream needs like a minimum of 15 to 20 megabits per second. So you could really only utilize one of those GPUs, even if they had like 60 or 100. [05:00] The GPT3 Moment and Building Debuild Swyx: And then you went to Debuild in July 2020, is the date that I have. I'm actually kind of just curious, like what was your GPT-3 aha moment? When were you like GPT-3-pilled? Sharif: Okay, so I first heard about it because I was also working on another chatbot. So this was like after, like everything ties back to this chatbot I'm trying to make. This was after working on VectorDash. I was just like hacking on random projects. I wanted to make the chatbot using not really GPT-2, but rather just like it would be pre-programmed. It was pretty much you would give it a goal and then it would ask you throughout the week how much progress you're making to that goal. So take your unstructured response, usually a reply to a text message, and then it would like, plot it for you in like a table and you could see your progress over time. It could be for running or tracking calories. But I wanted to use GPT-3 to make it seem more natural, because I remember someone on Bookface, which is still YC's internal forum. They posted and they were like, OpenAI just released AGI and it's GPT-3. I asked it like a bunch of logic puzzles and it solved them all perfectly. And I was like, what? How's no one else talking about this? Like this is either like the greatest thing ever that everyone is missing or like it's not that good. So like I tweeted out if anyone could get me access to it. A few hours later, Greg Brockman responded. Swyx: He is everywhere. Sharif: He's great. Yeah, he's on top of things. And yeah, by that afternoon, I was like messing around with the API and I was like, wow, this is incredible. You could chat with fake people or people that have passed away. You could like, I remember the first conversation I did was this is a chat with Steve Jobs a
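One aside for tinkerers: the iMessage training corpus trick Sharif mentions works because macOS really does keep your message history in a local SQLite file. A minimal sketch, assuming the usual chat.db location and message table (Apple internals that can change between macOS versions):

```python
# Hedged sketch of pulling iMessage history out of macOS's local SQLite store
# to build a character-level training corpus, per Sharif's chatbot experiment.
# The chat.db location and schema are undocumented Apple internals.
import sqlite3
from pathlib import Path

db_path = Path.home() / "Library" / "Messages" / "chat.db"
conn = sqlite3.connect(db_path)

rows = conn.execute(
    "SELECT text FROM message WHERE text IS NOT NULL ORDER BY date"
).fetchall()
corpus = "\n".join(text for (text,) in rows)
print(f"{len(corpus):,} characters of chat history for the char-RNN")
```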
It’s now almost 6 months since Google declared Code Red, and the results — Jeff Dean’s recap of 2022 achievements and a mass exodus of the top research talent that contributed to it in January, Bard’s rushed launch in Feb, a slick video showing Google Workspace AI features and confusing doubly linked blogposts about PaLM API in March, and merging Google Brain and DeepMind in April — have not been inspiring. Google’s internal panic is on full display now with the surfacing of a well written memo by software engineer Luke Sernau, written in early April, revealing internal distress not seen since Steve Yegge’s infamous Google Platforms Rant. Similar to 2011, the company’s response to an external challenge has been to mobilize the entire company to go all-in on a (from the outside) vague vision. Google’s misfortunes are well understood by now, but the last paragraph of the memo: “We have no moat, and neither does OpenAI”, was a banger of a mic drop. Combine this with news this morning that OpenAI lost $540m last year and will need as much as $100b more funding (after the complex $10b Microsoft deal in Jan), and the memo’s assertion that both Google and OpenAI have “no moat” against the mighty open source horde has gained some credibility in the past 24 hours. Many are criticising this memo privately: * A CEO commented to me yesterday that Luke Sernau does not seem to work in AI related parts of Google and “software engineers don’t understand moats”. * Emad Mostaque, himself a perma-champion of open source and open models, has repeatedly stated that “Closed models will always outperform open models” because closed models can just wrap open ones. * Emad has also commented on the moats he does see: “Unique usage data, Unique content, Unique talent, Unique product, Unique business model”, most of which Google does have, and OpenAI less so (though it is winning on the talent front) * Sam Altman famously said that “very few to no one in Silicon Valley has a moat - not even Facebook” (implying that moats don’t actually matter, and you should spend your time thinking about more important things) * It is not actually clear what race the memo thinks Google and OpenAI are in vs Open Source. Neither are particularly concerned about running models locally on phones, and they are perfectly happy to let “a crazy European alpha male” run the last mile for them while they build actually monetizable cloud infrastructure. However moats are of intense interest to everybody keen on productized AI, cropping up in every Harvey, Jasper, and general AI startup vs incumbent debate. It is also interesting to take the memo at face value and discuss the searing hot pace of AI progress in open source. We hosted this discussion yesterday with Simon Willison, who apart from being an incredible communicator also wrote a great recap of the No Moat memo. 2,800 have now tuned in on Twitter Spaces, but we have taken the audio and cleaned it up here. Enjoy! Timestamps * [00:00:00] Introducing the Google Memo * [00:02:48] Open Source > Closed?
* [00:05:51] Running Models On Device * [00:07:52] LoRA part 1 * [00:08:42] On Moats - Size, Data * [00:11:34] Open Source Models are Comparable on Data * [00:13:04] Stackable LoRA * [00:19:44] The Need for Special Purpose Optimized Models * [00:21:12] Modular - Mojo from Chris Lattner * [00:23:33] The Promise of Language Supersets * [00:28:44] Google AI Strategy * [00:29:58] Zuck Releasing LLaMA * [00:30:42] Google Origin Confirmed * [00:30:57] Google's existential threat * [00:32:24] Non-Fiction AI Safety ("y-risk") * [00:35:17] Prompt Injection * [00:36:00] Google vs OpenAI * [00:41:04] Personal plugs: Simon and Travis Transcripts [00:00:00] Introducing the Google Memo [00:00:00] Simon Willison: So, yeah, this is a document which I first saw at three o'clock this morning, I think. It claims to be leaked from Google. There are good reasons to believe it is leaked from Google, and to be honest, if it's not, it doesn't actually matter because the quality of the analysis, I think, stands alone. [00:00:15] If this was just a document by some anonymous person, I'd still think it was interesting and worth discussing. And the title of the document is We Have No Moat and neither does OpenAI. And the argument it makes is that while Google and OpenAI have been competing on training bigger and bigger language models, the open source community is already starting to outrun them, given only a couple of months of really like really, really serious activity. [00:00:41] You know, Facebook LLaMA was the thing that really kicked us off. There were open source language models like BLOOM before that, some GPT-J, and they weren't very impressive. Like nobody was really thinking that they were ChatGPT equivalent. Facebook LLaMA came out in March, I think March 15th. And was the first one that really sort of showed signs of being as capable maybe as ChatGPT. [00:01:04] My, I don't, I think all of these models, they've been, the analysis of them has tended to be a bit hyped. Like I don't think any of them are even quite up to GPT-3.5 standards yet, but they're within spitting distance in some respects. So anyway, LLaMA came out and then, two weeks later, Stanford Alpaca came out, which was fine tuned on top of LLaMA and was a massive leap forward in terms of quality. [00:01:27] And then a week after that Vicuna came out, which is, to this date, the, the best model I've been able to run on my own hardware. It runs on my mobile phone now, like, it's astonishing how little resources you need to run these things. But anyway, the, the argument that this paper made, which I found very convincing, is it only took open source two months to get this far. [00:01:47] Now every researcher in the world is kicking it on to new, new things, but it feels like there are problems that Google has been trying to solve that the open source models are already addressing, and really how do you compete with that, like with your closed ecosystem, how are you going to beat these open models with all of this innovation going on? [00:02:04] But then the most interesting argument in there is it talks about the size of models and says that maybe large isn't a competitive advantage, maybe actually a smaller model, with lots of like different people fine tuning it and having these sort of, these LoRA stackable fine tuning innovations on top of it, maybe those can move faster.
[00:02:23] And actually having to retrain your giant model every few months from scratch is, is way less useful than having small models that you can tr, you can fine tune in a couple of hours on a laptop. So it's, it's fascinating. I basically, if you haven't read this thing, you should read every word of it. It's not very long. [00:02:40] It's beautifully written. Like it's, it's, I mean, if you try and find the quotable lines in it, almost every line of it's quotable. Yeah. So, yeah, that's that, that, that's the status of this thing. [00:02:48] Open Source > Closed? [00:02:48] swyx: That's a wonderful summary, Simon. Yeah, there, there's so many angles we can take to this. I, I'll just observe one, one thing which, if you think about the open versus closed narrative, Emad Mostaque, who is the CEO of Stability, his line has always been that open will trail behind closed, because the closed alternatives can always take [00:03:08] learnings and lessons from open source. And this is the first highly credible statement that is basically saying the exact opposite, that open source is moving faster than, than, than closed source. And they are scared. They seem to be scared. Which is interesting, Travis. [00:03:22] Travis Fischer: Yeah, the, the, the, a few things that, that I'll, I'll, I'll say: the only thing which can keep up with the pace of AI these days is open source. [00:03:32] I think we're, we're seeing that unfold in real time before our eyes. And, you know, I, I think the other interesting angle of this is to some degree LLMs are, they, they don't really have switching costs. They are going to be, become commoditized. At least that's, that's what a lot of, a lot of people kind of think. To, to what extent is it, is it a, a race in terms of, of pricing of these things? [00:03:55] And they all kind of become roughly the, the, the same in, in terms of their, their underlying abilities. And, and open source is gonna, gonna be actively pushing, pushing that forward. And, and then this is kind of coming from, if it is to be believed, the kind of Google or an insider type mentality around, you know, where is the actual competitive advantage? [00:04:14] What should they be focusing on? How can they get back in, into the game? When you know, when, when, when, when currently the, the, the external view of, of Google is that they're kind of spinning their wheels and they have this code red, and it's like they're, they're playing catch up already. [00:04:28] Like how could they use the open source community and work with them, which is gonna be really, really hard, you know, from a structural perspective given Google's place in the ecosystem. But a, a lot, lot, a lot of jumping off points there. [00:04:42] Alessio Fanelli: I was gonna say, I think the post is really focused on how do we get the best model, but it's not focused on like, how do we build the best product around it. [00:04:50] A lot of these models are limited by how many GPUs you can get to run them and we've seen in traditional open source, like everybody can use some of these projects like Kafka and like Elastic for free. But the reality is that not everybody can afford to run the infrastructure needed for it. [00:05:05] So I, I think like the main takeaway that I have from this is like, a lot of the moats are probably around just getting the, the sand, so to speak, and having the GPUs to actually serve these models. Because even if the best model is
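Since stackable LoRA is the hinge of the memo's argument, here's what it looks like in practice: freeze the base model and train only small low-rank adapters. A minimal sketch with Hugging Face's peft library; the Pythia checkpoint and target module names are illustrative, not from the memo:

```python
# Minimal LoRA sketch: freeze the base model, train only small rank-r adapters.
# Pythia is illustrative; any causal LM with known attention module names works.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")
lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["query_key_value"],  # GPT-NeoX-style attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base weights

# Train as usual; only the adapter weights update, which is why a laptop-scale
# fine-tune is plausible, and why adapters for different tasks can be swapped
# or composed on top of one shared base model.
```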
Latent Space is popping off! Welcome to the over 8500 latent space explorers who have joined us. Join us this month at various events in SF and NYC, or start your own! This post spent 22 hours at the top of Hacker News. As announced during their Developer Day celebrating their $100m fundraise following their Google partnership, Replit is now open sourcing its own state of the art code LLM: replit-code-v1-3b (model card, HF Space), which beats OpenAI’s Codex model on the industry standard HumanEval benchmark when finetuned on Replit data (despite being 77% smaller) and more importantly passes AmjadEval (we’ll explain!) We got an exclusive interview with Reza Shabani, Replit’s Head of AI, to tell the story of Replit’s journey into building a data platform, building GhostWriter, and now training their own LLM, for 22 million developers! 8 minutes of this discussion go into a live demo discussing generated code samples - which is always awkward on audio. So we’ve again gone multimodal and put up a screen recording here where you can follow along on the code samples! Recorded in-person at the beautiful StudioPod studios in San Francisco. Full transcript is below the fold. We would really appreciate if you shared our pod with friends on Twitter, LinkedIn, Mastodon, Bluesky, or your social media poison of choice! Timestamps * [00:00:21] Introducing Reza * [00:01:49] Quantitative Finance and Data Engineering * [00:11:23] From Data to AI at Replit * [00:17:26] Replit GhostWriter * [00:20:31] Benchmarking Code LLMs * [00:23:06] AmjadEval live demo * [00:31:21] Aligning Models on Vibes * [00:33:04] Beyond Chat & Code Completion * [00:35:50] Ghostwriter Autonomous Agent * [00:38:47] Releasing Replit-code-v1-3b * [00:43:38] The YOLO training run * [00:49:49] Scaling Laws: from Kaplan to Chinchilla to LLaMA * [00:52:43] MosaicML * [00:55:36] Replit's Plans for the Future (and Hiring!) * [00:59:05] Lightning Round Show Notes * Reza Shabani on Twitter and LinkedIn * also Michele Catasta and Madhav Singhal * Michele Catasta’s thread on the release of replit-code-v1-3b * Intro to Replit Ghostwriter * Replit Ghostwriter Chat and Building Ghostwriter Chat * Reza on how to train your own LLMs (their top blog of all time) * Our Benchmarks 101 episode where we discussed HumanEval * AmjadEval live demo * Nat.dev * MosaicML CEO Naveen Rao on Replit’s LLM * MosaicML Composer + FSDP code * Replit’s AI team is hiring in North America timezone - Fullstack engineer, Applied AI/ML, and other roles! Transcript [00:00:00] Alessio Fanelli: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host, swyx, writer and editor of Latent Space. [00:00:21] Introducing Reza [00:00:21] swyx: Hey and today we have Reza Shabani, Head of AI at Replit. Welcome to the studio. Thank you. Thank you for having me. So we try to introduce people's bios so you don't have to repeat yourself, but then also get a personal side of you. [00:00:34] You got your PhD in econ from Berkeley, and then you were a startup founder for a bit, and, and then you went into systematic equity trading at BlackRock and Wellington. And then something happened and now you're Head of AI at Replit. What should people know about you that might not be apparent on LinkedIn? [00:00:50] One thing [00:00:51] Reza Shabani: that comes up pretty often is whether I know how to code. Yeah, you'd be shocked. A lot of people are kind of like, do you know how to code?
When I was talking to Amjad about this role, I'd originally talked to him, I think, about a product role and, and didn't get it. Then he was like, well, I know you've done a bunch of data and analytics stuff. [00:01:07] We need someone to work on that. And I was like, sure, I'll, I'll do it. And he was like, okay, but you might have to know how to code. And I was like, yeah, yeah, I, I know how to code. So I think that just kind of surprises people coming from like an econ background. Yeah. People are always kind of like, wait, even when people join Replit, they're like, wait, does this guy actually know how to code? [00:01:28] Is he actually technical? Yeah. [00:01:30] swyx: You did a bunch of number crunching at top financial companies and it still wasn't [00:01:34] Reza Shabani: obvious. Yeah. Yeah. I mean, I, I think someone like in a software engineering background, cuz you think of finance and you think of like calling people to get the deal done and that type of thing. [00:01:43] No, it's, it's not that; as, as you know, it's very, very quantitative. Especially what I did in, in finance, very quantitative. [00:01:49] Quantitative Finance and Data Engineering [00:01:49] swyx: Yeah, so we can cover a little bit of that and then go into the rapid journey. So as, as you, as you know, I was also a quantitative trader on the sell side and the buy side. And yeah, I actually learned Python there. [00:02:01] I learned my, I wrote my own data pipelines there before Airflow was a thing, and it was just me writing and running notebooks and not version controlling them. And it was a complete mess, but we were managing a billion dollars on, on my crappy code. Yeah, yeah. What was it like for you? [00:02:17] Reza Shabani: I guess somewhat similar. [00:02:18] I, I started the journey during grad school, so during my PhD, and my PhD was in economics and it was always on the more data intensive kind of applied economic side. And, and specifically financial economics. And so what I did for my dissertation: I recorded CNBC, the financial news network, for 10 hours a day, every day. [00:02:39] Extracted the closed captions from the video files and then used that to create a second by second transcript of, of CNBC, merged that with high frequency trading quote data and then looked at, you know, went in and did some, some NLP, tagging the company names, and, and then looked at the price response or the change in price and trading volume in the seconds after a company was mentioned. [00:03:01] And, and this was back in 2009 that I was doing this. So before cloud, before, before a lot of Python actually. And, and definitely before any of these packages were available to make this stuff easy. And that's where, where I had to really learn to code, like outside of, you know, any kind of like data programming languages. [00:03:21] That's when I had to learn Python and had to learn all, all of these other skills to work with data at that, at that scale. So then, you know, I thought I wanted to do academia. I did terrible on the academic market because everyone looked at my dissertation. They're like, this is cool, but this isn't economics. [00:03:37] And everyone in the computer science department was actually way more interested in it. Like I, I hung out there more than in the econ department and, you know, didn't get a single academic offer. Had two offers. I think I only applied to like two industry jobs and got offers from both of them. [00:03:53] They, they saw value in it.
One of them was BlackRock, and I turned it down to, to do my own startup, and then went crawling back two and a half years later after the startup failed. [00:04:02] swyx: Something on your LinkedIn was like you're trading Chinese news tickers or something. Oh, yeah. I forget, [00:04:07] Reza Shabani: forget what that was. [00:04:08] Yeah, I mean oh. There, there was so much stuff. Honestly, like, so systematic active equity at, at BlackRock was such an amazing group, and you just end up learning so much, and the, and the possibilities there. Like when you, when you go in and you learn the types of things that they've been trading on for years, you know, like a paper will come out in academia and they're like, did you know you can use like this data on searches to predict the price of cars? [00:04:33] And it's like, you go in and they've been trading on that for like eight years. Yeah. So they're, they're really ahead of the curve on, on all of that stuff. And the really interesting stuff that I, that I found when I went in was all like related to NLP and ML: a lot of like transcript data, a lot of like parsing through the types of things that companies talk about, whether in analyst reports, conference calls, earnings reports, and the devil's really in the details about like how you make sense of, of that information in a way that, you know, gives you insight into what the company's doing and, and where the market is, is going. [00:05:08] I don't know if we can like nerd out on specific strategies. Yes. Let's go, let's go. What, so one of my favorite strategies that, because it never, I don't think we ended up trading on it, so I can probably talk about it. And it, it just kind of shows like the kind of work that you do around this data. [00:05:23] It was called emerging technologies. And so the whole idea is that there's always a new set of emerging technologies coming onto the market and the companies that are ahead of that curve and stay up to date on, on the latest trends are gonna outperform their, their competitors. [00:05:38] And that's gonna reflect in the, in the stock price. So when you have a theory like that, how do you actually turn that into a trading strategy? So what we ended up doing is, well first you have to, to determine what are the emergent technologies, like what are the new up and coming technologies. [00:05:56] And so we actually went and pulled data on startups. And so there's like startups in Silicon Valley. You have all these descriptions of what they do, and you get that, that corpus of like when startups were getting funding. And then you can run non-negative matrix factorization on it and create these clusters of like what the various emerging technologies are, and you have this all the way going back and you have like social media back in like 2008 when Facebook was, was blowing up. [00:06:21] And, and you have things like mobile and digital advertising and and
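If you want to kick the tires on the model discussed above, here's a minimal sketch of sampling from replit-code-v1-3b, assuming the replit/replit-code-v1-3b checkpoint on Hugging Face (it ships custom tokenizer and model code, hence trust_remote_code; the sampling settings are our defaults, not Replit's):

```python
# Hedged sketch: sampling a code completion from replit-code-v1-3b.
# Assumes the replit/replit-code-v1-3b HF checkpoint and its bundled custom code.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "replit/replit-code-v1-3b"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```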
The race is on for the first fully GPT3/4-equivalent, truly open source Foundation Model! LLaMA’s release proved that a great model could be released and run on consumer-grade hardware (see llama.cpp), but its research license prohibits businesses from running it and all its variants (Alpaca, Vicuna, Koala, etc) for their own use at work. So there is great interest and desire for *truly* open source LLMs that are feasible for commercial use (with far better customization, finetuning, and privacy than the closed source LLM APIs). The previous leading contenders were Eleuther’s GPT-J and Neo on the small end, and FLAN (137B), PaLM (540B), and BigScience’s BLOOM (176B) on the high end. But Databricks is to my knowledge the first to release not just a cleanly licensed, high quality LLM that can run on affordable devices, but also a simple Databricks notebook that can be customized to be finetuned for your data/desired style - for $30 in 30 minutes on one machine! Mike Conover tells the story of how a small team of Applied AI engineers convinced Ali Ghodsi and 5,000 of their coworkers to join in the adventure of building the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use. He also indulges our questions on other recent open source LLM projects, CerebrasGPT and RedPajama, though we recorded this a week before Stability’s StableLM release. Stick around to the end for some easter eggs featuring AI Drake! Recorded in-person at the beautiful StudioPod studios in San Francisco. Full transcript is below the fold. Show Notes * Mike Conover LinkedIn and Twitter * Dolly 1.0 * Dolly 2.0 * CICERO and Diplomacy * Dolly and Deepspeed * LLMops: * https://nat.dev/ * PromptLayer * HumanLoop * Spreadsheets?? * Quadratic * Alessio’s Email GPT Drafter * Open Models * Open Assistant * Cerebras GPT * RedPajama * Reflexion, Recursive Criticism and Improvement * Lightning Round * AI Product: Google Maps * AI People: EleutherAI, Huggingface’s Stas Bekman * AI Prediction: Open LLaMA reproduction, AI Twins of People (AI Drake), Valuing Perplexity * Request for Startups: LLMOps/Benchmarks, Trail Mapping Timestamps * [00:00:21] Introducing Mike Conover * [00:03:10] Dolly 1.0 * [00:04:18] Making Dolly * [00:06:12] Dolly 2.0 * [00:09:28] Gamifying Instruction Tuning * [00:11:36] Summarization - Thumbnails for Language * [00:15:11] CICERO and Geopolitical AI Agents * [00:17:09] Datasets vs Intentional Design * [00:21:44] Biological Basis of AI * [00:23:27] Training Your Own LLMs * [00:28:21] You May Not Need a Large Model * [00:29:59] Good LLM Use cases * [00:31:33] Dolly Cost $30 on Databricks * [00:36:06] Databricks Open Source * [00:37:31] LLMOps and Prompt Tooling * [00:42:26] "I'm a Sheets Maxi" * [00:44:19] AI and Workplace Productivity * [00:47:02] OpenAssistant * [00:47:41] CerebrasGPT * [00:51:35] RedPajama * [00:54:07] Why Dolly > OpenAI GPT * [00:56:19] Open Source Licensing for AI Models * [00:57:09] Why Open Source Models? * [00:58:05] Moving Models * [01:00:34] Learning in a Simulation * [01:01:28] Why Model Reflexion and Self Criticism Works * [01:03:51] Lightning Round Transcripts [00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host swyx, writer and editor of Latent Space. Welcome, Mike. [00:00:21] Introducing Mike Conover [00:00:21] Hey, pleasure to be here.
Yeah, so [00:00:23] we tend to try to introduce you so that you don't have to introduce yourself. Yep. [00:00:27] But then we also ask you to fill in the blanks. So you are currently a, uh, staff software engineer at Databricks. Uh, but you got your PhD at Indiana University Bloomington in complex systems analysis, where you did some, uh, analysis of clusters on, on Twitter, which I found pretty interesting. [00:00:43] Yeah. Uh, I highly recommend people checking that out if you're interested in getting information from indirect sources, or I, I don't know how you describe it. Yes. Yeah. And then you went to LinkedIn working on homepage news relevance, and then SkipFlag, which is a smart enterprise knowledge graph, which was then acquired, uh, by Workday, where you became director of machine learning engineering, and now you're at Databricks. [00:01:06] So that's the quick bio and we can kind of go over, yeah, step by step. But, uh, what's not on your LinkedIn that people [00:01:12] should know about you? So, because I worked at LinkedIn, that's actually how new hires introduce themselves at LinkedIn, is this question. So I, okay, I have a pat answer to it. Uhhuh. Um, I love getting off trail in the backcountry. [00:01:25] Okay. And I, you know, I think that the sort of like radical responsibility associated with that clarifies the mind. And I think that the, the things that I really like about machine learning engineering and sort of the topology of high-dimensional spaces kind of manifest when you think about a topographic map as a contour plot. [00:01:44] You know, it's a two-dimensional projection of a three-dimensional space and it's very much like looking at information visualizations, and you're trying to relate your localized perception of the environment around you and the contours of, uh, ridges that you see, or basins that you might go into and you're like, there's that little creek down there. [00:02:04] And relate that to the projection that you see on the map. I think it's physically demanding. It's intellectually challenging. Natural beauty is a big part of it, and you're generally spending time with friends, and so I just, I love that. I love that these are camping trips. Uh, multi-day. Yeah. Yeah. [00:02:21] Camping. I, I hunt too, you know, I, um, shoot archery, um, big game back country hunting, but yeah. You know, sometimes it's just, let's take a walk in the woods and see where it goes. [00:02:32] Oh yeah. You ever think about going on one of those, um, journeys in the, uh, the Australian Outback? Like where people find themselves? [00:02:40] I'm [00:02:40] a mountain. I'm a mountain guy. I like to... You're a mountain guy. I like to fly fish. I like to, you like to hill climb? Yeah. Like the Outback seems beautiful. I think eight of the 10 most deadly snakes live in Australia. Like I'm, uh, yeah, you're good. You're good. Yeah. Yeah. [00:02:52] Yeah. Any lessons from like, real hill climbing [00:02:55] versus machine learning hill climbing? [00:02:56] Great, dude. It's a lot like gradient descent. Yeah, for sure, man. Um, yeah, I, I have remarked on that to myself before for sure. Yeah, I don't, I'm not sure. This is like least resistance, please. [00:03:10] Dolly 1.0 [00:03:10] That's awesome. So Dolly, you know, it's kind of come up in the last three weeks; you went from a brand new project at Databricks to one of the hottest open source things out there. [00:03:19] So March 24th you had Dolly 1.0.
It was a 6 billion parameter model based on GPT-J 6B, and you used the Alpaca training set to train it. First question is, why did you start with GPT-J instead of LLaMA, which was what everybody else was kind of starting from [00:03:34] at the time. Yeah, well, I mean, so, you know, we had talked about this a little before the show, but LLaMA's hard to get. [00:03:40] We had requested the model weights and had just not heard back. And you know, I think our experience with the, um, the original email alias for Dolly, before it was available on Hugging Face: you get hundreds of people asking for it, and I think it's like, it's easy to just not be able to handle the inbound. [00:03:56] Mm-hmm. And so like, I mean, there was a practical consideration, which is that, you know, we did not have the LLaMA weights, but additionally I think it's like much more interesting if anybody can build it. Right. And so I think that was our, um, and I had worked with the GPT-J model in the past and, and knew it to be high quality from a grammaticalness standpoint. [00:04:15] And so I think it was a reasonable choice. Mm-hmm. Yeah. [00:04:18] Making Dolly [00:04:18] Yeah. Maybe we should, we can also go into the impetus of why you started work on Dolly. Uh, you had been at Databricks for about a year. Mm-hmm. Was there, was this like a top-down directive? Was this your idea? Or, uh, [00:04:31] what happened? I've been working in NLP and language understanding for a fair while now. [00:04:36] I mean certainly since SkipFlag back in 2016, 2017. We can introduce SkipFlag, if that's, sorry. You know, we don't have to focus too much on it, but like, this is a, an area, how information moves through networks of people, that is a longstanding interest of mine. And we built a hack day project and I just Slacked it to our CEO, and I was, you know, this was when ChatGPT came out, and it was an integration into the developer experience. [00:05:02] And I was like, as a user, this should exist. I want this. Mm-hmm. We should build this. It doesn't have to be us. And I mean, our, uh, our leadership team is like 10 years into this journey, probably more than that, at Databricks. And they are still so hungry. It's wild. It's just wild to see these, these people in action, you know, this like this far into the marathon. [00:05:23] And, um, he's like, great, build it, go make it. So, you know, and I, we had, have, uh, full-time responsibilities in infrastructure forecasting and infrastructure optimization. And so we did, you know, and, um, we just started building and, you know, so we'd been working on this class of technologies for, um, several months. [00:05:46] And we had a stack that was in part how we were able to kind of pivot on the balls of our feet. Uh, we repurposed a lot of existing code that we had built up, you know, in the past several quarters, um, to, to create Dolly, and, and just to [00:05:58] be clear, like is this an internal stack or is this, u
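For the finished product, a hedged sketch of running Dolly 2.0 with the custom instruct pipeline Databricks ships alongside the checkpoint, following their model card; we use the small dolly-v2-3b variant here so it fits on modest hardware, and details may have drifted since release:

```python
# Hedged sketch of Dolly 2.0 inference, per Databricks' model card.
# dolly-v2-3b keeps memory needs modest; swap in dolly-v2-12b if you can.
# The checkpoint ships its own instruct pipeline code, hence trust_remote_code.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

res = generate_text("Explain instruction tuning to a data engineer in two sentences.")
print(res[0]["generated_text"])
```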
The most recent YCombinator W23 batch graduated 59 companies building with Generative AI for everything from sales, support, engineering, data, and more. Many of these B2B startups will be seeking to establish an AI foothold in the enterprise. As they look to recent success, they will find Glean, started in 2019 by a group of ex-Googlers to finally solve AI-enabled enterprise search. In 2022 Sequoia led their Series C at a $1b valuation, and Glean has just refreshed their website touting new logos across Databricks, Canva, Confluent, Duolingo, Samsara, and more in the Fortune 50, and announcing Enterprise-ready AI features including AI answers, Expert detection, and In-context recommendations. We talked to Deedy Das, Founding Engineer at Glean and a former Tech Lead on Google Search, on why he thinks many of these startups are solutions looking for problems, and how Glean’s holistic approach to enterprise problem solving has brought so much success. Deedy is also just a fascinating commentator on AI current events, being both extremely qualified and great at distilling insights, so we also went over his many viral tweets diving into Google’s competitive threats, AI Startup investing, and his exposure of Indian University Exam Fraud! Show Notes * Deedy on LinkedIn and Twitter and Personal Site * Glean * Glean and Google Moma * Golinks.io * Deedy on Google vs ChatGPT * Deedy on Google Ad Revenue * Deedy on How much does it cost to train a state-of-the-art foundational LLM? * Deedy on Google LaMDA cost * Deedy’s Indian Exam Fraud Story * Lightning Round * Favorite Products: (covered in segment) * Favorite AI People: AI Pub * Predictions: Models will get faster for the same quality * Request for Products: Hybrid Email Autoresponder * Parting Takeaway: Read the research! Timestamps * [00:00:21] Introducing Deedy * [00:02:27] Introducing Glean * [00:05:41] From Syntactic to Semantic Search * [00:09:39] Why Employee Portals * [00:12:01] The Requirements of Good Enterprise Search * [00:15:26] Glean Chat? * [00:15:53] Google vs ChatGPT * [00:19:47] Search Issues: Freshness * [00:20:49] Search Issues: Ad Revenue * [00:23:17] Search Issues: Latency * [00:24:42] Search Issues: Accuracy * [00:26:24] Search Issues: Tool Use * [00:28:52] Other AI Search takes: Perplexity and Neeva * [00:30:05] Why Document QA will Struggle * [00:33:18] Investing in AI Startups * [00:35:21] Actually Interesting Ideas in AI * [00:38:13] Harry Potter IRL * [00:39:23] AI Infra Cost Math * [00:43:04] Open Source LLMs * [00:46:45] Other Modalities * [00:48:09] Exam Fraud and Generated Text Detection * [00:58:01] Lightning Round Transcript [00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my cohost swyx, writer and editor of [00:00:19] Latent Space. Yeah. Awesome. [00:00:21] Introducing Deedy [00:00:21] And today we have a special guest. It's Deedy Das from Glean. Uh, do you go by Deedy or Debarghya? I go by Deedy. Okay. [00:00:30] Uh, it's, it's a little bit easier for the rest of us to, uh, to, to spell out. And so what we typically do is I'll introduce you based on your LinkedIn profile, and then you can fill in what's not on your LinkedIn. So, uh, you graduated your bachelor's and master's in CS from Cornell. Then you worked at Facebook and then Google on search, specifically search, uh, and also leading a sports team focusing on cricket. [00:00:50] That's something that we, we can dive into.
Um, and then you moved over to Glean, which is now a search unicorn building intelligent search for the workplace. What's not on your LinkedIn that people should know about you? Firstly, [00:01:01] guys, it's a pleasure. Pleasure to be here. Thank you so much for having me. [00:01:04] What's not on my LinkedIn is probably everything that's non-professional. I think the biggest ones are I'm a huge movie buff and I love reading, so I think I get through, usually I like to get through 10 books ish a year, but I hate people who count books, so I shouldn't say the number. And increasingly, I don't like reading non-fiction books. [00:01:26] I actually do prefer reading fiction books purely for pleasure and entertainment. I think that's the biggest omission from my LinkedIn. [00:01:34] What, what's, what's something that, uh, caught your eye for fiction stuff that you would recommend people? [00:01:38] Oh, I recently, we started reading the Three Body Problem and I finished it and it's a three part series. [00:01:45] And, uh, well, my controversial take is I did not really enjoy the second part, and so I just stopped. But the first book was phenomenal. Great concept. I didn't know you could write alien fiction with physics so well. And Chinese literature in particular has a very different cadence to it than Western literature. [00:02:03] It's much less about the, um, let's describe people and what they're all about and their likes and dislikes. And it's like, here's a person, he's a professor of physics. That's all you need to know about him. Let's continue with the story. Um, and, and I, I, I, I enjoy it. It's a very different style from, from what I'm used to. [00:02:21] Yeah, I, I heard it's, uh, very highly recommended. I think it's being adapted into a TV show, so looking forward [00:02:26] to that. [00:02:27] Introducing Glean [00:02:27] Uh, so you've spent now almost four years at Glean. The company's now a unicorn, but you were on the founding team, and LLMs and chat interfaces are all the rage now. But you were building this before [00:02:38] it was cool, so to speak. Maybe tell us more about the story, how it became, and some of the technological advances you've seen. Because I think you started, the company started really close to some of the early GPT models. Uh, so you've seen a lot of it from, from day one. [00:02:53] Yeah. Well, the first thing I'll say is Glean was never started to be a [00:02:58] technical product looking for a solution. We always wanted to solve a very critical problem first that we saw, not only in the companies that we'd worked in before, but in all of the companies that a lot of our, uh, a lot of the founding team had been in past their time at Google. So Google has a really neat tool that already kind of does this internally. [00:03:18] It's called MoMA, and MoMA sort of indexes everything that you'd use inside Google, because they have first party API access to who has permissions to what document and what documents exist, and they rank them with their internal search tool. It's one of those things where when you're at Google, you sort of take it for granted, but when you leave and go anywhere else, you're like, oh my God, how do I function without being able to find things that I've worked on? [00:03:42] Like, oh, I remember this guy had a presentation that he made three meetings ago and I don't remember anything about it. I don't know where he shared it.
I don't know if he shared it, but I do know it was, uh, something about X, and I kind of wanna find that now. So that's the core information retrieval problem that we had set out to tackle, and we realized when we started looking at this problem that enterprise search is actually not new. [00:04:08] People have been trying to tackle enterprise search for decades. Again, pre-2000s, people were trying to build these on-prem enterprise search systems. But a few things have really allowed us to build it well. A, you now have distributed Elastic, so that really helps you do a lot of the heavy lifting on core infra. [00:04:28] But B, you also now have API support that's really nuanced on all of the SaaS apps that you use. So back in the day, it was really difficult to integrate with a messaging app. It didn't have an API; it didn't have any way to sort of get the permissions information and get the messaging information. But now a lot of SaaS apps have really robust APIs that really let you [00:04:50] index everything that you'd want. That's two. And the third sort of big macro reason why it's happening now, and why we're able to do it well, is the fact that the SaaS apps have just exploded. Like every company uses, you know, 10 to a hundred apps. And so just the urgent need for information, especially with, you know, remote work and work from home, it's just so critical that people expect this almost as a default that you should have in your company. [00:05:17] And a lot of our customers just say, hey, I can't go back to a life without internal search. And I think we think that's just how it should be. So that's kind of the story about how Glean was founded. And a lot of the LLM stuff — it's neat that a lot of that's happening at the same time that we are trying to solve this problem, because it's definitely applicable to the problem we're trying to solve. [00:05:37] And I'm really excited by some of the stuff that we are able to do with it. [00:05:41] From Syntactic to Semantic Search [00:05:41] I was talking with somebody last weekend, and they were saying that over the last couple years we're going from the web being syntax driven — you know, you Google for information retrieval — to being semantics driven, where the syntax is not as important. [00:05:55] It's more how you actually explain the question. And uh, we just asked Sarah from Seek.ai on the previous episode — instead of doing natural language and things like that for enterprise knowledge, it's more for business use cases. So I'm curious to see, you know, the enterprise of the future, what that looks like. You know, are there gonna be way fewer dropdowns and, kind of like, uh, SQL queries and stuff like that? [00:06:19] And it's more this virtual, almost like person that embodies the company, that is like an LLM in a way. But how do you do that without being able to surface all the knowledge that people have in the organization? So something like Glean is, uh,
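To make the syntactic-to-semantic shift concrete: the core retrieval move is ranking documents by embedding similarity instead of keyword overlap. Here is a toy sketch — the `embed()` stub is a hypothetical stand-in for a real sentence-embedding model, so its random vectors make the demo runnable but not meaningful until you swap a real encoder in:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical encoder: maps text to a unit-length vector.
    A real system would call a sentence-embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def keyword_search(query: str, docs: list[str]) -> list[str]:
    # Syntactic search: exact substring match only.
    return [d for d in docs if query.lower() in d.lower()]

def semantic_search(query: str, docs: list[str], top_k: int = 3):
    # Semantic search: rank every document by cosine similarity
    # to the query embedding, regardless of shared keywords.
    q = embed(query)
    scored = [(float(q @ embed(d)), d) for d in docs]
    return sorted(scored, reverse=True)[:top_k]

docs = [
    "Q3 planning deck shared by Raj after the roadmap review",
    "Onboarding guide for the new billing service",
    "Notes from last week's incident retrospective",
]
# With a real encoder, the second query would surface the first doc
# even though it shares almost no keywords with it.
print(keyword_search("that presentation from three meetings ago", docs))  # []
print(semantic_search("the deck Raj presented a few meetings back", docs))
```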
2023 is the year of Multimodal AI, and Latent Space is going multimodal too! * This podcast comes with a video demo at the 1hr mark and it’s a good excuse to launch our YouTube - please subscribe! * We are also holding two events in San Francisco — the first AI | UX meetup next week (already full; we’ll send a recap here on the newsletter) and Latent Space Liftoff Day on May 4th (signup here; but get in touch if you have a high profile launch you’d like to make). * We also joined the Chroma/OpenAI ChatGPT Plugins Hackathon last week where we won the Turing and Replit awards and met some of you in person! This post featured on Hacker News. Out of the five senses of the human body, I’d put sight at the very top. But weirdly, when it comes to AI, Computer Vision has felt left out of the recent wave compared to image generation, text reasoning, and even audio transcription. We got our first taste of it with the OCR capabilities demo in the GPT-4 Developer Livestream, but to date GPT-4’s vision capability has not yet been released. Meta AI leapfrogged OpenAI and everyone else by fully open sourcing their Segment Anything Model (SAM) last week, complete with paper, model, weights, data (6x more images and 400x more masks than OpenImages), and a very slick demo website. This is a marked change from their previous LLaMA release, which was not commercially licensed. The response has been ecstatic: SAM was the talk of the town at the ChatGPT Plugins Hackathon, and I was fortunate enough to book Joseph Nelson, who was frantically integrating SAM into Roboflow this past weekend. As a passionate instructor, hacker, and founder, Joseph is possibly the single best person in the world to bring the rest of us up to speed on the state of Computer Vision and the implications of SAM. I was already a fan of him from his previous pod with (hopefully future guest) Beyang Liu of Sourcegraph, so this served as a personal catchup as well. Enjoy! And let us know what other news/models/guests you’d like to have us discuss! - swyx Recorded in-person at the beautiful StudioPod studios in San Francisco. Full transcript is below the fold.
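If you want to poke at SAM yourself, the released repo exposes a small predictor API. A minimal sketch, assuming you have downloaded Meta’s ViT-H checkpoint — the image path and point prompt here are illustrative:

```python
# pip install git+https://github.com/facebookresearch/segment-anything.git
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the released ViT-H checkpoint (filename as distributed by Meta).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the image encoder once

# Prompt with a single foreground point (x, y) — SAM is promptable,
# so box and mask prompts work too.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,       # return 3 candidate masks
)
print(masks.shape, scores)  # (3, H, W) boolean masks plus confidence scores
```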
Show Notes * Joseph’s links: Twitter, Linkedin, Personal * Sourcegraph Podcast and Game Theory Story * Represently * Roboflow at Pioneer and YCombinator * Udacity Self Driving Car dataset story * Computer Vision Annotation Formats * SAM recap - top things to know for those living in a cave * https://segment-anything.com/ * https://segment-anything.com/demo * https://arxiv.org/pdf/2304.02643.pdf * https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/ * https://blog.roboflow.com/segment-anything-breakdown/ * https://ai.facebook.com/datasets/segment-anything/ * Ask Roboflow https://ask.roboflow.ai/ * GPT-4 Multimodal https://blog.roboflow.com/gpt-4-impact-speculation/ Cut for time: * WSJ mention * Des Moines Register story * All In Pod: timestamped mention * In Forbes: underrepresented investors in Series A * Roboflow greatest hits * https://blog.roboflow.com/mountain-dew-contest-computer-vision/ * https://blog.roboflow.com/self-driving-car-dataset-missing-pedestrians/ * https://blog.roboflow.com/nerualhash-collision/ and Apple CSAM issue * https://www.rf100.org/ Timestamps * [00:00:19] Introducing Joseph * [00:02:28] Why Iowa * [00:05:52] Origin of Roboflow * [00:16:12] Why Computer Vision * [00:17:50] Computer Vision Use Cases * [00:26:15] The Economics of Annotation/Segmentation * [00:32:17] Computer Vision Annotation Formats * [00:36:41] Intro to Computer Vision & Segmentation * [00:39:08] YOLO * [00:44:44] World Knowledge of Foundation Models * [00:46:21] Segment Anything Model * [00:51:29] SAM: Zero Shot Transfer * [00:51:53] SAM: Promptability * [00:53:24] SAM: Model Assisted Labeling * [00:56:03] SAM doesn't have labels * [00:59:23] Labeling on the Browser * [01:00:28] Roboflow + SAM Video Demo * [01:07:27] Future Predictions * [01:08:04] GPT4 Multimodality * [01:09:27] Remaining Hard Problems * [01:13:57] Ask Roboflow (2019) * [01:15:26] How to keep up in AI Transcripts [00:00:00] Hello everyone. It is me, swyx, and I'm here with Joseph Nelson. Hey, welcome to the studio. It's nice. Thanks so much for having me. We, uh, have a professional setup in here. [00:00:19] Introducing Joseph [00:00:19] Joseph, you and I have known each other online for a little bit. I first heard about you on the Sourcegraph podcast with Beyang, and I highly, highly recommend that: there's a really good game theory story that is the best YC application story I've ever heard, and I won't tease further cuz they should go listen to that. [00:00:36] What do you think? It's a good story. It's a good story. So you got your Bachelor of Economics from George Washington — by the way, fun fact, I'm also an econ major as well. You are very politically active. I guess you, you did a lot of, um, interning in political offices, and you were responding to, um, the sheer amount of support load that the Congress people have. [00:01:00] So you built Represently, which is Zendesk for Congress. And, uh, I liked in your Sourcegraph podcast how you talked about how being more responsive to, to constituents is always a good thing no matter what side of the aisle you're on. You also had a sideline as a data science instructor at General Assembly, [00:01:18] as a consultant in your own consultancy, and you also did a bunch of hackathon stuff with Magic Sudoku, which was your transition from NLP into computer vision.
And apparently at TechCrunch Disrupt in 2019, you tried to add chess, and that was your whole villain origin story for "hey, computer vision's too hard — [00:01:36] let's build the platform to do that." Uh, and now you're co-founder and CEO of Roboflow. So that's your bio. Um, what's not in there that [00:01:43] people should know about you? One key thing that people realize within maybe five minutes of meeting me, uh: I'm from Iowa. Yes. And it's like a funnily novel thing. I mean, you know, growing up in Iowa, it's like everyone you know is from Iowa. [00:01:56] But then when I left to go to school, there were not that many Iowans at GW, and people were like, oh, you're Iowa Joe. Like, you know, how'd you find out about this school out here? I was like, oh, well, the Pony Express was running that day, so I was able to send it in. So I really like to lean into it. [00:02:11] And so you kind of become a default ambassador for places that people don't meet a lot of other people from, so I've kind of taken that upon myself to just make it be a part of my identity. So, you know, my handle everywhere is Joseph of Iowa. Like, you can probably find my social security number just from knowing that that's my handle, [00:02:25] cuz I put it plastered everywhere. So that's probably like one thing. [00:02:28] Why Iowa [00:02:28] What's your best pitch for Iowa? Like why is [00:02:30] Iowa awesome? The people. Iowa's filled with people that genuinely care. You know, if you're waiting in a long line, someone's gonna strike up a conversation, kinda ask how you're doing, and it's just like a really genuine place. [00:02:40] It was a wonderful place to grow up too. At the time, you know, I thought, uh, yeah, I was kind of embarrassed to be from there. And then actually, kinda looking back, it's like, wow, you know, there's good schools, smart, friendly people. The, uh, high school that I went to — actually Ben Silbermann, the CEO, or I guess former CEO, and co-founder of Pinterest, and I had the same teachers in high school at different times. [00:03:01] The co-founder — or excuse me, the creator — of CRISPR, the gene editing technique, Dr. Jennifer Doudna. Oh, so that's the patent debate. There's Doudna. Oh, and then there's Feng Zhang. Uh, okay. Yeah. Yeah. So Dr. Feng Zhang, who I think ultimately won the patent war, uh, is also from the same high school. [00:03:18] Well, she won the patent, but Jennifer won the [00:03:20] prize. [00:03:21] I think that's probably — I mean, I looked into it a little closely. I think it was something like she won the patent for CRISPR first existing, and then Feng got it for, uh, first use on humans, which I guess for commercial reasons is perhaps the more interesting one. But I dunno, bio and life sciences — is that my area of expertise? [00:03:38] Yep. Knowing people that came from Iowa that do cool things certainly is. Yes. So I'll claim it. Um, but yeah, at Roboflow actually, we're bringing the full team to Iowa for the very first time this last week of, of April. And, well, folks from like Scotland, all over. That's your company [00:03:54] retreat? [00:03:54] The Iowa one, [00:03:55] yeah. Nice. Well, so we do two a year. You know, we've done Miami. Some of the smaller teams have done like Nashville or Austin or these sorts of places, but we said, you know, let's bring it back to kinda the origin and the roots. Uh, and we'll bring the full team to, to Des Moines, Iowa.
[00:04:13] So, yeah, like I was mentioning, folks from California to Scotland and many places in between are all gonna descend upon Des Moines for a week of, uh, learning and working. So maybe you can check in with those folks on what they decide and interpret about what's cool about our state. Well, one thing — are you actually headquartered in Des Moines on paper? [00:04:30] Yes. Yeah. [00:04:30] Isn't that amazing? That's like, everyone's Delaware, and you're like — [00:04:33] I was doing research. Well, we're, we're incorporated in Delaware. Okay. We're a Delaware C corp like, uh, most companies, but our headquarters, yeah, is in Des Moines. And part of that's a few things. One, it's like, you know, there's this nice Iowa pride. [00:04:43] And second is, uh, Brad, my co-founder, and I both grew up in, in Des Moines. And we met each other
We’re trying a new format, inspired by Acquired.fm! No guests, no news, just highly prepared, in-depth conversation on one topic that will level up your understanding. We aren’t experts, we are learning in public. Please let us know what we got wrong and what you think of this new format! When you ask someone to break down the basic ingredients of a Large Language Model, you’ll often hear a few things: You need lots of data. You need lots of compute. You need models with billions of parameters. Trust the Bitter Lesson, more more more, scale is all you need. Right? Nobody ever mentions the subtle influence of great benchmarking. LLM Benchmarks mark our progress in building artificial intelligences, progressing from * knowing what words go with others (1985 WordNet) * recognizing names and entities (2004 Enron Emails) * and images of numbers, letters, and clothes (1998-2017 MNIST) * language translation (2002 BLEU → 2020 XTREME) * more and more images (2009 ImageNet, CIFAR) * reasoning in sentences (2016 LAMBADA) and paragraphs (2019 AI2RC, DROP) * stringing together whole sentences (2018 GLUE and SuperGLUE) * question answering (2019 CoQA) * having common sense (2018 Swag and HellaSwag, 2019 WinoGrande) * knowledge of all human tasks and professional exams (2021 MMLU) * knowing everything (2022 BIG-Bench) People who make benchmarks are the unsung heroes of LLM research, because they dream up ever harder tests that last ever shorter periods of time. In our first AI Fundamentals episode, we take a trek through history to try to explain what we have learned about LLM Benchmarking, and what issues we have discovered with them. There are way, way too many links and references to include in this email. You can follow along with the work we did for our show prep in this podcast’s accompanying repo, with all papers and selected tests pulled out. Enjoy and please let us know what other fundamentals topics you’d like us to cover! Timestamps * [00:00:21] Benchmarking Questions * [00:03:08] Why AI Benchmarks matter * [00:06:02] Introducing Benchmark Metrics * [00:08:14] Benchmarking Methodology * [00:09:45] 1985-1989: WordNet and Entailment * [00:12:44] 1998-2004 Enron Emails and MNIST * [00:14:35] 2009-14: ImageNet, CIFAR and the AlexNet Moment for Deep Learning * [00:17:42] 2018-19: GLUE and SuperGLUE - Single Sentence, Similarity and Paraphrase, Inference * [00:23:21] 2018-19: Swag and HellaSwag - Common Sense Inference * [00:26:07] Aside: How to Design Benchmarks * [00:26:51] 2021: MMLU - Human level Professional Knowledge * [00:29:39] 2021: HumanEval - Code Generation * [00:31:51] 2020: XTREME - Multilingual Benchmarks * [00:35:14] 2022: BIG-Bench - The Biggest of the Benches * [00:37:40] EDIT: Why BIG-Bench is missing from GPT4 Results * [00:38:25] Issue: GPT4 vs the mystery of the AMC10/12 * [00:40:28] Issue: Data Contamination * [00:42:13] Other Issues: Benchmark Data Quality and the Iris data set * [00:45:44] Tradeoffs of Latency, Inference Cost, Throughput * [00:49:45] Conclusion Transcript [00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co-host, swyx, writer and editor of Latent Space. [00:00:21] Benchmarking Questions [00:00:21] Up until today, we never verified that we're actually humans to you guys. So one good thing to do today would be to run ourselves through some AI benchmarks and see if we are humans. [00:00:31] Indeed.
So, since I got you here, Sean, I'll start with one of the classic benchmark questions, which is: what movie does this emoji set describe? The emoji set is a little kid, a blue fish, a yellow fish, and an orange puffer fish. One movie does that. I think if you added an octopus, it would be slightly easier. But I prepped this question, so I know it's Finding Nemo. [00:00:57] You are so far a human. The second one of these emoji questions instead depicts a superhero man, a superwoman, and three little kids, one of which is a toddler. So you got this one too? Yeah. It's one of my favorite movies ever. It's The Incredibles. Uh, the second one was kind of a letdown, but the first is a... Awesome. Okay, I'm gonna ramp it up a little bit. So let's ask something that involves a little bit of world knowledge. So: when you drop a ball from rest, it accelerates downward at 9.8 meters per second squared. If you throw it downward instead, assuming no air resistance — so you're throwing it down instead of dropping it — its acceleration immediately after leaving your hand is: A, 9.8 meters per second squared. [00:01:38] B, more than 9.8 meters per second squared. C, less than 9.8 meters per second squared. D, cannot say unless the speed of the throw is given. I would say B. You know, I started as a physics major and then I changed, but I think I got enough from my first year. That is B? Yeah, you've even proven that you're human, cuz you got it wrong, [00:01:56] whereas the AI got it right: it's 9.8 meters per second squared, the gravitational constant, uh, because you are no longer accelerating it after it leaves the hand. The question says if you throw it downward, immediately after leaving your hand — what is the acceleration? It goes back to the gravitational constant, which is 9.8 meters per second squared. I thought you said you were a physics major. [00:02:17] That's why I changed. So I'm a human. I'm a human. You're human. You're human. But you, you got them all right. So I can't ramp it up. I can't ramp it up. So, assuming, uh, the AI got all of that right, you would think that AI will get this one wrong. Mm-hmm. Because it's just predicting the next token, right? [00:02:31] Right. In the complex z plane, the set of points satisfying the equation z squared equals modulus of z squared is: A, a pair of points; B, a circle; C, a half line; D, a line. The processing, this is going on in your head... A line. This is hard. Yes, that is, that is a line. Okay. What's funny is that I think if an AI was doing this, it would take the same exact amount of time to answer this as it would every single other word, [00:03:05] cuz it's computationally the same to them. Right. [00:03:08] Why AI Benchmarks matter [00:03:08] Um, so anyway, if you haven't caught on, today we're doing our first, uh, AI fundamentals episode, with just the two of us, no guest, because we wanted to go deep on one topic, and the topic is AI benchmarks. So why are we focusing on AI benchmarks? So, GPT4 just came out last week, and every time a new model comes out, all we hear about is it's so much better than the previous model on benchmark X, on benchmark Y. [00:03:33] It performs better on this, better on that. But most people don't actually know what actually goes on under these benchmarks. So we thought it would be helpful for people to put these things in context. And also benchmarks evolve: the more the models improve, the harder the benchmarks get. Like, I couldn't even get one of the questions right. [00:03:52] So obviously they're working, and you'll see that.
From the 1990s, when some of the first ones came out, to today, the difficulty of them has truly skyrocketed. So we wanna give a brief history of that, and leave you with a mental model on, okay, what does it really mean to do well at X benchmark versus Y benchmark? [00:04:13] Um, so excited to dig into that. I would also say, when you ask people what are the ingredients going into a large language model, they'll talk to you about the data. They'll talk to you about the neural nets, they'll talk to you about the amount of compute, you know, how many GPUs are getting burned based on this. [00:04:30] They never talk to you about the benchmarks. And it's actually a shame, because they're so influential. Like, that is the entirety of how we judge whether a language model is better than another, cuz a language model can do anything out of potentially infinite capabilities. How do you judge one model versus another? [00:04:48] How do you know you're getting better? And so I think it's an area of intense specialization. Also, I think when individuals like us, you know, sort of play with the language models, we are basically doing benchmarks. We're saying, look, it's doing this awesome thing that I found. Guess what? There have been academics studying this for 20 years who have, uh, developed a science to this, and we can actually benefit from studying what they have done. [00:05:10] Yep. And obviously the benchmarks also drive research, you know, in a way, whenever you're working on a new model. Yeah. The benchmark kind of constrains what you're optimizing for, in a way, because if you release a paper and it performs worse than all the other models, like, you're not gonna publish it. [00:05:27] Yeah. So in a way, there's bias in the benchmark itself. Yeah. Yeah. We'll talk a little bit about that. Right. Are we optimizing for the right things when we over-optimize for a single benchmark over some others? And also, curiously, when GPT4 was released, they omitted some very [00:05:44] commonplace industry benchmarks. So the way that you present yourself, it is a form of marketing. It is a form of trying to say you're better than something else, and trying to explain where you think you do better. But it's very hard to verify as well, because there are certain problems with reproducing benchmarks, uh, especially when you come to large language models. [00:06:02] Introducing Benchmark Metrics [00:06:02] So where do we go from here? Should we go over the major concepts? Yeah. When it comes to benchmark metrics, we get three main measures: accuracy, precision, and recall. Accuracy is just looking at how many successful predictions the model makes. Precision is the ratio of true positives — meaning how many of the predictions made are good — compared to the overall amount of predictions made. Versus recall, which is what proportion of the actual positives were identified. [00:06:31] So if you think of a Spotify playlist, to maybe make it a little more approachable,
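To pin those three metrics down — sticking with the playlist framing, where a “positive” is a song you’d actually like — here is a minimal sketch (the song sets are made up):

```python
def precision_recall(predicted: set, relevant: set):
    """predicted: items the model flagged as positives (songs it added
    to your playlist); relevant: items that are actually positives."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0  # how many picks were good
    recall = true_positives / len(relevant) if relevant else 0.0       # how many good songs were found
    return precision, recall

playlist = {"song_a", "song_b", "song_c", "song_d"}  # the model's picks
liked    = {"song_a", "song_b", "song_e"}            # songs you actually like
p, r = precision_recall(playlist, liked)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```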
We are excited to feature our first academic on the pod! I first came across Shreya when her tweetstorm of MLOps principles went viral: Shreya’s holistic approach to production grade machine learning has taken her from Stanford to Facebook and Google Brain, being the first ML Engineer at Viaduct, and now a PhD in Databases (trust us, it’s relevant) at UC Berkeley with the new EPIC Data Lab. If you know Berkeley’s history in turning cutting edge research into gamechanging startups, you should be as excited as we are! Recorded in-person at the beautiful StudioPod studios in San Francisco. Full transcript is below the fold. Edit from the future: Shreya obliged us with another round of LLMOps hot takes after the pod! Other Links * Shreya’s About: https://www.shreya-shankar.com/about/ * Berkeley Sky Computing Lab - Utility Computing for the Cloud * Berkeley Epic Data Lab - low-code and no-code interfaces for data work, powered by next-generation predictive programming techniques * Shreya’s ML Principles * Grounded Theory * Lightning Round: * Favorite AI Product: Stability Dreamstudio * 1 Year Prediction: Data management platforms * Request for startup: Design system generator * Takeaway: It’s not a fad! Timestamps * [00:00:27] Introducing Shreya (poorly) * [00:03:38] The 3 V's of ML development * [00:05:45] Bridging Development and Production * [00:08:40] Preventing Data Leakage * [00:10:31] Berkeley's Unique Research Lab Culture * [00:11:53] From Static to Dynamically Updated Data * [00:12:55] Models as views on Data * [00:15:03] Principle: Version everything you do * [00:16:30] Principle: Always validate your data * [00:18:33] Heuristics for Model Architecture Selection * [00:20:36] The LLMOps Stack * [00:22:50] Shadow Models * [00:23:53] Keeping Up With Research * [00:26:10] Grounded Theory Research * [00:27:59] Google Brain vs Academia * [00:31:41] Advice for New Grads * [00:32:59] Helping Minorities in CS * [00:35:06] Lightning Round Transcript [00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my co-host, swyx, writer and editor of Latent Space. Yeah, [00:00:21] it's awesome to have another awesome guest, Shreya Shankar. Welcome! [00:00:25] Thanks for having me. I'm super excited. [00:00:27] Introducing Shreya (poorly) [00:00:27] So I'll intro your formal background and then you can fill in the blanks. [00:00:31] You did a BS/MS and then a PhD in Computer Science at Stanford. So — [00:00:36] I'm, I'm a PhD at Berkeley. Ah, Berkeley. I'm sorry. Oops. No, it's okay. Everything's the Bay. I shouldn't say that — somebody is gonna get mad, but... I've lived here for eight years now. So — [00:00:50] and then an intern at Google, machine learning engineer at Viaduct, an OEM manufacturer — uh, or rather, an OEM analytics platform. [00:00:59] Yes. And now you're an EIR, entrepreneur in residence, at Amplify. [00:01:02] I think that's on hold a little bit as I'm doing my PhD. It's a very unofficial title, but it sounds fancy on paper when you say [00:01:09] it out loud. Yeah, it is fancy. Well, so that is what people see on your LinkedIn. What should people know about you that's not on your LinkedIn? [00:01:16] Yeah, I don't think I've updated my LinkedIn since I started the PhD, so: I'm doing my PhD in databases. It is not AI machine learning, but I work on data management for building AI and ML powered software.
I guess like all of my personal interests: I'm super into going for walks, hiking, love trying coffee in the Bay Area. [00:01:42] And recently I've been getting into cooking a lot. Mm-hmm — so what kind of cooking? Ooh. I feel like I really like pastas. But that's because I love carbs. So I don't know if it's the pasta as much as it's the carbs. Do you ever cook for [00:01:56] like large [00:01:57] dinners? Large groups? Yeah. We just hosted about like 25 people a couple weeks ago, and I was super ambitious. [00:02:04] I was like, I'm gonna cook for everyone, like a full dinner. But then kids were coming, and I was like, I know they're not gonna eat tofu. The other thing with hosting in the Bay Area is there's gonna be someone vegan. There's gonna be someone gluten-free. Mm-hmm. There's gonna be someone who's keto. Yeah. [00:02:20] Good luck. [00:02:21] Oh, you forgot the seeds. That's the seed disrespect. [00:02:25] I know. So I was like, oh my God, I don't know how I'm gonna do this. Yeah. The dessert too. I was like, I don't know how I'm gonna make everything like a vegan, keto, nut-free dessert — just water. It was a fun challenge. We ordered pizza for the children, and a lot of people ate the pizza. [00:02:43] So I think that's what happens when you try to cook for everyone. [00:02:48] Yeah. The reason I dug a bit on the cooking is I always find, like, if you do cook for large groups, it's a little bit of an ops situation. Yeah. Like a lot of engineering. A lot of trying to figure out what you need to deliver and then what the pipeline [00:02:59] is and — Oh, for sure. [00:03:01] You write that Gantt chart like a day in advance. Did you actually have a Gantt chart? Oh, I did. My gosh. Of course I had a Gantt chart. I dunno how people — did [00:03:08] you orchestrate it with Airflow or...? [00:03:12] I orchestrated it myself. [00:03:15] That's awesome. But yeah, we're so excited to have you, and you've been a pretty prolific writer, researcher — and thank you — [00:03:20] you have a lot of great content out there. I think your website now says, "I'm currently learning how to make machine learning work in the real world," which is a challenge that, mm-hmm, everybody is facing right now, from the Microsofts and Googles of the world that have rogue AIs flirting with the people querying them, to people deploying models to production. [00:03:38] The 3 V's of ML development [00:03:38] Maybe let's run through some of the research you've done, especially on MLOps. Sure. And how to get these things in production. The first thing I really liked from one of your papers was the three Vs of ML development. Mm-hmm — which is velocity, validation, and versioning. And one point that you were making is that the development workflow of software engineering is very different from ML, because ML is very experiment driven. [00:04:00] Correct. There's a lot of changes that you need to make; you need to kill things very quickly if they're not working. So maybe run us through why you settled on those three Vs as some of the core things to think about, and some of the other takeaways from the research. Yeah, [00:04:15] so this paper was conducted as a loosely structured interview study. [00:04:18] So the idea is you interview like three or four people, and then you go and annotate all the transcripts, tag them, kind of put the word clouds out there, whatever. There's a bunch of like cool software to do this.
Then we kept seeing these themes. "Velocity" wasn't always the word — sometimes it was like "experiment quickly" or "high experimentation rate," sometimes it was velocity. And we found that that was like the number one thing for people who were talking about their work in this kind of development phase. We also categorized it into phases of the work. So the life cycle really just fell into place when we annotated the transcripts, and so did the variables. [00:04:55] And after three or four interviews you iterate on them. You kind of iterate on the questions, and you iterate on the codes, or the tags, that you give to the transcripts, and then you do it again. And we repeated this process like three or four times, up to that many people, and the story kind of told itself in a way that [00:05:11] makes sense. [00:05:12] I think, like, I was trying to figure out why you picked those, but it's interesting to see that everybody kinda has the same challenges. [00:05:18] It fell out. I think a big thing, like even talking to the people who are at the Microsofts and the Googles: they have models in production, they're frequently training these models in production, yet their development work is so experimental. [00:05:31] Mm-hmm. And we were like, so it doesn't change. Even when you become a mature organization, you still throw 100 darts at the wall for five of them to stick. And that's super interesting, and I think that's a little bit unique to data science and machine learning work. [00:05:45] Bridging Development and Production [00:05:45] Yeah. And one point you had is, how do we bridge the gap between the development environments and the production environments? [00:05:51] Obviously you're still doing work in this space. What are some of the top of mind areas of focus for you in [00:05:57] this area? Yeah, I think right now people separate these environments because the production environment doesn't allow people to move at the rate that they need to for experimentation. A lot of the times, as you're doing like deep learning, you wanna have GPUs, and you don't wanna be launching your job on a Kubernetes cluster and waiting for the results to come back. [00:06:17] And so that's just the hardware side of things. And then there is the execution stack. Um, you wanna be able to query and create features in real time as you're training your model. But in production, things are different, because these features are kind of scheduled, maybe generated every week. There's a little bit of lag. These assumptions are not accounted for in development and training time. Mm-hmm. So of course we're gonna see that gap. And then finally, like the top level, the interface level: people wanna experiment in notebooks, in environments that allow them to visualize and inspect their state. [00:06:50] But production jobs don't typically run in notebooks. Yeah, yeah, yeah. I mean, there are tools like Papermill and so on, but it's not the same, right? So when you
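One concrete version of the “always validate your data” principle, aimed at exactly the freshness-lag mismatch Shreya describes, might look like the sketch below — the column names and thresholds are hypothetical, and `feature_ts` is assumed to be a UTC timestamp column:

```python
import pandas as pd

def validate_features(df: pd.DataFrame, max_staleness_days: float = 7.0):
    """Cheap pre-training / pre-serving checks. Thresholds are illustrative."""
    errors = []
    # Schema: the columns the model was trained against must exist.
    for col in ("user_id", "feature_ts", "avg_session_len"):
        if col not in df.columns:
            errors.append(f"missing column: {col}")
    if not errors:
        # Freshness: scheduled pipelines lag; fail loudly on stale features.
        latest = pd.to_datetime(df["feature_ts"], utc=True).max()
        staleness = pd.Timestamp.now(tz="UTC") - latest
        if staleness > pd.Timedelta(days=max_staleness_days):
            errors.append(f"features are {staleness} old")
        # Nulls: a broken upstream join usually shows up as a null spike.
        null_rate = df["avg_session_len"].isna().mean()
        if null_rate > 0.05:
            errors.append(f"null rate {null_rate:.1%} exceeds 5%")
    if errors:
        raise ValueError("; ".join(errors))

# validate_features(feature_df)  # call before every training or scoring run
```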
This blogpost has been updated since original release to add more links and references. The ChatGPT Plugins announcement today could be viewed as the launch of ChatGPT’s “App Store”, a moment as significant as when Apple opened its App Store for the iPhone in 2008 or when Facebook let developers loose on its Open Graph in 2010. With a dozen lines of simple JSON and a mostly-English prompt to help ChatGPT understand what the plugin does, developers will be able to add extensions to ChatGPT to get information and trigger actions in the real world. OpenAI itself launched with some killer first party plugins for: * Browsing the web, * writing AND executing Python code (in an effortlessly multimodal way), * retrieving embedded documents from external datastores, * as well as 11 launch partner plugins from Expedia to Milo to Zapier. My recap thread was well received: But the thing that broke my brain was that ChatGPT’s Python Interpreter plugin can run nontrivial code - users can upload video files and ask ChatGPT to edit them, meaning it has now gone beyond mere chat to offer a substantial compute platform with storage, memory and file upload/download. I immediately started my first AI Twitter Space to process this historical moment with Alessio and friends of the pod live. OpenAI’s Logan (see Episode 1 from *last month*…) suggested that you might be able to link ChatGPT up with Zapier triggers to do arbitrary tasks! And then Flo Crivello, who just launched his AI Assistant startup Lindy, joined us to discuss the builder perspective. Tune in on this EMERGENCY EPISODE of Latent Space to hear developers ask and debate all the issues spilling out from the ChatGPT Plugins launch - and let us know in the comments if you want more/have further questions! SPECIAL NOTE: I was caught up in the hype and was far more negative on Replit than I initially intended as I tried to figure out this new ChatGPT programming paradigm. I regret this. Replit is extremely innovative and well positioned to help you develop and host ChatGPT plugins, and of course Amjad is already on top of it: Mea culpa. Timestamps * [00:00:38] First Reactions to ChatGPT Plugins * [00:07:53] Q&A: Keeping up with AI * [00:10:39] Q&A: ChatGPT Intepreter changes Programming * [00:12:27] Q&A: ChatGPT for Education * [00:15:21] Q&A: GPT4 Sketch to Website Demo * [00:16:32] Q&A: AI Competition and Human Jobs * [00:18:44] ChatGPT Plugins as App Store * [00:34:40] Google vs ChatGPT * [00:36:04] Nader Dabit on Selling His GPT App * [00:43:16] Q&A: ChatGPT Waitlist and Voice * [00:45:26] LangChain with Human in the Loop * [00:46:58] Google vs Microsoft vs Apple * [00:51:43] ChatGPT Plugin Ideas * [00:53:49] Not an app store? * [00:55:24] LangChain and the Future of AI * [01:00:48] Q&A: ChatGPT Bots and Cronjobs * [01:04:43] Logan Joins Us! * [01:07:14] Q&A: Plugins Rollout * [01:08:26] Q&A: Plugins Discovery * [01:10:00] Q&A: OpenAI vs BingChat * [01:11:03] Q&A: App Store Monetization * [01:14:45] Q&A: ChatGPT Plugins API * [01:17:17] Q&A: Python Interpreter * [01:19:58] The History of App Stores and Marketplaces * [01:22:40] LindyAI's Flo Crivello Joins Us * [01:29:42] AI Safety * [01:31:07] Multimodal GPT4 * [01:32:10] Designing AI-safe APIs * [01:34:39] Flo's Closing Comments Transcript [00:00:00] Hello and welcome to the Latent Space Emergency episode. This is our first ever, where ChatGPT just dropped a plugin ecosystem today — or at least they demoed their plugins. It's still on the wait list, but it is the app store moment for AI.
And we did an emergency two hour space with Logan from OpenAI and Flo Crivello from Lindy AI and a bunch of our friends. [00:00:28] And if you ever wanted to listen to what it's like to hear developers process in real time when a new launch happens, this is it. Enjoy. [00:00:38] First Reactions to ChatGPT Plugins [00:00:38] I assume everyone has read the blog post. For me the, the big s**t was: did you see Greg Brockman's tweet about FFmpeg? I did not. I should check it out. It is amazing. Okay, so. So ChatGPT can generate Python code. We knew this, this is not new, and they can now run the code that it generates. [00:00:58] This is not new. I mean this is like, this is good. It's not like surprising. It's, it's fine. It can run FFmpeg code. You can upload a file, ask it to edit the video file, and it can process the video file, and then it can give you the link to download the video file. So it's a general purpose compute platform. [00:01:22] Wow. Did they show how to do this? Agents? I just, I just pinned it. Did I, did I tweet it into this space? I dunno how to use it. Yeah, it's, it's showing up there. Okay. And by, by the way, hi to people. I, I don't know how to run spaces — it's not something I normally do. [00:01:42] But — you wanna say something? Please request. But yeah, reactions? Have a look at this video, because it generates and runs video editing code. You can upload any arbitrary file. It seems to have good enough compute and memory and file storage. This is not chat anymore, man. I don't know what the hell this is. [00:02:01] What, what is this? [00:02:02] Well, progress has been way faster than I expected. That's all I can — I, I don't know how to respond. Yeah. It's pretty wild. I wonder, I'm wondering how this will affect, like, opening up the app store — different from, let's say, the Apple App Store when it opened up. Because there are a lot of big companies just building stuff already, and how, like, a small developer will be able to build something that's not already there. [00:02:31] I dunno. It will be interesting. So one thing that's really nice — have you seen the installation process for the plugins? It's right at the bottom of the blog post, and you have to play the video to kind of see it, but literally anybody can write your own plugin. It's a small little JSON file. It's, it's literally like 10 lines of code. It's 10 lines of — you describe what your plugin does in English, you give it an OpenAPI spec. That's it. That, that's, that's the plugin. It's amazing. You can distribute your plugin. This is, this is easier than extensions Manifest V3, which nobody knows how to use. This is English. [00:03:15] You write English. So, so, yeah. I mean I think, I think there'll be a lot of people trying to develop for this if they can get access, which, you know, everybody's on a wait list. I, I've signed up to 200 wait lists this week. I wonder if it'll be different if you sign up as a, as a developer or as a chat user. [00:03:35] Hopefully it doesn't matter, right? Use different emails and sign up to both. Let's, let's just see. In fact, use ChatGPT to generate, like, plausible sounding reasons for why you want to build whatever, cause they don't. [00:03:47] But yeah, I mean, how do you compete? I, I don't know, man. You know, OpenAI is definitely doing a partnership strategy to do what they do here, which means they're essentially picking favorites.
So if you're a competitor of Expedia, Kayak, OpenTable, Wolfram, or Zapier, you're s**t out of luck, kind of, you know? [00:04:06] Cause these are presumptive winners of their spaces. Right. And it'll happen in too many industries, probably. Right. I was thinking about maybe summarization or, or, I don't know, YouTube video summarization, but there seems to be some application of that already in the examples that you shared. Yeah, yeah, yeah. [00:04:26] They have shared that, but I think there's always room to improve the experience. It's just, you know, it's interesting — like, sort of platform strategy, right? Like if you write an OpenAI chat plugin, you instantly gain access to a hundred million users, right? All of them can instantly use your thing. [00:04:47] Whereas if you are a standalone app or company, good luck trying to get people to use OpenAI through you. There's just no point. So you'd much rather just be on the OpenAI platform and promote there. The, the fortunate thing is they don't have some kind of, like, popularity ranking yet. Actually, someone should go register, like, openaipluginslist.com or something, where everyone can submit their own OpenAI plugins and upload them, review them — cuz this, like, this is not a complete app store without reviews and a rating system and a reputation system, and probably monetization. OpenAI probably doesn't care about that. [00:05:26] But I mean, I can go start that right now. F**k. I can go start it right now. [00:05:34] Yeah, it'll, it'll take a while, right? Like this is the basic version of the app store evolving. But this is a pretty basic version. Yeah. The basic version can browse the web, it can write and execute code, it can retrieve — you know, we can retrieve data from documents, right? So all the document search startups just died. [00:06:02] There's like five of these in Y Combinator right now. Oh. [00:06:08] Examples. Pretty crazy how, how they use the FFmpeg library — or, I dunno if I'm saying that correctly — but right in there. You don't need to, to write code to... [00:06:27] it's crazy. Dunno. Yeah. Any reactions? Please, please, you know, open space. Anyone can request a speaker. Oh, Ash, come on in, Ash. I have to add you as a speaker. Yeah, we're, we're just reacting here. I just, I needed a place to talk, and I'm in Japan and I don't have anyone else to talk to, so I just want to share this moment. [00:06:46] I think it's a special moment in history. This is the biggest new app store since... ever. Yeah. Hey, Shawn. I think plugin is already taken. Oh man. Someone, someone bought it already. Yep, of course. Right? Of course. What are your reactions? How are you feeling? What are you seeing out there? [00:07:07] Just crowdsource all the tweeting. Yeah, man, it's, it's been wild. I mean, I get ou
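For flavor, the manifest being described is roughly the following shape, sketched here as a Python dict based on the launch announcement — the field values and URLs are made up, and the exact schema may well change while the feature is waitlisted:

```python
import json

# Roughly the ai-plugin.json shape from the launch announcement: a plain-English
# description for the model plus a pointer to an OpenAPI spec. Values are invented.
manifest = {
    "schema_version": "v1",
    "name_for_human": "TODO List",
    "name_for_model": "todo",
    "description_for_human": "Manage your TODO list.",
    "description_for_model": (
        "Plugin for managing a TODO list. "
        "Use it when the user wants to add, view, or remove TODOs."
    ),
    "auth": {"type": "none"},
    "api": {
        "type": "openapi",
        "url": "https://example.com/openapi.yaml",
    },
    "logo_url": "https://example.com/logo.png",
    "contact_email": "support@example.com",
    "legal_info_url": "https://example.com/legal",
}
print(json.dumps(manifest, indent=2))
```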
If Text is the Universal Interface, then Text to SQL is perhaps the killer B2B use case for Generative AI. You may have seen incredible demos from Perplexity AI, OSS Insights, and CensusGPT, where the barrier of learning SQL and schemas goes away and you can intuitively converse with your data in natural language. But in the multi-billion dollar data engineering industry, Seek.ai has emerged as the frontrunner in building a conversational engine and knowledge base that truly democratizes data insights. We’re proud to present our first remote interview with Sarah Nagy to learn how AI can help you “seek what matters”! Timestamps * 00:00: Intro to Sarah * 03:40: Seek.ai origin * 05:45: Data driven vs Data backfit * 09:15: How Enterprises adopt AI * 12:55: Patents and IP Law * 14:05: The Semantic Layer * 16:35: Interfaces - Dashboards vs Chat? * 21:05: LLM performance and selection * 26:05: LLMOps and LangChain * 30:55: Lightning round Show notes * Sarah Nagy Linkedin * Seek.ai * Sarah on the dbt podcast Lightning Rounds * Favorite AI Product: Stable Diffusion * Favorite AI Community: Eleuther * One year prediction: Things will move fast! * Request for Startup: Scheduling/Emails (shoutout Ipso.ai from our hackathon!) * Takeaway: Automate everything! This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
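The core text-to-SQL loop behind demos like these is compact enough to sketch: hand a model your schema and the question, get SQL back, and execute it behind a guardrail. In this toy version, `call_llm` is a stub standing in for whatever completions API you use, and production systems like Seek.ai layer a whole knowledge base on top:

```python
import sqlite3

SCHEMA = """CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer TEXT,
    total_usd REAL
);"""

def call_llm(prompt: str) -> str:
    # Stub so the sketch runs end-to-end; swap in a real completions API.
    return "SELECT customer, SUM(total_usd) FROM orders GROUP BY customer;"

def answer(question: str, db: sqlite3.Connection):
    prompt = (
        f"Given this SQLite schema:\n{SCHEMA}\n"
        f"Write one SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    sql = call_llm(prompt)
    # Guardrail: never execute writes the model might hallucinate.
    assert sql.lstrip().lower().startswith("select"), "read-only queries only"
    return db.execute(sql).fetchall()

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany("INSERT INTO orders (customer, total_usd) VALUES (?, ?)",
               [("acme", 120.0), ("acme", 80.0), ("globex", 45.5)])
print(answer("Which customers spend the most?", db))  # e.g. [('acme', 200.0), ('globex', 45.5)]
```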
OpenAI just rollicked the AI world yet again yesterday — while releasing the long awaited ChatGPT API, they also priced it at $2 per million tokens generated, which is 90% cheaper than the text-davinci-003 pricing of the “GPT3.5” family. Their blogpost on how they did it is vague: Through a series of system-wide optimizations, we’ve achieved 90% cost reduction for ChatGPT since December; we’re now passing through those savings to API users. We were fortunate enough to record Episode 2 of our podcast with someone who routinely creates 90%+ improvements for their customers, and in fact have started productizing their own infra skills with Codeium, the rapidly growing free-forever Copilot alternative (see What Building “Copilot for X” Really Takes). Varun Mohan is CEO of Exafunction/Codeium, and he indulged us in diving deep into AI infrastructure, compute-optimal training vs inference tradeoffs, and why he loves suffering. Recorded in-person at the beautiful StudioPod studios in San Francisco. Full transcript is below the fold. Timestamps * 00:00: Intro to Varun and Exafunction * 03:06: GPU Efficiency, Model Flop Utilization, Dynamic Multiplexing * 05:30: Should companies own their ML infrastructure? * 07:00: The two kinds of LLM Applications * 08:30: Codeium * 14:50: “Our growth is 4-5% day over day” * 16:30: Latency, Quality, and Correctability * 20:30: Acceleration mode vs Exploration mode * 22:00: Copilot for X - Harvey AI’s deal with Allen & Overy * 25:00: Scaling Laws (Chinchilla) * 28:45: “The compute-optimal model might not be easy to serve” * 30:00: Smaller models * 32:30: Deepmind Retro can retrieve external information * 34:30: Implications for embedding databases * 37:10: LLMOps - Eval, Data Cleaning * 39:45: Testing/User feedback * 41:00: “Users Is All You Need” * 42:45: General Intelligence + Domain Specific Dataset * 43:15: The God Nvidia computer * 46:00: Lightning round Show notes * Varun Mohan Linkedin * Exafunction * Blogpost: Are GPUs Worth it for ML * Codeium * Copilot statistics * Eleuther’s The Pile and The Stack * What Building “Copilot for X” Really Takes * Copilot for X * Harvey, Copilot for Law - deal with Allen & Overy * Scaling Laws * Training Compute-Optimal Large Language Models - arXiv (Chinchilla paper) * chinchilla's wild implications (LessWrong) * UL2 20B: An Open Source Unified Language Learner (20B) * Paper - Deepmind Retro * “Does it make your beer taste better” * HumanEval benchmark/dataset * Reverse Engineering Copilot internals * Quora Poe * Prasanna Sankar notes on FLOPs and Bandwidth * NVIDIA H100 specs - 3TB/s GPU memory, 900GB/s NVLink Interconnect * Optimizer state is 14x size of model - 175B params => 2.5TB to store state → needs at least 30 H100 machines with 80GB each * Connor Leahy on The Gradient Podcast Lightning Rounds * Favorite AI Product: Midjourney * Favorite AI Community: Eleuther and GPT-J * One year prediction: Better models, more creative usecases * Request for Startup: Superathlete Fitness Assistant * Takeaway: Continue to tinker! Transcript [00:00:00] Alessio Fanelli: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my cohost, swyx, writer, editor of L Space Diaries. [00:00:20] swyx: Hey, and today we have Varun Mohan from Codeium / Exafunction on. I should introduce you a little bit because I like to get the LinkedIn background out of the way.
[00:00:30] So you did CS at MIT, and then you spent a few years at Nuro, where you were ultimately tech lead manager for autonomy. That's an interesting dive — self-driving cars and AI — and then you went straight into Exafunction with a few of your coworkers, and that's where I met some of them and started knowing about Exafunction. [00:00:51] And then from out of nowhere you cloned GitHub Copilot. That's a lot of progress in a very short amount of time. So anyway, welcome. [00:00:59] Varun Mohan: That's high praise. [00:01:00] swyx: What's one thing about you that doesn't appear on LinkedIn that is a big part of what people should know? [00:01:05] Varun Mohan: I actually really like endurance sports, actually. [00:01:09] Like I, I've done multiple triathlons. I've actually biked from San Francisco to LA. I like things that are like suffering. I like to suffer while I, while I do sports. Yeah. [00:01:19] swyx: Do you think a lot about like code and tech while you're doing those endurance sports, or are you just — [00:01:24] Varun Mohan: your mind is just focused? [00:01:26] I think it's maybe a little bit of both. One of the nice things about, I guess, endurance athletics: it's one of the few things you can do where you can't really think about much beyond suffering. Like you're climbing up a hill on a bike and you see like, uh, you see how many more feet you need to climb, and at that point you're just struggling. [00:01:45] That's your only job. Mm-hmm. Yeah. The only thing you can think of is, uh, pedaling one more pedal. So it's actually like a nice, a nice way to not think about work. Yeah, [00:01:53] Alessio Fanelli: yeah, yeah. Maybe for the audience, you wanna tell a bit about Exafunction, how that came to be, and how coding came out [00:01:59] Varun Mohan: of that. So a little bit about Exafunction. [00:02:02] Before working at Exafunction, I worked at Nuro, as Sean was just saying, and at Nuro I sort of managed large scale offline deep learning infrastructure. Realized that deep learning infrastructure is really hard to build and really hard to maintain for even the most sophisticated companies, and started Exafunction to basically solve that gap, to make it so that it was much easier for companies [00:02:24] to serve deep learning workloads at scale. One of the key issues that we noticed is GPUs are extremely hard to manage, fundamentally because they work differently than CPUs. And once a company has heterogeneous hardware requirements, it's hard to make sure that you get the most outta the hardware. It's hard to make sure you can get great GPU utilization, and Exafunction was specifically built to make it so that you could get the most outta the hardware. [00:02:50] Make sure your GPU was effectively virtualized and decoupled from your workload, to make it so that you could be confident that you were running at whatever scale you wanted without burning the bank. [00:03:00] swyx: Yeah. You gave me this metric about inefficiency, [00:03:03] Varun Mohan: right? Oh, okay. Like flop efficiency. Yeah. Yeah. So basically, I think it comes down to, for most people, one of the things about CPUs that's really nice is with containers, right? [00:03:13] You can end up having a single machine, and you can place many containers on it, and all the containers will slowly start eating the compute. It's not really the same with GPUs. Like, let's say you have a single GPU. For the most part, you'll only have one container using that GPU.
And because of that, people heavily underestimate what a single container can sort of do, [00:03:33] and the GPU is left like heavily idle. And I guess the common term now with a lot of LLM workloads is like the flop efficiency of these workloads. MFU, yeah. Yeah. Model flop utilization. The model flop utilization, which is basically like what fraction of the flops, or compute, on the hardware is actually getting used. [00:03:49] And sort of what we did at Exafunction: not only did we make it so that the model was always running, we also built compiler technology to make it so that the model was also running more efficiently. And some of these things are with tricks like operator fusion — like, basically you could imagine fusing two operations together such that the time it takes to compute [00:04:07] the fused operation is lower than the time it takes for each individual operation. Oh my God. Yeah. [00:04:13] Alessio Fanelli: Yeah. And you have this technique called dynamic multiplexing, which is basically, instead of having a one-to-one relationship, you have one GPU for multiple clients. And I saw one of your customers — they went from three clients to just one single GPU, and cut the cost by 97%. [00:04:29] What were some of those learnings, seeing hardware usage and efficiencies, and how did that then play into what, what [00:04:34] Varun Mohan: you're building? Yeah, I think it basically showed that there was probably a gap with even very sophisticated teams. Making good use of the hardware is just not an easy problem. I think that was the main thing. It's not that these teams were like not good at what they were doing, it's just that they were trying to solve a completely separate problem. [00:04:50] They had a model that was trained in-house, and their goal was to just run it, and that should be an easy thing to do. But surprisingly, still, it's not that easy. And that problem compounds in complexity with the fact that there are more accelerators now in the cloud. There's like TPUs, Inferentia, and there's a lot of decisions, uh, that users need to make, even in terms of GPU types. [00:05:10] And I guess sort of what we had was, we had internal expertise on what the right way to run the workload was, and we were basically able to build infrastructure and make it so that companies could do that without thinking. So most [00:05:21] Alessio Fanelli: teams are underutilizing their hardware. How should they think about what to own? You know, like, should they own the inference architecture? Should they use XLA to get it to production? How do you think [00:05:32] Varun Mohan: about it? So I think one thing that has proven to be true over the last year and a half is companies, for the most part, should not be trying to figure out what the optimal ML architecture is or training architecture is. [00:05:45] Especially with a lot of these large language models. We have generic models and transformer architecture that are solving a lot of distinct pro
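Model flop utilization is easy to make concrete. Here is a back-of-envelope sketch using the common ~6 FLOPs-per-parameter-per-token approximation for transformer training — the cluster numbers below are illustrative spec-sheet values, not measurements:

```python
def training_mfu(params: float, tokens_per_sec: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Fraction of the cluster's peak FLOP/s a training run actually uses.
    Uses the standard approximation of ~6 FLOPs per parameter per token."""
    achieved = 6 * params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)

# Illustrative: a 175B-param model on 1,024 GPUs at ~312 TFLOP/s peak each
# (A100 bf16 spec sheet), processing 120k tokens/sec cluster-wide.
print(f"MFU = {training_mfu(175e9, 120_000, 1024, 312e12):.1%}")  # MFU = 39.4%
```

Anything far below a few tens of percent is the “GPU left heavily idle” situation Varun describes, which is what tricks like operator fusion and dynamic multiplexing claw back.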
We’re so glad to launch our first podcast episode with Logan Kilpatrick! This also happens to be his first public interview since joining OpenAI as their first Developer Advocate. Thanks Logan! Recorded in-person at the beautiful StudioPod studios in San Francisco. Full transcript is below the fold. Timestamps * 00:29: Logan’s path to OpenAI * 07:06: On ChatGPT and GPT3 API * 16:16: On Prompt Engineering * 20:30: Usecases and LLM-Native Products * 25:38: Risks and benefits of building on OpenAI * 35:22: OpenAI Codex * 42:40: Apple's Neural Engine * 44:21: Lightning Round Show notes * Sam Altman’s interview with Connie Loizos * OpenAI Cookbook * OpenAI’s new Embedding Model * Cohere on Word and Sentence Embeddings * (referenced) What is AGI-hard? Lightning Rounds * Favorite AI Product: https://www.synthesia.io/ * Favorite AI Community: MLOps * One year prediction: Personalized AI, https://civitai.com/ * Takeaway: AI Revolution is here! Transcript [00:00:00] Alessio Fanelli: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my cohost, swyx, writer and editor of L Space Diaries. Hey. [00:00:20] swyx: Hey. Our guest today is Logan Kilpatrick. What I'm gonna try to do is introduce you based on what people know about you, and then you can fill in the blanks. [00:00:28] Introducing Logan [00:00:28] swyx: So you are the first developer advocate at OpenAI, which is a humongous achievement. Congrats. You're also the lead developer community advocate of the Julia language. I'm interested in a little bit of that, and apparently, as I did a bit of research on you, you got into Julia through NASA, where you interned and worked on stuff that's gonna land on the moon, apparently. [00:00:50] And you also worked on computer vision at Apple, and had a stint at PathAI as you fell down the machine learning rabbit hole. What should people know about you that's kind of not on your LinkedIn, that sort of ties together your interests [00:01:02] and story? Logan Kilpatrick: It's a good question. I think so one of the things that is on my LinkedIn that wasn't mentioned, that's super near and dear to my heart and wraps a lot of my open source machine learning developer advocacy experience together, is supporting NumFOCUS. [00:01:17] And NumFOCUS is the nonprofit that helps enable a bunch of the open source scientific projects — like Julia, Jupyter, Pandas, NumPy. All of those open source projects are facilitated legally and fiscally through NumFOCUS. So it's a very critical, important part of the ecosystem, and something that I spend a bunch of my now more limited free time helping support. [00:01:37] So yeah, something that's on my LinkedIn, but it's, it's something that's important to me. Well, [00:01:42] swyx: it's not as well known of a name, so maybe people kind of skip over it cuz they were like, I don't know what [00:01:45] Logan Kilpatrick: to do with this. Yeah. It's super interesting to see that too. Just one point of context for that is we tried at one point to get a Wikipedia page for NumFOCUS, and it's — it's providing, again, the infrastructure for, it's like a hundred plus open source scientific projects, and they're like, it's not notable enough. [00:01:59] I'm like, well, you know, there's something like 30 plus million developers around the world who use all these open source tools. It's like the foundation of all the open source science that happens.
Every breakthrough in science: they discovered the black hole, the first picture of the black hole, all that stuff, using NumFOCUS tools. The Mars rovers, NumFOCUS tools. And it's interesting to see the disconnect between the nonprofit that supports those projects and the actual success of the projects themselves.

[00:02:26] swyx: Well, we'll get a bunch of people focused on NumFOCUS and we'll get it on Wikipedia. That is our goal. That is our shot. Is this something that you do often? You seem to always do a lot of community stuff. When you get into something... I don't know where you find time for this. You're also a conference chair for DjangoCon, which was last year as well. Do you fall down the rabbit hole of a language and then look for community opportunities? Is that how you get into it?

[00:02:51] Logan Kilpatrick: Yeah, so the context for the Django stuff was I'd actually been teaching, and still am, through Harvard's Division of Continuing Education as a teaching fellow for a Django class, and had spent like two and a half years actually teaching students every semester how to program in Django, and realized that it was kind of the one ecosystem or technical tool that I was using regularly but whose community I wasn't actually contributing to.

[00:03:13] So I think sometime in 2021 I applied to be on the board of directors of the Django Events Foundation North America, who help run DjangoCon, and was fortunate enough to join and serve as the chair of DjangoCon US, and then just actually rolled off the board because of all the craziness, and I have a lot less free time now.

[00:03:32] And actually at PathAI, the core product was using Django, so it also had a lot of connections to work, so it was a little bit easier to justify that time. Versus now at OpenAI, we're not doing any Django stuff, unfortunately.

[00:03:44] swyx: Or Julia, I mean, should we talk about this? Like, are you defecting from Julia? What's going on?

[00:03:50] Logan Kilpatrick: It's actually felt a little bit strange recently, because for the longest time, and I'm happy to talk about this in the context of Apple as well, the Julia ecosystem was my outlet to do a lot of the developer advocacy, developer relations, and community work that I wanted to do. Because again, at Apple I was just training machine learning models.

[00:04:07] Before that, I was doing software engineering at Apple, and even at PathAI we didn't really have a developer product, so I was doing advocacy work, but it wasn't developer relations in the traditional sense. So now that I'm so deeply doing developer relations work at OpenAI, it's really difficult to

[00:04:26] continue to have the energy, after I just spent nine hours doing developer relations stuff, to go and do a bunch more developer relations stuff after work. So I'll be interested to see for myself how I'm able to continue to do that work. The challenge is that it's such critical, important work.

[00:04:43] Like I think the Julia ecosystem is so important. I think the language is super important. It's gonna continue to grow in popularity, and it's helping scientists and engineers solve problems they wouldn't otherwise be able to. So yeah, the burden is on me to continue to do that work, even though I don't have a lot of time now.
[00:04:58] Alessio Fanelli: And I think when it comes to communities, the machine learning technical community has, I think, exploded in the last six to nine months. You know, you're the first developer advocate at OpenAI, so I don't think anybody has a frame of reference on what that means. What is that?

[00:05:13] swyx: Yeah, how do you define the job? Let's talk about that, your role.

[00:05:16] Logan Kilpatrick: Yeah, it's a good question, and I think a lot of those questions actually still exist at OpenAI today. A lot of traditional developer advocacy, at least what you see on Twitter, which I think is a lot of people's perception of developer advocacy and developer relations, is just putting out external content, going to events, speaking at conferences.

[00:05:35] And I think OpenAI is very unique in the sense that, at least at the present moment, we have so much inbound interest that there is no desire for us to do that type of developer advocacy work. So it's more from a developer experience point of view, actually: how can we enable developers to be successful?

[00:05:53] And that, at the present moment, is building a strong foundation of documentation and things like that. We had a bunch of amazing folks internally who were doing some of this work, but it really wasn't their full-time job. They were focused on other things and just helping out here and there.

[00:06:05] And for me, my full-time job right now is: how can we improve the documentation so that people can build the next generation of products and services on top of our API? There's so much work that has to happen, but it's been a ton of fun so far.

[00:06:20] swyx: I find, being in developer relations myself, it's kind of a fill-in-the-blanks type of thing.

[00:06:24] You go to where you're needed the most. OpenAI has no problem getting attention. It is more that people are not familiar with the APIs and the best practices around programming for large language models, which is a thing that did not exist three years ago, two years ago, maybe one year ago.

[00:06:40] I don't know. When did you launch your API? I think you launched Dall-E as an API, or... I don't know.

[00:06:45] Logan Kilpatrick: I dunno the history. I think Dall-E was second. I think GPT3 launched, and then the API, like two years ago or something like that. And then Dall-E was, I think, a little more than a year ago.

[00:06:58] And now all the ChatGPT stuff has blown it all out of the water.

[00:07:04] swyx: Which you have a waitlist for. Should we get into that?

[00:07:06] Logan Kilpatrick: Yeah.

[00:07:07] ChatGPT

[00:07:07] Alessio Fanelli: