



























Alfred Wahlforss was running out of options. His startup, Listen Labs, needed to hire over 100 engineers, but competing against Mark Zuckerberg's $100 million offers seemed impossible. So he spent $5,000 — a fifth of his marketing budget — on a billboard in San Francisco displaying what looked like gibberish: five strings of random numbers.
The numbers were actually AI tokens. Decoded, they led to a coding challenge: build an algorithm to act as a digital bouncer at Berghain, the Berlin nightclub famous for rejecting nearly everyone at the door. Within days, thousands attempted the puzzle. 430 cracked it. Some got hired. The winner flew to Berlin, all expenses paid.
That unconventional approach has now attracted $69 million in Series B funding, led by Ribbit Capital with participation from Evantic and existing investors Sequoia Capital, Conviction, and Pear VC. The round values Listen Labs at $500 million and brings its total capital to $100 million. In nine months since launch, the company has grown annualized revenue by 15x to eight figures and conducted over one million AI-powered interviews.
"When you obsess over customers, everything else follows," Wahlforss said in an interview with VentureBeat. "Teams that use Listen bring the customer into every decision, from marketing to product, and when the customer is delighted, everyone is."
Listen's AI researcher finds participants, conducts in-depth interviews, and delivers actionable insights in hours, not weeks. The platform replaces the traditional choice between quantitative surveys — which provide statistical precision but miss nuance — and qualitative interviews, which deliver depth but cannot scale.
Wahlforss explained the limitation of existing approaches: "Essentially surveys give you false precision because people end up answering the same question... You can't get the outliers. People are actually not honest on surveys." The alternative, one-on-one human interviews, "gives you a lot of depth. You can ask follow up questions. You can kind of double check if they actually know what they're talking about. And the problem is you can't scale that."
The platform works in four steps: users create a study with AI assistance, Listen recruits participants from its global network of 30 million people, an AI moderator conducts in-depth interviews with follow-up questions, and results are packaged into executive-ready reports including key themes, highlight reels, and slide decks.
What distinguishes Listen's approach is its use of open-ended video conversations rather than multiple-choice forms. "In a survey, you can kind of guess what you should answer, and you have four options," Wahlforss said. "Oh, they probably want me to buy high income. Let me click on that button versus an open ended response. It just generates much more honesty."
Listen finds and qualifies the right participants from that 30-million-person network. But building the panel required confronting what Wahlforss called "one of the most shocking things that we've learned when we entered this industry": rampant fraud.
"Essentially, there's a financial transaction involved, which means there will be bad players," he explained. "We actually had some of the largest companies, some of them have billions in revenue, send us people who claim to be kind of enterprise buyers to our platform and our system immediately detected, like, fraud, fraud, fraud, fraud, fraud."
The company built what it calls a "quality guard" that cross-references LinkedIn profiles with video responses to verify identity, checks consistency across how participants answer questions, and flags suspicious patterns. The result, according to Wahlforss: "People talk three times more. They're much more honest when they talk about sensitive topics like politics and mental health."
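Listen has not published how its quality guard works, but the checks it describes (cross-referencing a claimed profile against what participants actually say, checking consistency across answers, flagging low-effort responses) can be sketched as a simple scoring pass. Everything below, from the field names to the thresholds, is hypothetical and illustrative only.

```python
# Hypothetical sketch of a "quality guard"-style check; field names and thresholds
# are illustrative, not Listen Labs' actual implementation.
from dataclasses import dataclass

@dataclass
class Participant:
    claimed_title: str          # e.g. from a LinkedIn profile
    transcript: str             # text of the video interview
    answer_lengths: list[int]   # words per answer, for consistency checks

def quality_score(p: Participant) -> float:
    score = 1.0
    # 1. Identity cross-check: does the claimed role ever come up in the interview?
    if p.claimed_title.lower() not in p.transcript.lower():
        score -= 0.4
    # 2. Consistency: suspiciously uniform, near-identical answers suggest a bot or click farm.
    if p.answer_lengths and max(p.answer_lengths) - min(p.answer_lengths) < 3:
        score -= 0.3
    # 3. Effort: extremely short transcripts rarely contain usable insight.
    if len(p.transcript.split()) < 50:
        score -= 0.3
    return max(score, 0.0)

candidate = Participant("enterprise buyer", "I mostly shop for groceries...", [4, 5, 4])
print("flag for review" if quality_score(candidate) < 0.5 else "accept")
```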
Emeritus, an online education company that uses Listen, reported that approximately 20% of survey responses previously fell into the fraudulent or low-quality category. With Listen, they reduced this to almost zero. "We did not have to replace any responses because of fraud or gibberish information," said Gabrielli Tiburi, Assistant Manager of Customer Insights at Emeritus.
The speed advantage has proven central to Listen's pitch. Traditional customer research at Microsoft could take four to six weeks to generate insights. "By the time we get to them, either the decision has been made or we lose out on the opportunity to actually influence it," said Romani Patel, Senior Research Manager at Microsoft.
With Listen, Microsoft can now get insights in days, and in many cases, within hours.
The platform has already powered several high-profile initiatives. Microsoft used Listen Labs to collect global customer stories for its 50th anniversary celebration. "We wanted users to share how Copilot is empowering them to bring their best self forward," Patel said, "and we were able to collect those user video stories within a day." Traditionally, that kind of work would have taken six to eight weeks.
Simple Modern, an Oklahoma-based drinkware company, used Listen to test a new product concept. The process took about an hour to write questions, an hour to launch the study, and 2.5 hours to receive feedback from 120 people across the country. "We went from 'Should we even have this product?' to 'How should we launch it?'" said Chris Hoyle, the company's Chief Marketing Officer.
Chubbies, the shorts brand, achieved a 24x increase in youth research participation — growing from 5 to 120 participants — by using Listen to overcome the scheduling challenges of traditional focus groups with children. "There's school, sports, dinner, and homework," explained Lauren Neville, Director of Insights and Innovation. "I had to find a way to hear from them that fit into their schedules."
The company also discovered product issues through AI interviews that might have gone undetected otherwise. Wahlforss described how the AI "through conversations, realized there were like issues with the kids' shorts line, and decided to, like, interview hundreds of kids. And I understand that there were issues in the liner of the shorts and that they were, like, scratchy, quote, unquote, according to the people interviewed." The redesigned product became "a blockbuster hit."
Listen Labs is entering a massive but fragmented market. Wahlforss cited research from Andreessen Horowitz estimating the market research industry at roughly $140 billion annually, populated by legacy players — some with more than a billion dollars in revenue — that he believes are vulnerable to disruption.
"There are very much existing budget lines that we are replacing," Wahlforss said. "Why we're replacing them is that one, they're super costly. Two, they're kind of stuck in this old paradigm of choosing between a survey or interview, and they also take months to work with."
But the more intriguing dynamic may be that AI-powered research doesn't just replace existing spending — it creates new demand. Wahlforss invoked the Jevons paradox, an economic principle that occurs when technological advancements make a resource more efficient to use, but increased efficiency leads to increased overall consumption rather than decreased consumption.
"What I've noticed is that as something gets cheaper, you don't need less of it. You want more of it," Wahlforss explained. "There's infinite demand for customer understanding. So the researchers on the team can do an order of magnitude more research, and also other people who weren't researchers before can now do that as part of their job."
Listen Labs traces its origins to a consumer app that Wahlforss and his co-founder built after meeting at Harvard. "We built this consumer app that got 20,000 downloads in one day," Wahlforss recalled. "We had all these users, and we were thinking like, okay, what can we do to get to know them better? And we built this prototype of what Listen is today."
The founding team brings an unusual pedigree. Wahlforss's co-founder "was the national champion in competitive programming in Germany, and he worked at Tesla Autopilot." The company claims that 30% of its engineering team are medalists from the International Olympiad in Informatics — the same competition that produced the founders of Cognition, the AI coding startup.
The Berghain billboard stunt generated approximately 5 million views across social media, according to Wahlforss. It reflected the intensity of the talent war in the Bay Area.
"We had to do these things because some of our, like early employees, joined the company before we had a working toilet," he said. "But now we fixed that situation."
The company grew from 5 to 40 employees in 2024 and plans to reach 150 this year. It hires engineers for non-engineering roles across marketing, growth, and operations — a bet that in the AI era, technical fluency matters everywhere.
Wahlforss outlined an ambitious product roadmap that pushes into more speculative territory. The company is building "the ability to simulate your customers, so you can take all of those interviews we've done, and then extrapolate based on that and create synthetic users or simulated user voices."
Beyond simulation, Listen aims to enable automated action based on research findings. "Can you not just make recommendations, but also create spawn agents to either change things in code or some customer churns? Can you give them a discount and try to bring them back?"
Wahlforss acknowledged the ethical implications. "Obviously, as you said, there's kind of ethical concerns there. Of like, automated decision making overall can be bad, but we will have considerable guardrails to make sure that the companies are always in the loop."
The company already handles sensitive data with care. "We don't train on any of the data," Wahlforss said. "We will also scrub any sensitive PII automatically so the model can detect that. And there are times when, for example, you work with investors, where if you accidentally mention something that could be material, non public information, the AI can actually detect that and remove any information like that."
Perhaps the most provocative implication of Listen's model is how it could reshape product development itself. Wahlforss described a customer — an Australian startup — that has adopted what amounts to a continuous feedback loop.
"They're based in Australia, so they're coding during the day, and then in their night, they're releasing a Listen study with an American audience. Listen validates whatever they built during the day, and they get feedback on that. They can then plug that feedback directly into coding tools like Claude Code and iterate."
The vision extends Y Combinator's famous dictum — "write code, talk to users" — into an automated cycle. "Write code is now getting automated. And I think like talk to users will be as well, and you'll have this kind of infinite loop where you can start to ship this truly amazing product, almost kind of autonomously."
Whether that vision materializes depends on factors beyond Listen's control — the continued improvement of AI models, enterprise willingness to trust automated research, and whether speed truly correlates with better products. A 2024 MIT study found that 95% of AI pilots fail to move into production, a statistic Wahlforss cited as the reason he emphasizes quality over demos.
"I'm constantly have to emphasize like, let's make sure the quality is there and the details are right," he said.
But the company's growth suggests appetite for the experiment. Microsoft's Patel said Listen has "removed the drudgery of research and brought the fun and joy back into my work." Chubbies is now pushing its founder to give everyone in the company a login. Sling Money, a stablecoin payments startup, can create a survey in ten minutes and receive results the same day.
"It's a total game changer," said Ali Romero, Sling Money's marketing manager.
Wahlforss has a different phrase for what he's building. When asked about the tension between speed and rigor — the long-held belief that moving fast means cutting corners — he cited Nat Friedman, the former GitHub CEO and Listen investor, who keeps a list of one-liners on his website.
One of them: "Slow is fake."
It's an aggressive claim for an industry built on methodological caution. But Listen Labs is betting that in the AI era, the companies that listen fastest will be the ones that win. The only question is whether customers will talk back.
Kilo Code, the open-source AI coding startup backed by GitLab cofounder Sid Sijbrandij, is launching a Slack integration that allows software engineering teams to execute code changes, debug issues, and push pull requests directly from their team chat — without opening an IDE or switching applications.
The product, called Kilo for Slack, arrives as the AI-assisted coding market heats up with multibillion-dollar acquisitions and funding rounds. But rather than building another siloed coding assistant, Kilo is making a calculated bet: that the future of AI development tools lies not in locking engineers into a single interface, but in embedding AI capabilities into the fragmented workflows where decisions actually happen.
"Engineering teams don't make decisions in IDE sidebars. They make them in Slack," Scott Breitenother, Kilo Code's co-founder and CEO, said in an interview with VentureBeat. "The Slackbot allows you to do all this — and more — without leaving Slack."
The launch also marks a partnership with MiniMax, the Shanghai-based AI company that recently completed a successful initial public offering in Hong Kong. MiniMax's M2.1 model will serve as the default model powering Kilo for Slack — a decision the company frames as a statement about the closing gap between open-weight and proprietary frontier models.
The integration operates on a simple premise: Slack threads often contain the context needed to fix a bug or implement a feature, but that context gets lost the moment a developer switches to their code editor.
With Kilo for Slack, users mention @Kilo in a Slack thread, and the bot reads the entire conversation, accesses connected GitHub repositories, and either answers questions about the codebase or creates a branch and submits a pull request.
A typical interaction might look like this: A product manager reports a bug in a Slack channel. Engineers discuss potential causes. Instead of someone copying the conversation into their IDE and re-explaining the problem to an AI assistant, a developer simply types: "@Kilo based on this thread, can you implement the fix for the null pointer exception in the Authentication service?"
The bot then spins up a cloud agent, reads the thread context, implements the fix, and pushes a pull request — all visible in Slack.
The company says the entire process eliminates the need to copy information between apps or jump between windows — developers can trigger complex code changes with nothing more than a single message in Slack.
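Kilo has not published the bot's source, but the interaction it describes maps onto a familiar pattern: a Slack event handler that gathers the full thread and hands it to a coding agent. The sketch below uses the real slack_bolt SDK; `run_coding_agent` and the repository name are placeholders standing in for Kilo's hosted cloud agent, not the company's actual API.

```python
# Illustrative sketch only: a minimal Slack handler in the spirit of Kilo for Slack.
from slack_bolt import App

app = App(token="xoxb-...", signing_secret="...")

def run_coding_agent(thread_text: str, repos: list[str]) -> str:
    """Placeholder: hand the thread context to a coding agent and return a PR URL."""
    raise NotImplementedError

@app.event("app_mention")
def handle_mention(event, say, client):
    # Pull the full thread so the agent sees the whole discussion, not just the mention.
    thread_ts = event.get("thread_ts", event["ts"])
    replies = client.conversations_replies(channel=event["channel"], ts=thread_ts)
    context = "\n".join(msg.get("text", "") for msg in replies["messages"])
    pr_url = run_coding_agent(context, repos=["org/auth-service"])  # hypothetical repo
    say(text=f"Opened a pull request: {pr_url}", thread_ts=thread_ts)

if __name__ == "__main__":
    app.start(port=3000)
```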
Kilo's launch explicitly positions the product against two leading AI coding tools: Cursor, which raised $2.3 billion at a $29.3 billion valuation in November, and Claude Code, Anthropic's agentic coding tool.
Breitenother outlined specific limitations he sees in both products' Slack capabilities.
"The Cursor Slack integration is configured on a single-repository basis per workspace or channel," he said. "As a result, if a Slack thread references multiple repositories, users need to manually switch or reconfigure the integration to pull in that additional context."
On Anthropic's offering, he added: "Claude Code documentation for Slack shows how Claude can be added to a workspace and respond to mentions using the surrounding conversation context. However, it does not describe persistent, multi-turn thread state or task-level continuity across longer workflows. Each interaction is handled based on the context included at the time of the prompt, rather than maintaining an evolving execution state over time."
Kilo claims its integration works across multiple repositories simultaneously, maintains conversational context across extended Slack threads, and enables handoffs between Slack, IDEs, cloud agents, and the command-line interface.
Perhaps the most provocative element of the announcement is Kilo's choice of default model. MiniMax is headquartered in Shanghai and recently went public in Hong Kong — a lineage that may raise eyebrows among enterprise customers wary of sending proprietary code through Chinese infrastructure.
Breitenother addressed the concern directly: "MiniMax's recent Hong Kong IPO drew backing from major global institutional investors, including Baillie Gifford, ADIA, GIC, Mirae Asset, Aspex, and EastSpring. This speaks to strong global confidence in models built for global users."
He emphasized that MiniMax models are hosted by major U.S.-compliant cloud providers. "MiniMax M2-series are global leading open-source models, and are hosted by many U.S. compliant cloud providers such as AWS Bedrock, Google Vertex and Microsoft AI Foundry," he said. "In fact, MiniMax models were featured by Matt Garman, the AWS CEO, during this year's re:Invent keynote, showing they're ready for enterprise use at scale."
The company stresses that Kilo for Slack is fundamentally model-agnostic. "Kilo doesn't force customers into any single model," Breitenother said. "Enterprise customers choose which models they use, where they're hosted, and what fits their security, compliance, and risk requirements. Kilo offers access to more than 500 models, so teams can always choose the right model for the job."
The decision to default to M2.1 reflects Kilo's broader thesis about the AI market. According to the company, the performance gap between open-weight and proprietary models has narrowed from 8 percent to 1.7 percent on several key benchmarks. Breitenother clarified that this figure "refers to convergence between open and closed models as measured by the Stanford AI Index using major general benchmarks like HumanEval, MATH, and MMLU, not to any specific agentic coding evaluation."
In third-party evaluations, M2.1 has performed competitively. "In LMArena, an open platform for community-driven AI benchmarking, M2.1 achieved a number-four ranking, right after OpenAI, Anthropic, and Google," Breitenother noted. "What this shows is that M2.1 competes with frontier models in real-world coding workflows, as judged directly by developers."
For engineering teams evaluating the tool, a critical question is what happens to sensitive code and conversations when routed through the integration.
Breitenother walked through the data flow: "When someone mentions @Kilo in Slack, Kilo reads only the content of the Slack thread where it's mentioned, along with basic metadata needed to understand context. It does not have blanket access to a workspace. Access is governed by Slack's standard permission model and the scopes the customer approves during installation."
For repository access, he added: "If the request requires code context, Kilo accesses only the GitHub repositories the customer has explicitly connected. It does not index unrelated repos. Permissions mirror the access level granted through GitHub, and Kilo can't see anything the user or workspace hasn't authorized."
The company states that data is not used to train models and that output visibility follows existing Slack and GitHub permissions.
A particularly thorny question for any AI system that can push code directly to repositories is security. What prevents an AI-generated vulnerability from being merged into production?
"Nothing gets merged automatically," Breitenother said. "When the Kilo Slackbot opens a pull request from a Slack thread, it follows the same guardrails teams already rely on today. The PR goes through existing review workflows and approval processes before anything reaches production."
He added that Kilo can automatically run its built-in code review feature on AI-generated pull requests, "flagging potential issues or security concerns before it ever reaches a developer for review."
Kilo Code sits in an increasingly common but still tricky position: the open-source company charging for hosted services. The complete IDE extension is open-source under an Apache 2.0 license, but Kilo for Slack is a paid, hosted product.
The obvious question: What stops a well-funded competitor — or even a customer — from forking the code and building their own version?
"Forking the code isn't what worries us, because the code itself isn't the hardest part," Breitenother said. "A competitor could fork the repository tomorrow. What they wouldn't get is the infrastructure that safely executes agentic workflows across Slack, GitHub, IDEs, and cloud agents. The experience we've built operating this at scale across many teams and repositories. The trust, integrations, and enterprise-ready controls customers expect out of the box."
He drew parallels to other successful open-source companies: "Open core drives adoption and trust, while the hosted product delivers convenience, reliability, and ongoing innovation. Customers aren't paying for access to code. They're paying for a system that works every day, securely, at scale."
Kilo enters a market that has attracted extraordinary attention and capital over the past year. The practice of using large language models to write and modify code — popularly known as "vibe coding," a term coined by OpenAI co-founder Andrej Karpathy in February 2025 — has become a central focus of enterprise AI investment.
Microsoft CEO Satya Nadella disclosed in April that AI-generated code now accounts for 30 percent of Microsoft's codebase. Google acquired senior employees from AI coding startup Windsurf in a $2.4 billion transaction in July. Cursor's November funding round valued the company at $29.3 billion.
Kilo raised $8 million in seed funding in December 2025 from Breakers, Cota Capital, General Catalyst, Quiet Capital, and Tokyo Black. Sijbrandij, who stepped down as GitLab CEO in 2024 to focus on cancer treatment but remains board chair, contributed early capital and remains involved in day-to-day strategy.
Asked about non-compete considerations given GitLab's own AI investments, Breitenother was brief: "There are no non-compete issues. Kilo is building a fundamentally different approach to AI coding."
Notably, GitLab disclosed in a recent SEC filing that it paid Kilo $1,000 in exchange for a right of first refusal for 10 business days should the startup receive an acquisition proposal before August 2026.
When asked to name an enterprise customer using the Slack integration in production, Breitenother declined: "That's not something we can disclose."
The most significant threat to Kilo's position may come not from other startups but from the frontier AI labs themselves. OpenAI and Anthropic are both building deeper integrations for coding workflows, and both have vastly greater resources.
Breitenother argued that Kilo's advantage lies in its architecture, not its model performance.
"We don't think the long-term moat in AI coding is raw compute or who ships a Slack agent first," he said. "OpenAI and Anthropic are world-class model companies, and they'll continue to build impressive capabilities. But Kilo is built around a different thesis: the hard problem isn't generating code, it's integrating AI into real engineering workflows across tools, repos, and environments."
He outlined three areas where he believes Kilo can differentiate:
"Workflow depth: Kilo is designed to operate across Slack, IDEs, cloud agents, GitHub, and the CLI, with persistent context and execution. Even with OpenAI or Anthropic Slack-native agents, those agents are still fundamentally model-centric. Kilo is workflow-centric."
"Model flexibility: We're model-agnostic by design. Teams don't have to bet on one frontier model or vendor roadmap. That's difficult for companies like OpenAI or Anthropic, whose incentives are naturally aligned with driving usage toward their own models first."
"Platform neutrality: Kilo isn't trying to pull developers into a closed ecosystem. It fits into the tools teams already use."
Kilo's launch reflects a maturing phase in the AI coding market. The initial wave of tools focused on proving that large language models could generate useful code. The current wave is about integration — fitting AI capabilities into the messy reality of how software actually gets built.
That reality involves context fragmented across Slack threads, GitHub issues, IDE windows, and command-line sessions. It involves teams that use different models for different tasks and organizations with complex compliance requirements around data residency and model providers.
Kilo is betting that the winners in this market will not be the companies with the best models, but those that best solve the integration problem — meeting developers in the tools they already use rather than forcing them into new ones.
Kilo for Slack is available now for teams with Kilo Code accounts. Users connect their GitHub repositories through Kilo's integrations dashboard, add the Slack integration, and can then mention @Kilo in any channel where the bot has been added. Usage-based pricing matches the rates of whatever model the team selects.
Whether a 34-person startup can execute on that vision against competitors with billions in capital remains an open question. But if Breitenother is right that the hard problem in AI coding isn't generating code but integrating into workflows, Kilo may have picked the right fight. After all, the best AI in the world doesn't matter much if developers have to leave the conversation to use it.
Anthropic's open source standard, the Model Context Protocol (MCP), released in late 2024, allows users to connect AI models and the agents atop them to external tools in a structured, reliable format. It is the engine behind Anthropic's hit AI agentic programming harness, Claude Code, allowing it to access numerous functions like web browsing and file creation immediately when asked.
But there was one problem: Claude Code typically had to "read" the instruction manual for every single tool available, regardless of whether it was needed for the immediate task, using up the available context that could otherwise be filled with more information from the user's prompts or the agent's responses.
At least until last night. The Claude Code team released an update that fundamentally alters this equation. Dubbed MCP Tool Search, the feature introduces "lazy loading" for AI tools, allowing agents to dynamically fetch tool definitions only when necessary.
It is a shift that moves AI agents from a brute-force architecture to something resembling modern software engineering—and according to early data, it effectively solves the "bloat" problem that was threatening to stifle the ecosystem.
To understand the significance of Tool Search, one must understand the friction of the previous system. MCP, which Anthropic released as an open standard in late 2024, was designed to be a universal way of connecting AI models to data sources and tools—everything from GitHub repositories to local file systems.
However, as the ecosystem grew, so did the "startup tax."
Thariq Shihipar, a member of the technical staff at Anthropic, highlighted the scale of the problem in the announcement.
"We've found that MCP servers may have up to 50+ tools," Shihipar wrote. "Users were documenting setups with 7+ servers consuming 67k+ tokens."
In practical terms, this meant a developer using a robust set of tools might sacrifice a third or more of a 200,000-token context window before typing a single character of a prompt, as AI newsletter author Aakash Gupta pointed out in a post on X.
The model was effectively "reading" hundreds of pages of technical documentation for tools it might never use during that session.
Community analysis provided even starker examples.
Gupta further noted that a single Docker MCP server could consume 125,000 tokens just to define its 135 tools.
"The old constraint forced a brutal tradeoff," he wrote. "Either limit your MCP servers to 2-3 core tools, or accept that half your context budget disappears before you start working."
The solution Anthropic rolled out — which Shihipar called "one of our most-requested features on GitHub" — is elegant in its restraint. Instead of preloading every definition, Claude Code now monitors context usage.
According to the release notes, the system automatically detects when tool descriptions would consume more than 10% of the available context.
When that threshold is crossed, the system switches strategies. Instead of dumping raw documentation into the prompt, it loads a lightweight search index.
When the user asks for a specific action—say, "deploy this container"—Claude Code doesn't scan a massive, pre-loaded list of 200 commands. Instead, it queries the index, finds the relevant tool definition, and pulls only that specific tool into the context.
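Anthropic has not published its implementation, but the lazy-loading pattern it describes can be sketched in a few lines: keep a cheap one-line index in context, and expand a tool's full definition only when a request matches it. The tool names and schemas below are invented for illustration.

```python
# A minimal sketch of the lazy-loading idea behind MCP Tool Search (not Anthropic's
# actual code): a lightweight index stays in context, full definitions load on demand.
TOOL_INDEX = {
    # name -> one-line summary kept permanently in context (cheap)
    "docker_deploy": "Build and deploy a container image",
    "notification_send_user": "Send a direct notification to one user",
}

TOOL_DEFINITIONS = {
    # name -> full schema, loaded lazily (expensive, stays out of context until needed)
    "docker_deploy": {
        "description": "Build and deploy a container image",
        "parameters": {"image": "str", "registry": "str", "tag": "str"},
    },
    "notification_send_user": {
        "description": "Send a direct notification to one user",
        "parameters": {"user_id": "str", "message": "str"},
    },
}

def search_tools(query: str) -> list[str]:
    """Return names whose summaries share words with the query."""
    words = set(query.lower().split())
    return [name for name, summary in TOOL_INDEX.items()
            if words & set(summary.lower().split())]

def load_definitions(query: str) -> dict:
    """Only the matching definitions get added to the model's context."""
    return {name: TOOL_DEFINITIONS[name] for name in search_tools(query)}

print(load_definitions("deploy this container"))  # pulls in docker_deploy only
```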
"Tool Search flips the architecture," Gupta analyzed. "The token savings are dramatic: from ~134k to ~5k in Anthropic’s internal testing. That’s an 85% reduction while maintaining full tool access."
For developers maintaining MCP servers, this shifts the optimization strategy.
Shihipar noted that the `server instructions` field in the MCP definition—previously a "nice to have"—is now critical. It acts as the metadata that helps Claude "know when to search for your tools, similar to skills."
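As a rough illustration of what that metadata looks like on the server side, here is a minimal server built with the official MCP Python SDK (the `mcp` package), passing an `instructions` string at construction time. The wording and the `deploy` tool are made up, and how exactly Tool Search weighs this field is as described in the announcement rather than verified here.

```python
# Hedged example: declaring server instructions so a client knows when to search these tools.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP(
    "deploy-tools",
    instructions=(
        "Tools for building, tagging, and deploying container images. "
        "Search here when the user asks to ship, release, or roll back a service."
    ),
)

@mcp.tool()
def deploy(image: str, tag: str) -> str:
    """Deploy a tagged container image to the default cluster."""
    return f"deployed {image}:{tag}"

if __name__ == "__main__":
    mcp.run()
```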
While the token savings are the headline metric—saving money and memory is always popular—the secondary effect of this update might be more important: focus.
LLMs are notoriously sensitive to "distraction." When a model's context window is stuffed with thousands of lines of irrelevant tool definitions, its ability to reason decreases. It creates a "needle in a haystack" problem where the model struggles to differentiate between similar commands, such as `notification-send-user` versus `notification-send-channel`.
Boris Cherny, Head of Claude Code, emphasized this in his reaction to the launch on X: "Every Claude Code user just got way more context, better instruction following, and the ability to plug in even more tools."
The data backs this up. Internal benchmarks shared by the community indicate that enabling Tool Search improved the accuracy of the Opus 4 model on MCP evaluations from 49% to 74%.
For the newer Opus 4.5, accuracy jumped from 79.5% to 88.1%.
By removing the noise of hundreds of unused tools, the model can dedicate its "attention" mechanisms to the user's actual query and the relevant active tools.
This update signals a maturation in how we treat AI infrastructure. In the early days of any software paradigm, brute force is common. But as systems scale, efficiency becomes the primary engineering challenge.
Aakash Gupta drew a parallel to the evolution of Integrated Development Environments (IDEs) like VSCode or JetBrains. "The bottleneck wasn’t 'too many tools.' It was loading tool definitions like 2020-era static imports instead of 2024-era lazy loading," he wrote. "VSCode doesn’t load every extension at startup. JetBrains doesn’t inject every plugin’s docs into memory."
By adopting "lazy loading"—a standard best practice in web and software development—Anthropic is acknowledging that AI agents are no longer just novelties; they are complex software platforms that require architectural discipline.
For the end user, this update is seamless: Claude Code simply feels "smarter" and retains more memory of the conversation. But for the developer ecosystem, it opens the floodgates.
Previously, there was a "soft cap" on how capable an agent could be. Developers had to curate their toolsets carefully to avoid lobotomizing the model with excessive context. With Tool Search, that ceiling is effectively removed. An agent can theoretically have access to thousands of tools—database connectors, cloud deployment scripts, API wrappers, local file manipulators—without paying a penalty until those tools are actually touched.
It turns the "context economy" from a scarcity model into an access model. As Gupta summarized, "They’re not just optimizing context usage. They’re changing what ‘tool-rich agents’ can mean."
The update is rolling out immediately for Claude Code users. For developers building MCP clients, Anthropic recommends implementing the `ToolSearchTool` to support this dynamic loading, ensuring that as the agentic future arrives, it doesn't run out of memory before it even says hello.
Agentic systems and enterprise search depend on data retrieval that works efficiently and accurately. Database provider MongoDB thinks its newest embedding models can help solve the retrieval-quality problems that emerge as more AI systems go into production.
As agentic and RAG systems move into production, retrieval quality is emerging as a quiet failure point — one that can undermine accuracy, cost, and user trust even when models themselves perform well.
The company launched new versions of its embedding and reranking models. The Voyage 4 embedding family comes in four variants: voyage-4, voyage-4-large, voyage-4-lite, and voyage-4-nano.
MongoDB said voyage-4 serves as its general-purpose model and considers voyage-4-large its flagship. Voyage-4-lite targets low-latency, lower-cost tasks, and voyage-4-nano is intended for local development and testing environments or for on-device data retrieval.
Voyage-4-nano is also MongoDB’s first open-weight model. All models are available via an API and on MongoDB’s Atlas platform.
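For teams evaluating the models, usage should look much like the existing voyageai Python client. The sketch below assumes the announced "voyage-4" model name is exposed through that client, which may differ once the SDK ships; the documents and query are placeholders.

```python
# Hedged usage sketch based on the existing voyageai Python client; the "voyage-4"
# model name follows MongoDB's announcement and may differ in the released SDK.
import voyageai

vo = voyageai.Client(api_key="YOUR_KEY")

docs = [
    "Quarterly churn rose 4% after the pricing change.",
    "The onboarding flow was redesigned in March.",
]
# input_type hints let the model embed queries and documents slightly differently.
doc_vectors = vo.embed(docs, model="voyage-4", input_type="document").embeddings
query_vector = vo.embed(["why did churn increase?"], model="voyage-4",
                        input_type="query").embeddings[0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# Rank documents by similarity to the query; the closest one is retrieved first.
ranked = sorted(zip(docs, doc_vectors), key=lambda p: cosine(query_vector, p[1]), reverse=True)
print(ranked[0][0])
```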
The company said the models outperform similar models from Google and Cohere on RTEB, Hugging Face's retrieval embedding benchmark, whose leaderboard ranks Voyage 4 as the top embedding model.
“Embedding models are one of those invisible choices that can really make or break AI experiences,” Frank Liu, product manager at MongoDB, said in a briefing. “You get them wrong, your search results will feel pretty random and shallow, but if you get them right, your application suddenly feels like it understands your users and your data.”
He added that the goal of the Voyage 4 models is to improve the retrieval of real-world data, which often collapses once agentic and RAG pipelines go into production.
MongoDB also released a new multimodal embedding model, voyage-multimodal-3.5, that can handle documents that include text, images, and video. This model vectorizes the data and extracts semantic meaning from the tables, graphics, figures, and slides typically found in enterprise documents.
For enterprises, an agentic system is only as good as its ability to reliably retrieve the right information at the right time. This requirement becomes harder as workloads scale and context windows fragment.
Several model providers target that layer of agentic AI. Google’s Gemini Embedding model topped the embedding leaderboards, and Cohere launched its Embed 4 multimodal model, which processes documents more than 200 pages long. Mistral said its coding-embedding model, Codestral Embedding, outperforms Cohere, Google, and even MongoDB’s Voyage Code 3. MongoDB argues that benchmark performance alone doesn’t address the operational complexity enterprises face in production.
MongoDB said many clients have found that their data stacks cannot handle context-aware, retrieval-intensive workloads in production. The company said it's seeing more fragmentation with enterprises having to stitch together different solutions to connect databases with a retrieval or reranking model. To help customers who don’t want fragmented solutions, the company is offering its models through a single data platform, Atlas.
MongoDB’s bet is that retrieval can’t be treated as a loose collection of best-of-breed components anymore. For enterprise agents to work reliably at scale, embeddings, reranking, and the data layer need to operate as a tightly integrated system rather than a stitched-together stack.
As agentic AI moves from experiments to real production workloads, a quiet but serious infrastructure problem is coming into focus: memory. Not compute. Not models. Memory.
Under the hood, today’s GPUs simply don’t have enough space to hold the Key-Value (KV) caches that modern, long-running AI agents depend on to maintain context. The result is a lot of invisible waste — GPUs redoing work they’ve already done, cloud costs climbing, and performance taking a hit. It’s a problem that’s already showing up in production environments, even if most people haven’t named it yet.
At a recent stop on the VentureBeat AI Impact Series, WEKA CTO Shimon Ben-David joined VentureBeat CEO Matt Marshall to unpack the industry’s emerging “memory wall,” and why it’s becoming one of the biggest blockers to scaling truly stateful agentic AI — systems that can remember and build on context over time. The conversation didn’t just diagnose the issue; it laid out a new way to think about memory entirely, through an approach WEKA calls token warehousing.
“When we're looking at the infrastructure of inferencing, it is not a GPU cycles challenge. It's mostly a GPU memory problem,” said Ben-David.
The root of the issue comes down to how transformer models work. To generate responses, they rely on KV caches that store contextual information for every token in a conversation. The longer the context window, the more memory those caches consume, and it adds up fast. A single 100,000-token sequence can require roughly 40GB of GPU memory, noted Ben-David.
That wouldn’t be a problem if GPUs had unlimited memory. But they don’t. Even the most advanced GPUs top out at around 288GB of high-bandwidth memory (HBM), and that space also has to hold the model itself.
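The arithmetic behind those numbers is straightforward. The back-of-the-envelope calculation below uses an illustrative 70B-class model shape (80 layers, 8 grouped KV heads, 128-dimensional heads, fp16), not any specific product, and lands in the same tens-of-gigabytes range Ben-David cites.

```python
# Back-of-the-envelope KV-cache math; the model shape is illustrative, not a real product.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2          # fp16
tokens = 100_000

# Keys and values are both cached, hence the factor of 2.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
cache_gb = bytes_per_token * tokens / 1e9
print(f"{cache_gb:.1f} GB for one {tokens:,}-token sequence")   # ~32.8 GB

# Against a 288 GB HBM budget that also has to hold the model weights,
# a handful of long sessions is enough to hit the memory wall.
print(f"{288 / cache_gb:.1f} such sequences fit in 288 GB (before weights)")
```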
In real-world, multi-tenant inference environments, this becomes painful quickly. Workloads like code development or processing tax returns rely heavily on KV-cache for context.
“If I'm loading three or four 100,000-token PDFs into a model, that's it — I've exhausted the KV cache capacity on HBM,” said Ben-David. This is what’s known as the memory wall. “Suddenly, what the inference environment is forced to do is drop data," he added.
That means GPUs are constantly throwing away context they'll soon need again, preventing agents from staying stateful and maintaining context over time.
“We constantly see GPUs in inference environments recalculating things they already did,” Ben-David said. Systems prefill the KV cache, start decoding, then run out of space and evict earlier data. When that context is needed again, the whole process repeats — prefill, decode, prefill again. At scale, that’s an enormous amount of wasted work. It also means wasted energy, added latency, and degraded user experience — all while margins get squeezed.
That GPU recalculation waste shows up directly on the balance sheet. Organizations can suffer nearly 40% overhead just from redundant prefill cycles, and this is creating ripple effects in the inference market.
“If you look at the pricing of large model providers like Anthropic and OpenAI, they are actually teaching users to structure their prompts in ways that increase the likelihood of hitting the same GPU that has their KV cache stored,” said Ben-David. “If you hit that GPU, the system can skip the prefill phase and start decoding immediately, which lets them generate more tokens efficiently.”
But this still doesn't solve the underlying infrastructure problem of extremely limited GPU memory capacity.
“How do you climb over that memory wall? How do you surpass it? That's the key for modern, cost-effective inferencing,” Ben-David said. “We see multiple companies trying to solve that in different ways.”
Some organizations are deploying new linear-attention models that produce smaller KV caches. Others are focused on tackling cache efficiency.
“To be more efficient, companies are using environments that calculate the KV cache on one GPU and then try to copy it from GPU memory or use a local environment for that,” Ben-David explained. “But how do you do that at scale in a cost-effective manner that doesn't strain your memory and doesn't strain your networking? That's something that WEKA is helping our customers with.”
Simply throwing more GPUs at the problem doesn’t solve the AI memory barrier. “There are some problems that you cannot throw enough money at to solve," Ben-David said.
WEKA’s answer is what it calls augmented memory and token warehousing — a way to rethink where and how KV cache data lives. Instead of forcing everything to fit inside GPU memory, WEKA’s Augmented Memory Grid extends the KV cache into a fast, shared “warehouse” within its NeuralMesh architecture.
In practice, this turns memory from a hard constraint into a scalable resource — without adding inference latency. WEKA says customers see KV cache hit rates jump to 96–99% for agentic workloads, along with efficiency gains of up to 4.2x more tokens produced per GPU.
Ben-David put it simply: "Imagine that you have 100 GPUs producing a certain amount of tokens. Now imagine that those hundred GPUs are working as if they're 420 GPUs."
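WEKA has not published its Augmented Memory Grid internals, but the token-warehousing idea can be sketched as a two-tier cache: a small fast tier standing in for GPU HBM, and a large external tier that receives evictions instead of letting them be discarded. Everything below is conceptual, including the block-level granularity.

```python
# Conceptual sketch of token warehousing (not WEKA's implementation): when the GPU tier
# is full, evict KV blocks to an external tier so a later request reloads the cache
# rather than recomputing the prefill.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity_blocks: int):
        self.hbm = OrderedDict()     # fast, scarce (GPU HBM)
        self.warehouse = {}          # slower, effectively unbounded (shared storage tier)
        self.capacity = hbm_capacity_blocks

    def put(self, prompt_hash: str, kv_block: bytes):
        self.hbm[prompt_hash] = kv_block
        self.hbm.move_to_end(prompt_hash)
        if len(self.hbm) > self.capacity:
            evicted_hash, evicted_block = self.hbm.popitem(last=False)
            self.warehouse[evicted_hash] = evicted_block   # keep it, don't throw it away

    def get(self, prompt_hash: str):
        if prompt_hash in self.hbm:
            return self.hbm[prompt_hash], "hbm hit"
        if prompt_hash in self.warehouse:
            block = self.warehouse[prompt_hash]
            self.put(prompt_hash, block)                   # promote back to the fast tier
            return block, "warehouse hit (prefill skipped)"
        return None, "miss (full prefill required)"

cache = TieredKVCache(hbm_capacity_blocks=2)
for doc in ("tax_return_a", "tax_return_b", "codebase_c"):
    cache.put(doc, b"...kv...")
print(cache.get("tax_return_a")[1])   # warehouse hit (prefill skipped)
```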
For large inference providers, the result isn’t just better performance — it translates directly to real economic impact.
“Just by adding that accelerated KV cache layer, we're looking at some use cases where the savings amount would be millions of dollars per day,” said Ben-David.
This efficiency multiplier also opens up new strategic options for businesses. Platform teams can design stateful agents without worrying about blowing up memory budgets. Service providers can offer pricing tiers based on persistent context, with cached inference delivered at dramatically lower cost.
NVIDIA projects a 100x increase in inference demand as agentic AI becomes the dominant workload. That pressure is already trickling down from hyperscalers to everyday enterprise deployments — this isn't just a "big tech" problem anymore.
As enterprises move from proofs of concept into real production systems, memory persistence is becoming a core infrastructure concern. Organizations that treat it as an architectural priority rather than an afterthought will gain a clear advantage in both cost and performance.
The memory wall is not something organizations can simply outspend to overcome. As agentic AI scales, it is one of the first AI infrastructure limits that forces a deeper rethink, and as Ben-David’s insights made clear, memory may also be where the next wave of competitive differentiation begins.
The two big stories of AI in 2026 so far have been the incredible rise in usage and praise for Anthropic's Claude Code and a similarly huge boost in adoption for Google's Gemini 3 model family, released late last year. The latter includes Nano Banana Pro (also known as Gemini 3 Pro Image), a powerful, fast, and flexible image generation model that renders complex, text-heavy infographics quickly and accurately, making it an excellent fit for enterprise use (think: collateral, trainings, onboarding, stationery, and so on).
But of course, both of those are proprietary offerings. And yet, open source rivals have not been far behind.
This week, we got a new open-source alternative to Nano Banana Pro in the category of precise, text-heavy image generation: GLM-Image, a 16-billion-parameter model from recently public Chinese startup Z.ai.
By abandoning the industry-standard "pure diffusion" architecture that powers most leading image generator models in favor of a hybrid auto-regressive (AR) + diffusion design, GLM-Image has achieved what was previously thought to be the domain of closed, proprietary models: state-of-the-art performance in generating text-heavy, information-dense visuals like infographics, slides, and technical diagrams.
It even beats Google's Nano Banana Pro on the benchmarks shared by z.ai — though in practice, my own quick usage found it to be far less accurate at instruction following and text rendering (and other users seem to agree).
But for enterprises seeking cost-effective, customizable, permissively licensed alternatives to proprietary AI models, z.ai's GLM-Image may be "good enough" or better to take over the job of primary image generator, depending on their specific use cases and requirements.
The most compelling argument for GLM-Image is not its aesthetics, but its precision. In the CVTG-2k (Complex Visual Text Generation) benchmark, which evaluates a model's ability to render accurate text across multiple regions of an image, GLM-Image scored a Word Accuracy average of 0.9116.
To put that number in perspective, Nano Banana Pro (also referred to as Nano Banana 2.0)—often cited as the benchmark for enterprise reliability—scored 0.7788. This isn't a marginal gain; it is a generational leap in semantic control.
While Nano Banana Pro retains a slight edge in single-stream English long-text generation (0.9808 vs. GLM-Image's 0.9524), it falters significantly when the complexity increases.
As the number of text regions grows, Nano Banana's accuracy remains in the 70s, whereas GLM-Image maintains >90% accuracy even with multiple distinct text elements.
For enterprise use cases—where a marketing slide needs a title, three bullet points, and a caption simultaneously—this reliability is the difference between a production-ready asset and a hallucination.
Unfortunately, my own usage of a demo inference of GLM-Image on Hugging Face proved to be less reliable than the benchmarks might suggest.
My prompt to generate an "infographic labeling all the major constellations visible from the U.S. Northern Hemisphere right now on Jan 14 2026 and putting faded images of their namesakes behind the star connection line diagrams" did not result in what I asked for, instead fulfilling maybe 20% or less of the specified content.
But Google's Nano Banana Pro handled it like a champ, as you'll see below:
Of course, a large portion of this is no doubt due to the fact that Nano Banana Pro is integrated with Google Search, so it can look up information on the web in response to my prompt, whereas GLM-Image is not and therefore likely requires far more specific instructions about the actual text and other content the image should contain.
But still, once you're used to typing a few simple instructions and getting a fully researched, well-populated image from Nano Banana Pro, it's hard to imagine deploying a sub-par alternative unless you have very specific requirements around cost, data residency, and security — or especially strong customization needs within your organization.
Furthermore, Nano Banana Pro still edges out GLM-Image in terms of pure aesthetics — on the OneIG benchmark, Nano Banana Pro scores 0.578 vs. GLM-Image's 0.528 — and indeed, as the top header artwork of this article indicates, GLM-Image does not always render as crisp, finely detailed, and pleasing an image as Google's generator.
Why does GLM-Image succeed where pure diffusion models fail? The answer lies in Z.ai’s decision to treat image generation as a reasoning problem first and a painting problem second.
Standard latent diffusion models (like Stable Diffusion or Flux) attempt to handle global composition and fine-grained texture simultaneously.
This often leads to "semantic drift," where the model forgets specific instructions (like "place the text in the top left") as it focuses on making the pixels look realistic.
GLM-Image decouples these objectives into two specialized "brains" totaling 16 billion parameters:
The Auto-Regressive Generator (The "Architect"): Initialized from Z.ai’s GLM-4-9B language model, this 9-billion parameter module processes the prompt logically. It doesn't generate pixels; instead, it outputs "visual tokens"—specifically semantic-VQ tokens. These tokens act as a compressed blueprint of the image, locking in the layout, text placement, and object relationships before a single pixel is drawn. This leverages the reasoning power of an LLM, allowing the model to "understand" complex instructions (e.g., "A four-panel tutorial") in a way diffusion noise predictors cannot.
The Diffusion Decoder (The "Painter"): Once the layout is locked by the AR module, a 7-billion parameter Diffusion Transformer (DiT) decoder takes over. Based on the CogView4 architecture, this module fills in the high-frequency details—texture, lighting, and style.
By separating the "what" (AR) from the "how" (Diffusion), GLM-Image solves the "dense knowledge" problem. The AR module ensures the text is spelled correctly and placed accurately, while the Diffusion module ensures the final result looks photorealistic.
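In code, that division of labor looks roughly like the stub pipeline below. The function bodies are placeholders, since the actual modules ship as released weights rather than a documented Python API; only the two-stage structure reflects Z.ai's description.

```python
# Schematic of the hybrid AR + diffusion design; bodies are intentionally stubs.
def plan_layout(prompt: str) -> list[int]:
    """'Architect': the 9B auto-regressive module emits semantic-VQ visual tokens
    that fix layout, text placement, and object relationships."""
    raise NotImplementedError

def render_pixels(visual_tokens: list[int], steps: int = 50):
    """'Painter': the 7B diffusion transformer decodes the token blueprint
    into high-frequency detail: texture, lighting, style."""
    raise NotImplementedError

def generate(prompt: str):
    tokens = plan_layout(prompt)       # reasoning and composition happen here
    return render_pixels(tokens)       # realism happens here, layout already locked
```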
The secret sauce of GLM-Image’s performance isn't just the architecture; it is a highly specific, multi-stage training curriculum that forces the model to learn structure before detail.
The training process began by freezing the text word embedding layer of the original GLM-4 model while training a new "vision word embedding" layer and a specialized vision LM head.
This allowed the model to project visual tokens into the same semantic space as text, effectively teaching the LLM to "speak" in images. Crucially, Z.ai implemented MRoPE (Multidimensional Rotary Positional Embedding) to handle the complex interleaving of text and images required for mixed-modal generation.
The model was then subjected to a progressive resolution strategy:
Stage 1 (256px): The model trained on low-resolution, 256-token sequences using a simple raster scan order.
Stage 2 (512px to 1024px): As resolution increased to this mixed range, the team observed a drop in controllability. To fix this, they abandoned simple raster scanning for a progressive generation strategy.
In this advanced stage, the model first generates approximately 256 "layout tokens" from a down-sampled version of the target image.
These tokens act as a structural anchor. By increasing the training weight on these preliminary tokens, the team forced the model to prioritize the global layout—where things are—before generating the high-resolution details. This is why GLM-Image excels at posters and diagrams: it "sketches" the layout first, ensuring the composition is mathematically sound before rendering the pixels.
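A toy version of that weighting trick, assuming 256 layout tokens and an arbitrary 2x layout weight (Z.ai has not published the real values or training code), looks like this in PyTorch:

```python
# Illustrative only: up-weighting the first ~256 "layout tokens" in the training loss
# forces the model to get global composition right before fine detail.
import torch
import torch.nn.functional as F

def weighted_token_loss(logits, targets, n_layout_tokens=256, layout_weight=2.0):
    # logits: (seq_len, vocab), targets: (seq_len,)
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_token)
    weights[:n_layout_tokens] = layout_weight   # prioritize the structural "sketch"
    return (per_token * weights).mean()

logits = torch.randn(1024, 16384)               # dummy sequence of visual tokens
targets = torch.randint(0, 16384, (1024,))
print(weighted_token_loss(logits, targets))
```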
For enterprise CTOs and legal teams, the licensing structure of GLM-Image is a significant competitive advantage over proprietary APIs, though it comes with a minor caveat regarding documentation.
The Ambiguity: There is a slight discrepancy in the release materials. The model’s Hugging Face repository explicitly tags the weights with the MIT License.
However, the accompanying GitHub repository and documentation reference the Apache License 2.0.
Why This Is Still Good News: Despite the mismatch, both licenses are the "gold standard" for enterprise-friendly open source.
Commercial Viability: Both MIT and Apache 2.0 allow for unrestricted commercial use, modification, and distribution. Unlike the OpenRAIL licenses common in other image models (which often restrict specific use cases) or research-only licenses (like early LLaMA releases), GLM-Image is effectively "open for business" immediately.
The Apache Advantage (If Applicable): If the code falls under Apache 2.0, this is particularly beneficial for large organizations. Apache 2.0 includes an explicit patent grant clause, meaning that by contributing to or using the software, contributors grant a patent license to users. This reduces the risk of future patent litigation—a major concern for enterprises building products on top of open-source codebases.
No "Infection": Neither license is "copyleft" (like GPL). You can integrate GLM-Image into a proprietary workflow or product without being forced to open-source your own intellectual property.
For developers, the recommendation is simple: Treat the weights as MIT (per the repository hosting them) and the inference code as Apache 2.0. Both paths clear the runway for internal hosting, fine-tuning on sensitive data, and building commercial products without a vendor lock-in contract.
For the enterprise decision maker, GLM-Image arrives at a critical inflection point. Companies are moving beyond using generative AI for abstract blog headers and into functional territory: multilingual localization of ads, automated UI mockup generation, and dynamic educational materials.
In these workflows, a 5% error rate in text rendering is a blocker. If a model generates a beautiful slide but misspells the product name, the asset is useless. The benchmarks suggest GLM-Image is the first open-source model to cross the threshold of reliability for these complex tasks.
Furthermore, the permissive licensing fundamentally changes the economics of deployment. While Nano Banana Pro locks enterprises into a per-call API cost structure or restrictive cloud contracts, GLM-Image can be self-hosted, fine-tuned on proprietary brand assets, and integrated into secure, air-gapped pipelines without data leakage concerns.
The trade-off for this reasoning capability is compute intensity. The dual-model architecture is heavy. Generating a single 2048x2048 image requires approximately 252 seconds on an H100 GPU. This is significantly slower than highly optimized, smaller diffusion models.
However, for high-value assets—where the alternative is a human designer spending hours in Photoshop—this latency is acceptable.
Z.ai also offers a managed API at $0.015 per image, providing a bridge for teams who want to test the capabilities without investing in H100 clusters immediately.
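A quick back-of-the-envelope comparison shows why the managed API is the easier starting point. The H100 hourly rate below is an assumption that varies widely by provider and commitment level; the per-image latency and API price come from the figures above.

```python
# Rough cost comparison under stated assumptions.
seconds_per_image = 252                          # Z.ai's figure for a 2048x2048 image on an H100
h100_hourly_rate = 2.50                          # assumption; spot/committed rates differ widely
self_hosted_cost = seconds_per_image / 3600 * h100_hourly_rate
managed_api_cost = 0.015                         # Z.ai's published per-image price

print(f"self-hosted:  ~${self_hosted_cost:.3f} per image")   # ~$0.175 at the assumed rate
print(f"managed API:   ${managed_api_cost:.3f} per image")
```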
GLM-Image is a signal that the open-source community is no longer just fast-following proprietary labs; in specific, high-value verticals like knowledge-dense generation, they are now setting the pace. For the enterprise, the message is clear: if your operational bottleneck is the reliability of complex visual content, the solution is no longer necessarily a closed Google product—it might be an open-source model you can run yourself.
Rather than asking how AI agents can work for them, enterprises now face a different question: Are the agents playing well together?
This makes orchestration across multi-agent systems and platforms a critical concern — and a key differentiator.
“Agent-to-agent communications is emerging as a really big deal,” G2’s chief innovation officer Tim Sanders told VentureBeat. “Because if you don't orchestrate it, you get misunderstandings, like people speaking foreign languages to each other. Those misunderstandings reduce the quality of actions and raise the specter of hallucinations, which could be security incidents or data leakage.”
Orchestration to this point has largely been around data, but that's quickly turning to action. "Conductor-like solutions" are increasingly bringing together agents, robotic process automation (RPA), and data repositories. Sanders likened the progression to that of answer engine optimization, which began with monitoring and now creates bespoke content and code.
“Orchestration platforms coordinate a variety of different agentic solutions to increase the consistency of outcomes,” he said.
Early providers include Salesforce MuleSoft, UiPath Maestro, and IBM Watsonx Orchestrate. These “phase one” software-based observability dashboards help IT leaders see all agentic actions across an enterprise.
But coordination can only add so much value; these platforms will morph into technical risk management tools that provide greater quality control. This could include, for instance, agent assessments, policy recommendations, and proactive scoring (such as how reliable agents are when they call on enterprise tools, or how often they hallucinate and when).
Enterprise leaders have become wary of relying on vendors to minimize risks and errors; many IT decision-makers, in fact, do not trust a vendor's statements about the reliability of their agents, he said.
Third-party tools are beginning to bridge the gap and automate tedious guardrail processes and escalation tickets. Teams are already experiencing “ticket exhaustion” in semi-automated systems, where agents hit guardrails and require human permission to proceed.
As an example: The loan process at a bank requires 17 steps for approval, and an agent keeps interrupting human workflows with approval requests when it runs into established guardrails.
Third-party orchestration platforms can manage these tickets, approve or deny them, or even challenge the need for approval altogether. They can eventually eliminate the need for persistent human-in-the-loop oversight so organizations can experience "true velocity gains" measured not in percentages but in multiples (that is, 3X versus 30%).
“Where it goes from there is remote management of the entire agentic process for organizations,” Sanders said.
In another critical evolution in the agentic era, human evaluators will become designers, moving from human-in-the-loop to human-on-the-loop, according to Sanders. That is: They will begin designing agents to automate workflows.
Agent builder platforms continue to innovate their no-code solutions, Sanders said, meaning nearly anyone can now stand up an agent using natural language. “This will democratize agentic AI, and the super skill will be the ability to express a goal, provide context and envision pitfalls, very similar to a good people manager today.”
Agent-first automation stacks “dramatically outperform” hybrid automation stacks in almost every attribute, he noted: satisfaction, quality of actions, security, cost savings.
Organizations should begin "expeditious programs" to infuse agents across workflows, especially highly repetitive work that creates bottlenecks. At first, there will likely be a strong human-in-the-loop element to ensure quality and promote change management.
“Serving as an evaluator will strengthen the understanding of how these systems work,” Sanders said, “and eventually enable all of us to operate upstream in agentic workflows instead of downstream.”
IT leaders should take inventory today of all the different elements of their automation stack. Whether these elements are rules-based automation, RPA, or agentic automation, they must learn everything going on in the organization to optimally use emerging orchestration platforms.
“If they don't, there could actually be dis-synergies across organizations where old school technology and cutting edge technology clash at the point of delivery, oftentimes customer-facing,” Sanders said. “You can't orchestrate what you can't see clearly.”





























