San Francisco startup Anthropic continues to ship new AI products and services at a blistering pace, despite a messy ongoing dispute with the U.S. Department of War.
Today, the company announced Claude Marketplace, a new offering that lets enterprises with an existing Anthropic spend commitment apply part of it toward tools and applications powered by Anthropic's Claude models but made and offered by external partners including GitLab, Harvey, Lovable, Replit, Rogo and Snowflake.
According to Anthropic’s Claude Marketplace FAQ, the program is designed to simplify procurement and consolidate AI spend. Anthropic says the Marketplace is now in limited preview and that enterprises interested in using it should reach out to their Anthropic account team to get started.
Anthropic says purchases made through the Marketplace “count against a portion of your existing Anthropic commitment,” and that the company will manage invoicing for partner spend — meaning enterprises can buy Claude-powered partner solutions through their existing commitment without handling partner invoicing separately. In effect, Anthropic is positioning Claude Marketplace as a centralized procurement channel for certain Claude-powered partner tools.
Yet for many users, the whole point of Anthropic's Claude Code and Claude Cowork applications was that they could shift enterprise spend and time away from third-party software-as-a-service (SaaS) apps and instead "vibe code" new solutions or bespoke, AI-powered workflows. That idea is so pervasive that prior Claude integrations have on several recent occasions triggered major selloffs in SaaS stocks, as investors worried Claude could threaten the underlying companies and applications. Claude Marketplace pushes against that idea, suggesting current SaaS apps remain valuable — and are perhaps even more useful and appealing to enterprises with Claude integrated into them.
The launch raises a broader question about how enterprises will choose to use Claude: directly through Anthropic’s own products and APIs, or through third-party applications that embed Claude for more specialized workflows.
Model and chat platforms have long sought to offer integrations that spare users from building their own versions of popular apps.
OpenAI added third-party apps into ChatGPT and launched a new App Directory in December 2025. This brought in offerings from companies such as Canva, Expedia and Figma that users can invoke with "@" mentions while prompting the chatbot.
However, three months in, it’s unclear exactly how many people use ChatGPT Apps, particularly in enterprises. Will Claude's Marketplace be able to achieve more success here, given rising enterprise adoption of Claude and Anthropic products?
ChatGPT’s integrated apps focused on retail and individual consumer tasks rather than the enterprise more broadly, though the company has also tried to appeal to that market with new plugins for ChatGPT released alongside its new GPT-5.4 this week.
Other AI tool marketplaces have also cropped up. Lightning AI launched an AI Hub last year following similar moves from AWS and Hugging Face. Many AI marketplaces, such as Salesforce's, focus on surfacing AI agents that may already have the capabilities customers need.
How does Anthropic's solution stand out from these? Asked for comment, an Anthropic spokesperson responded:
"Claude is a model — it reasons, writes, analyzes, and codes. But Harvey isn't just Claude with a legal prompt. It's a purpose-built platform built for how legal teams actually work — with the domain expertise, workflow integrations, compliance infrastructure, and institutional knowledge that enterprises require. Same with Rogo for finance, Snowflake for enterprise data, or GitLab for software development. These partners have spent years building the product layer on top of Claude that makes it useful for specific industries and workflows. That's actually the point. Thousands of businesses use Claude to power their products — and the best ones have built something Claude alone can't replicate. Claude Marketplace isn't Anthropic trying to replace those products. It's Anthropic investing in them — making it easier for enterprises to access the best Claude-powered tools without managing a separate procurement process for each one. Claude is the intelligence layer. Our partners are the product."
Enterprise users have adapted their Claude or ChatGPT platforms to recognize preferences, connect to their data sources and retain context. Much of how people use enterprise AI today centers on customizability: making the system work for their needs.
Platforms like OpenClaw have also allowed people to set up autonomous agents with full access to their computers to complete tasks and execute workflows. In other words, Claude and other platforms can already do much of the work that these new third-party Marketplace tools enable — provided they have the right context and data.
However, third-party tools and integrations let enterprise users skip that work and instead invoke an existing tool to handle it. For businesses built around specific, tool-based workflows, the Marketplace may be exactly the right AI integration. There's also a good chance that enterprises already paying for Claude will use the new Marketplace to explore third-party tools and services they wouldn't have otherwise.
While it’s still unclear what Claude Marketplace would look like in action, it’s possible that, with these tools, enterprises could use Claude as an orchestrator, where the platform acts as a command center that taps the right tool and accesses the right context without constantly prompting.
Observers noted that Claude Marketplace offers enterprises a way to “pre-approve” apps, bypassing the often long and cautious approval process.
Some people noted that Anthropic’s move fits how many businesses will want to work: directly within the platform, rather than moving users out to partners’ separate offerings.
Anthropic's biggest challenge with Claude Marketplace, however, is adoption. Many of the partners for its launch already have enterprise customers who deploy their tools through an API or already connect via MCP or other protocols for context.
Some users may have already vibe-coded apps that tap into these integrations. It's now a matter of enterprise users showing they want to use these new tools within their Claude workflows.
As models get smarter and more capable, the "harnesses" around them must also evolve. This "harness engineering" is an extension of context engineering, says LangChain co-founder and CEO Harrison Chase in a new VentureBeat Beyond the Pilot podcast episode. Whereas traditional AI harnesses have tended to keep models from running in loops and calling tools, harnesses built specifically for AI agents allow them to interact more independently and effectively perform long-running tasks.
Chase also weighed in on OpenAI's acquisition of OpenClaw, arguing that its viral success came down to a willingness to "let it rip" in ways that no major lab would — and questioning whether the acquisition actually gets OpenAI closer to a safe enterprise version of the product. “The trend in harnesses is to actually give the large language model (LLM) itself more control over context engineering, letting it decide what it sees and what it doesn't see,” Chase says. “Now, this idea of a long-running, more autonomous assistant is viable.”
While the concept of allowing LLMs to run in a loop and call tools seems relatively simple, it’s difficult to pull off reliably, Chase noted. For a while, models were “below the threshold of usefulness” and simply couldn’t run in a loop, so devs used graphs and wrote chains to get around that. Chase pointed to AutoGPT — once the fastest-growing GitHub project ever — as a cautionary example: same architecture as today's top agents, but the models weren't good enough yet to run reliably in a loop, so it faded fast.
But as LLMs keep improving, teams can construct environments where models can run in loops and plan over longer horizons, and they can continually improve these harnesses. Previously, “you couldn't really make improvements to the harness because you couldn't actually run the model in a harness,” Chase said.
LangChain’s answer to this is Deep Agents, a customizable general-purpose harness. Built on LangChain and LangGraph, it has planning capabilities, a virtual filesystem, context and token management, code execution, and skills and memory functions. Further, it can delegate tasks to subagents; these are specialized with different tools and configurations and can work in parallel. Context is also isolated, meaning subagent work doesn’t clutter the main agent’s context, and large subtask context is compressed into a single result for token efficiency.
All of these agents have access to file systems, Chase explained, and can essentially create to-do lists that they can execute on and track over time. “When it goes on to the next step, and it goes on to step two or step three or step four out of a 200-step process, it has a way to track its progress and keep that coherence,” Chase said.
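The loop-plus-filesystem pattern Chase describes can be sketched in a few lines. Everything below, including the `call_model` contract and the file names, is hypothetical scaffolding meant to show the shape of such a harness; it is not Deep Agents' actual API.

```python
class VirtualFS:
    """Tiny in-memory filesystem the agent uses to persist its plan and results."""
    def __init__(self):
        self.files = {}
    def write(self, path, text):
        self.files[path] = text
    def read(self, path):
        return self.files.get(path, "")

def run_agent(task, call_model, tools, max_steps=200):
    # The harness, not the chat transcript, carries long-horizon state:
    # the model re-reads its own to-do file each turn to stay coherent.
    fs = VirtualFS()
    fs.write("todo.md", f"- [ ] {task}")
    for _ in range(max_steps):
        action = call_model(plan=fs.read("todo.md"))
        if action["type"] == "done":
            break
        result = tools[action["tool"]](**action["args"])
        fs.write("result.md", fs.read("result.md") + str(result) + "\n")
        # The model may rewrite its own plan (e.g. check off a step).
        fs.write("todo.md", action.get("updated_plan", fs.read("todo.md")))
    return fs.read("result.md")
```

The key design choice is that progress lives in files the model can re-read, so a long task survives context compaction between steps.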
“It comes down to letting the LLM write its thoughts down as it goes along, essentially.”
He emphasized that harnesses should be designed so that models can maintain coherence over longer tasks, and be “amenable” to models deciding when to compact context at points the model determines are “advantageous.” Giving agents access to code interpreters and Bash tools also increases flexibility. And providing agents with skills, as opposed to tools all loaded up front, allows them to pull in information when they need it. “So rather than hard-code everything into one big system prompt," Chase explained, "you could have a smaller system prompt: ‘This is the core foundation, but if I need to do X, let me read the skill for X. If I need to do Y, let me read the skill for Y.'"
Essentially, context engineering is a “really fancy” way of saying: What is the LLM seeing? Because that’s different from what developers see, he noted. When human devs can analyze agent traces, they can put themselves in the AI’s “mindset” and answer questions like: What is the system prompt? How is it created? Is it static or is it populated? What tools does the agent have? When it makes a tool call and gets a response back, how is that presented?
“When agents mess up, they mess up because they don't have the right context; when they succeed, they succeed because they have the right context,” Chase said. “I think of context engineering as bringing the right information in the right format to the LLM at the right time.”
Listen to the podcast to hear more about:
How LangChain built its stack: LangGraph as the core pillar, LangChain at the center, Deep Agents on top.
Why code sandboxes will be the next big thing.
How a different type of UX will evolve as agents run at longer intervals (or continuously).
Why traces and observability are core to building an agent that actually works.
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.
Enterprise AI applications that handle large documents or long-horizon tasks face a severe memory bottleneck. As the context grows longer, so does the KV cache, the area where the model’s working memory is stored.
A new technique developed by researchers at MIT addresses this challenge with a fast compression method for the KV cache. The technique, called Attention Matching, manages to compact the context by up to 50x with very little loss in quality.
While it is not the only memory compaction technique available, Attention Matching stands out for its execution speed and impressive information-preserving capabilities.
Large language models generate their responses sequentially, one token at a time. To avoid recalculating the entire conversation history from scratch for every predicted token, the model stores a mathematical representation of every previous token it has processed as key and value pairs. This critical working memory is known as the KV cache.
The KV cache scales with conversation length because the model is forced to retain these keys and values for all previous tokens in a given interaction. This consumes expensive hardware resources. "In practice, KV cache memory is the biggest bottleneck to serving models at ultra-long context," Adam Zweiger, co-author of the paper, told VentureBeat. "It caps concurrency, forces smaller batches, and/or requires more aggressive offloading."
In modern enterprise use cases, such as analyzing massive legal contracts, maintaining multi-session customer dialogues, or running autonomous coding agents, the KV cache can balloon to many gigabytes of memory for a single user request.
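The arithmetic behind that bottleneck is simple. The sketch below estimates KV cache size from standard transformer dimensions; the Llama-3.1-70B-like shape is an illustrative assumption for the math, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys AND values (hence the factor of 2) are stored for every
    # layer, KV head, and token; fp16/bf16 uses 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative Llama-3.1-70B-like shape: 80 layers, 8 KV heads (GQA),
# head_dim 128, at a 128K-token context:
print(kv_cache_bytes(128_000, 80, 8, 128) / 1e9)  # ≈ 41.9 GB per sequence
```

At that size, a 50x compaction would bring a single request's cache under 1 GB, which is why the compression ratio translates directly into serving concurrency.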
To solve this massive bottleneck, the AI industry has tried several strategies, but these methods fall short when deployed in enterprise environments where extreme compression is necessary. A class of technical fixes includes optimizing the KV cache by either evicting tokens the model deems less important or merging similar tokens into a single representation. These techniques work for mild compression but “degrade rapidly at high reduction ratios,” according to the authors.
Real-world applications often rely on simpler techniques, with the most common approach being to simply drop the older context once the memory limit is reached. But this approach causes the model to lose older information as the context grows long. Another alternative is context summarization, where the system pauses, writes a short text summary of the older context, and replaces the original memory with that summary. While this is an industry standard, summarization is highly lossy and heavily damages downstream performance because it might remove pertinent information from the context.
Recent research has proven that it is technically possible to highly compress this memory using a method called Cartridges. However, this approach requires training latent KV cache models through slow, end-to-end mathematical optimization. This gradient-based training can take several hours on expensive GPUs just to compress a single context, making it completely unviable for real-time enterprise applications.
Attention Matching achieves high-level compaction ratios and quality while being orders of magnitude faster than gradient-based optimization. It bypasses the slow training process through clever mathematical tricks.
The researchers realized that to perfectly mimic how an AI interacts with its memory, they need to preserve two mathematical properties when compressing the original key and value vectors into a smaller footprint. The first is the “attention output,” which is the actual information the AI extracts when it queries its memory. The second is the “attention mass,” which acts as the mathematical weight that a token has relative to everything else in the model’s working memory. If the compressed memory can match these two properties, it will behave exactly like the massive, original memory, even when new, unpredictable user prompts are added later.
"Attention Matching is, in some ways, the 'correct' objective for doing latent context compaction in that it directly targets preserving the behavior of each attention head after compaction," Zweiger said. While token-dropping and related heuristics can work, explicitly matching attention behavior simply leads to better results.
Before compressing the memory, the system generates a small set of “reference queries” that act as a proxy for the types of internal searches the model is likely to perform when reasoning about the specific context. If the compressed memory can accurately answer these reference queries, it will very likely succeed at answering the user's actual questions later. The authors suggest various methods for generating these reference queries, including appending a hidden prompt to the document telling the model to repeat the previous context, known as the “repeat-prefill” technique. They also suggest a “self-study” approach where the model is prompted to perform a few quick synthetic tasks on the document, such as aggregating all key facts or structuring dates and numbers into a JSON format.
With these queries in hand, the system picks a set of keys to preserve in the compacted KV cache based on signals like the highest attention value. It then uses the keys and reference queries to calculate the matching values along with a scalar bias term. This bias ensures that pertinent information is preserved, allowing each retained key to represent the mass of many removed keys.
This formulation makes it possible to fit the values with simple algebraic techniques, such as ordinary least squares and nonnegative least squares, entirely avoiding compute-heavy gradient-based optimization. This is what makes Attention Matching super fast in comparison to optimization-heavy compaction methods. The researchers also apply chunked compaction, processing contiguous chunks of the input independently and concatenating them, to further improve performance on long contexts.
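As a rough illustration of that fitting step, the single-head numpy sketch below retains a subset of keys and solves an ordinary-least-squares problem so the compacted cache reproduces the original attention outputs on the reference queries. It omits the attention-mass bias term, the 1/sqrt(d) scaling, multi-head structure and chunking, so treat it as a simplified sketch of the idea, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def compact_kv(K, V, Q_ref, keep_idx):
    # K: (n, d) keys, V: (n, d) values, Q_ref: (m, d) reference queries,
    # keep_idx: indices of the keys retained in the compacted cache.
    A = softmax(Q_ref @ K.T)        # (m, n) original attention weights
    O = A @ V                       # (m, d) attention outputs to preserve
    K_c = K[keep_idx]               # retained keys
    A_c = softmax(Q_ref @ K_c.T)    # (m, k) attention over compacted cache
    # Ordinary least squares: fit compacted values so A_c @ V_c ≈ O.
    V_c, *_ = np.linalg.lstsq(A_c, O, rcond=None)
    return K_c, V_c
```

The `lstsq` call is where the speedup comes from: a closed-form solve replaces hours of gradient-based optimization per context.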
To understand how this method performs in the real world, the researchers ran a series of stress tests using popular open-source models like Llama 3.1 and Qwen-3 on two distinct types of enterprise datasets. The first was QuALITY, a standard reading comprehension benchmark using 5,000 to 8,000-word documents. The second, representing a true enterprise challenge, was LongHealth, a highly dense, 60,000-token dataset containing the complex medical records of multiple patients.
The key finding was the ability of Attention Matching to compact the model’s KV cache by 50x without reducing the accuracy, while taking only seconds to process the documents. To achieve that same level of quality previously, Cartridges required hours of intensive GPU computation per context.
When dealing with the dense medical records, standard industry workarounds completely collapsed. The researchers noted that when they tried to use standard text summarization on these patient records, the model’s accuracy dropped so low that it matched the “no-context” baseline, meaning the AI performed as if it had not read the document at all.
Attention Matching drastically outperforms summarization, but enterprise architects will need to dial down the compression ratio for dense tasks compared to simpler reading comprehension tests. As Zweiger explains, "The main practical tradeoff is that if you are trying to preserve nearly everything in-context on highly information-dense tasks, you generally need a milder compaction ratio to retain strong accuracy."
The researchers also explored what happens in cases where absolute precision isn't necessary but extreme memory savings are. They ran Attention Matching on top of a standard text summary. This combined approach achieved 200x compression. It successfully matched the accuracy of standard summarization alone, but with a very small memory footprint.
One of the interesting experiments for enterprise workflows was testing online compaction, though they note that this is a proof of concept and has not been tested rigorously in production environments. The researchers tested the model on the advanced AIME math reasoning test. They forced the AI to solve a problem with a strictly capped physical memory limit. Whenever the model’s memory filled up, the system paused, instantly compressed its working memory by 50 percent using Attention Matching, and let it continue thinking. Even after hitting the memory wall and having its KV cache shrunk up to six consecutive times mid-thought, the model successfully solved the math problems. Its performance matched a model that had been given massive, unlimited memory.
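The control flow of that online-compaction experiment is straightforward to sketch. In the toy below, the compact step simply keeps recent entries as a stand-in for Attention Matching; the point is the trigger logic under a hard memory cap, not the compaction math.

```python
class TinyCache:
    """Toy stand-in for a KV cache; compact() keeps only recent entries
    as a placeholder for a real Attention Matching compaction step."""
    def __init__(self):
        self.tokens = []
    def __len__(self):
        return len(self.tokens)
    def compact(self, ratio):
        keep = max(1, int(len(self.tokens) * ratio))
        self.tokens = self.tokens[-keep:]

def decode_with_cap(next_token, cache, n_steps, cache_limit, ratio=0.5):
    # Online compaction trigger: before each decode step, check the cap;
    # if the cache is full, shrink it in place and keep generating.
    compactions = 0
    for _ in range(n_steps):
        if len(cache) >= cache_limit:   # memory wall hit mid-generation
            cache.compact(ratio)
            compactions += 1
        cache.tokens.append(next_token())
    return compactions
```

In the paper's setup, each trigger shrinks the cache by 50 percent, and the model tolerated up to six such shrinks in a single reasoning trace.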
There are caveats to consider. At a 50x compression ratio, Attention Matching is the clear winner in balancing speed and quality. However, if an enterprise attempts to push compression to extreme 100x limits on highly complex data, the slower, gradient-based Cartridges method actually outperforms it.
The researchers have released the code for Attention Matching. However, they note that this is not currently a simple plug-and-play software update. "I think latent compaction is best considered a model-layer technique," Zweiger notes. "While it can be applied on top of any existing model, it requires access to model weights." This means enterprises relying entirely on closed APIs cannot implement this themselves; they need open-weight models.
The authors note that integrating this latent-space KV compaction into existing, highly optimized commercial inference engines still requires significant effort. Modern AI infrastructure uses complex tricks like prefix caching and variable-length memory packing to keep servers running efficiently, and seamlessly weaving this new compaction technique into those existing systems will take dedicated engineering work. However, there are immediate enterprise applications. "We believe compaction after ingestion is a promising use case, where large tool call outputs or long documents are compacted right after being processed," Zweiger said.
Ultimately, the shift toward mechanical, latent-space compaction aligns with the future product roadmaps of major AI players, Zweiger argues. "We are seeing compaction to shift from something enterprises implement themselves into something model providers ship," Zweiger said. "This is even more true for latent compaction, where access to model weights is needed. For example, OpenAI now exposes a black-box compaction endpoint that returns an opaque object rather than a plain-text summary."
Google senior AI product manager Shubham Saboo has turned one of the thorniest problems in agent design into an open-source engineering exercise: persistent memory.
This week, he published an open-source “Always On Memory Agent” on the official Google Cloud Platform GitHub page under a permissive MIT license, which allows commercial use.
It was built with Google's Agent Development Kit (ADK), introduced in spring 2025, and Gemini 3.1 Flash-Lite, a low-cost model Google introduced on March 3, 2026, as its fastest and most cost-efficient Gemini 3 series model.
The project serves as a practical reference implementation for something many AI teams want but few have productionized cleanly: an agent system that can ingest information continuously, consolidate it in the background, and retrieve it later without relying on a conventional vector database.
For enterprise developers, the release matters less as a product launch than as a signal about where agent infrastructure is headed.
The repo packages a view of long-running autonomy that is increasingly attractive for support systems, research assistants, internal copilots and workflow automation. It also brings governance questions into sharper focus as soon as memory stops being session-bound.
The repo also appears to use a multi-agent internal architecture, with specialist components handling ingestion, consolidation and querying.
But the supplied materials do not clearly establish a broader claim that this is a shared memory framework for multiple independent agents.
That distinction matters. ADK as a framework supports multi-agent systems, but this specific repo is best described as an always-on memory agent, or memory layer, built with specialist subagents and persistent storage.
Even at this narrower level, it addresses a core infrastructure problem many teams are actively working through.
According to the repository, the agent runs continuously, ingests files or API input, stores structured memories in SQLite, and performs scheduled memory consolidation every 30 minutes by default.
A local HTTP API and Streamlit dashboard are included, and the system supports text, image, audio, video and PDF ingestion. The repo frames the design with an intentionally provocative claim: “No vector database. No embeddings. Just an LLM that reads, thinks, and writes structured memory.”
That design choice is likely to draw attention from developers managing cost and operational complexity. Traditional retrieval stacks often require separate embedding pipelines, vector storage, indexing logic and synchronization work.
Saboo's example instead leans on the model to organize and update memory directly. In practice, that can simplify prototypes and reduce infrastructure sprawl, especially for smaller or medium-memory agents. It also shifts the performance question from vector search overhead to model latency, memory compaction logic and long-run behavioral stability.
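A minimal version of that "no vector database" pattern, with an LLM writing structured rows into SQLite, might look like the following. The schema and the `extract_memories` contract are hypothetical; the repo's actual tables and prompts may differ.

```python
import sqlite3
import time

def init_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS memories (
        id INTEGER PRIMARY KEY,
        topic TEXT,
        fact TEXT,
        created REAL)""")
    return db

def ingest(db, raw_text, extract_memories):
    # extract_memories stands in for an LLM call that reads raw input and
    # returns structured records like [{"topic": ..., "fact": ...}].
    for m in extract_memories(raw_text):
        db.execute(
            "INSERT INTO memories (topic, fact, created) VALUES (?, ?, ?)",
            (m["topic"], m["fact"], time.time()))
    db.commit()

def recall(db, topic):
    # Retrieval is an ordinary SQL query, not a vector similarity search.
    rows = db.execute(
        "SELECT fact FROM memories WHERE topic = ? ORDER BY created, id",
        (topic,)).fetchall()
    return [r[0] for r in rows]
```

A background consolidation pass in this design would be another LLM call that reads rows, merges duplicates and writes the result back, which is exactly where the critics' drift and governance concerns enter.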
That is where Gemini 3.1 Flash-Lite enters the story.
Google says the model is built for high-volume developer workloads at scale and priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens.
The company also says Flash-Lite is 2.5 times faster than Gemini 2.5 Flash in time to first token and delivers a 45% increase in output speed while maintaining similar or better quality.
On Google’s published benchmarks, the model posts an Elo score of 1432 on Arena.ai, 86.9% on GPQA Diamond and 76.8% on MMMU Pro. Google positions those characteristics as a fit for high-frequency tasks such as translation, moderation, UI generation and simulation.
Those numbers help explain why Flash-Lite is paired with a background-memory agent. A 24/7 service that periodically re-reads, consolidates and serves memory needs predictable latency and low enough inference cost to avoid making “always on” prohibitively expensive.
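Those per-token prices make the always-on economics easy to sanity-check. The load profile below, a consolidation pass every 30 minutes over roughly 20K tokens of memory, is a hypothetical workload for illustration, not a measured one.

```python
def daily_cost_usd(input_tokens, output_tokens,
                   in_price=0.25, out_price=1.50):
    # Prices are per 1M tokens (Gemini 3.1 Flash-Lite, per the article).
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Hypothetical always-on load: a consolidation pass every 30 minutes
# (48 runs/day) that re-reads ~20K tokens of memory and writes ~2K:
cost = daily_cost_usd(48 * 20_000, 48 * 2_000)  # well under $1/day
```

Even an order-of-magnitude heavier workload stays in single-digit dollars per day, which is what makes a 24/7 background agent plausible at all.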
Google’s ADK documentation reinforces the broader story. The framework is presented as model-agnostic and deployment-agnostic, with support for workflow agents, multi-agent systems, tools, evaluation and deployment targets including Cloud Run and Vertex AI Agent Engine. That combination makes the memory agent feel less like a one-off demo and more like a reference point for a broader agent runtime strategy.
Public reaction shows why enterprise adoption of persistent memory will not hinge on speed or token pricing alone.
Several responses on X highlighted exactly the concerns enterprise architects are likely to raise. Franck Abe called Google ADK and 24/7 memory consolidation “brilliant leaps for continuous agent autonomy,” but warned that an agent “dreaming” and cross-pollinating memories in the background without deterministic boundaries becomes “a compliance nightmare.”
ELED made a related point, arguing that the main cost of always-on agents is not tokens but “drift and loops.”
Those critiques go directly to the operational burden of persistent systems: who can write memory, what gets merged, how retention works, when memories are deleted, and how teams audit what the agent learned over time?
Another reaction, from Iffy, challenged the repo’s “no embeddings” framing, arguing that the system still has to chunk, index and retrieve structured memory, and that it may work well for small-context agents but break down once memory stores become much larger.
That criticism is technically important. Removing a vector database does not remove retrieval design; it changes where the complexity lives.
For developers, the tradeoff is less about ideology than fit. A lighter stack may be attractive for low-cost, bounded-memory agents, while larger-scale deployments may still demand stricter retrieval controls, more explicit indexing strategies and stronger lifecycle tooling.
Other commenters focused on developer workflow. One asked for the ADK repo and documentation and wanted to know whether the runtime is serverless or long-running, and whether tool-calling and evaluation hooks are available out of the box.
Based on the supplied materials, the answer is effectively both: the memory-agent example itself is structured like a long-running service, while ADK more broadly supports multiple deployment patterns and includes tools and evaluation capabilities.
The always-on memory agent is interesting on its own, but the larger message is that Saboo is trying to make agents feel like deployable software systems rather than isolated prompts. In that framing, memory becomes part of the runtime layer, not just an add-on feature.
What Saboo has not shown yet is just as important as what he's published.
The provided materials do not include a direct Flash-Lite versus Anthropic Claude Haiku benchmark for agent loops in production use.
They also do not lay out enterprise-grade compliance controls specific to this memory agent, such as: deterministic policy boundaries, retention guarantees, segregation rules or formal audit workflows.
And while the repo appears to use multiple specialist agents internally, the materials do not clearly prove a larger claim about persistent memory shared across multiple independent agents.
For now, the repo reads as a compelling engineering template rather than a complete enterprise memory platform.
Still, the release lands at the right time. Enterprise AI teams are moving beyond single-turn assistants and into systems expected to remember preferences, preserve project context and operate across longer horizons.
Saboo's open-source memory agent offers a concrete starting point for that next layer of infrastructure, and Flash-Lite gives the economics some credibility.
But the strongest takeaway from the reaction around the launch is that continuous memory will be judged on governance as much as capability.
That is the real enterprise question behind Saboo's demo: not whether an agent can remember, but whether it can remember in ways that stay bounded, inspectable and safe enough to trust in production.
What's old is new again: the command line — the original, clunky non-graphical interface for controlling computers, where the user types raw text commands — has become one of the most important interfaces in agentic AI.
That shift has been driven in part by the rise of coding-native tools such as Claude Code and Kilo CLI, which have helped establish a model where AI agents do not just answer questions in chat windows but execute real tasks through a shared, scriptable interface already familiar to developers — and which can still be found on virtually all PCs.
For developers, the appeal is practical: the CLI is inspectable, composable and easier to control than a patchwork of custom app integrations.
Now, Google Workspace — the umbrella term for Google's suite of enterprise cloud apps, including Drive, Gmail, Calendar, Sheets, Docs, Chat and Admin — is moving into that pattern with a new CLI that lets developers and AI agents access these applications and the data within them directly, without relying on third-party connectors.
The project, googleworkspace/cli, describes itself as “one CLI for all of Google Workspace — built for humans and AI agents,” with structured JSON output and agent-oriented workflows included.
In an X post yesterday, Google Cloud director Addy Osmani introduced the Google Workspace CLI as “built for humans and agents,” adding that it covers “Google Drive, Gmail, Calendar, and every Workspace API.”
While the tool is not officially supported by Google, other posts cast the release as a broader turning point for automation and agent access to enterprise productivity software.
Instead of setting up third-party connectors like Zapier to access data and use AI agents to automate work across the Google Workspace suite, enterprise developers (or indie devs and users, for that matter) can now install the open-source (Apache 2.0) Google Workspace CLI from GitHub and begin building automated agentic workflows directly in the terminal — asking their AI model to sort email, respond, edit docs and files, and more.
For enterprise developers, the importance of the release is not that Google suddenly made Workspace programmable. Workspace APIs have long been available. What changes here is the interface.
Instead of forcing teams to build and maintain separate wrappers around individual APIs, the CLI offers a unified command surface with structured output.
Installation is straightforward — npm install -g @googleworkspace/cli — and the repo says the package includes prebuilt binaries, with releases also available through GitHub.
The repo also says gws reads Google’s Discovery Service at runtime and dynamically builds its command surface, allowing new Workspace API methods to appear without waiting for a manually maintained static tool definition to catch up.
For teams building agents or internal automation, that is a meaningful operational advantage. It reduces glue code, lowers maintenance overhead and makes Workspace easier to treat as a programmable runtime rather than a collection of separate SaaS applications.
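The repo's claim that gws builds its command surface dynamically from Google's Discovery Service suggests a pattern like the following. This is an illustrative sketch only, not the tool's actual implementation: the sample document and the `flatten_commands` helper are invented for demonstration, though the nested `resources`/`methods` shape mirrors how real Discovery documents are organized.

```python
# Hypothetical sketch: deriving a CLI command surface from a Discovery-style
# API description at runtime. The sample document below is illustrative,
# not a real Discovery Service response.

def flatten_commands(resources, prefix=""):
    """Walk a Discovery-style resources/methods tree into CLI-like command names."""
    commands = []
    for name, resource in resources.items():
        for method in resource.get("methods", {}):
            commands.append(f"{prefix}{name} {method}")
        # Resources can nest (e.g. users.messages in the Gmail API).
        commands.extend(
            flatten_commands(resource.get("resources", {}), prefix=f"{prefix}{name} ")
        )
    return commands

sample_discovery = {
    "resources": {
        "files": {"methods": {"list": {}, "create": {}}},
        "permissions": {"methods": {"list": {}}},
    }
}

print(flatten_commands(sample_discovery["resources"]))
# → ['files list', 'files create', 'permissions list']
```

The point of the pattern: when a new method appears in the upstream API description, it surfaces as a command automatically, with no hand-maintained tool definition to update.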
The CLI is designed for both direct human use and agent-driven workflows. For developers working in the terminal, the README highlights features such as per-resource help, dry-run previews, schema inspection and auto-pagination.
For agents, the value is clearer still: structured JSON output, reusable commands and built-in skills that let models interact with Workspace data and actions without a custom integration layer.
That creates immediate utility for internal enterprise workflows. Teams can use the tool to list Drive files, create spreadsheets, inspect request and response schemas, send Chat messages and paginate through large result sets from the terminal. The README also says the repo ships more than 100 agent skills, including helpers and curated recipes for Gmail, Drive, Docs, Calendar and Sheets.
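To see why structured JSON output matters for automation, consider a script that filters a file listing. This is a hedged sketch: the field names below are assumptions for illustration, not gws's actual output schema, and in practice the JSON would come from capturing a command's stdout (e.g. via `subprocess.run`) rather than a canned string.

```python
import json

# Illustrative stand-in for structured CLI output; the field names are
# assumptions, not the tool's real schema.
raw_output = json.dumps([
    {"name": "Q3 budget", "mimeType": "application/vnd.google-apps.spreadsheet"},
    {"name": "Launch plan", "mimeType": "application/vnd.google-apps.document"},
])

def spreadsheets_only(json_text):
    """Filter a JSON file listing down to spreadsheet names."""
    files = json.loads(json_text)
    return [f["name"] for f in files
            if f["mimeType"] == "application/vnd.google-apps.spreadsheet"]

print(spreadsheets_only(raw_output))
# → ['Q3 budget']
```

Because the output is machine-readable rather than formatted for human eyes, an agent (or a shell pipeline) can chain commands without brittle text parsing.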
That matters because Workspace remains one of the most common systems of record for day-to-day business work. Email, calendars, internal docs, spreadsheets and shared files are often where operational context lives. A CLI that exposes those surfaces through a common, agent-friendly interface makes it easier to build assistants that retrieve information, trigger actions and automate repetitive processes with less bespoke plumbing.
The social-media response has been enthusiastic, but enterprises should read the repo carefully before treating the project as a formal Google platform commitment.
The README explicitly says: “This is not an officially supported Google product”. It also says the project is under active development and warns users to expect breaking changes as it moves toward v1.0.
That does not diminish the technical relevance of the release. It does, however, shape how enterprise teams should think about adoption. Today, this looks more like a promising developer tool with strong momentum than a production platform that large organizations should standardize on immediately.
The other key point is that the CLI does not bypass the underlying controls that govern Workspace access.
The documentation says users still need a Google Cloud project for OAuth credentials and a Google account with Workspace access. It also outlines multiple authentication patterns for local development, CI and service accounts, along with instructions for enabling APIs and handling setup issues.
For enterprises, that is the right way to interpret the tool. It is not magic access to Gmail, Docs or Sheets. It is a more usable abstraction over the same permissions, scopes and admin controls companies already manage.
Some of the early commentary around the tool frames it as a cleaner alternative to Model Context Protocol (MCP)-heavy setups, arguing that CLI-driven execution can avoid wasting context window on large tool definitions. There is some logic to that argument, especially for agent systems that can call shell commands directly and parse JSON responses.
But the repo itself presents a more nuanced picture. It includes a Gemini CLI extension that gives Gemini agents access to gws commands and Workspace agent skills after terminal authentication. It also includes an MCP server mode through gws mcp, exposing Workspace APIs as structured tools for MCP-compatible clients including Claude Desktop, Gemini CLI and VS Code.
The strategic takeaway is not that Google Workspace is choosing CLI instead of MCP. It is that the CLI is emerging as the base interface, with MCP available where it makes sense.
The right near-term move for enterprises is not broad rollout. It is targeted evaluation.
Developer productivity, platform engineering and IT automation teams should test the tool in a sandboxed Workspace environment and identify a narrow set of high-friction use cases where a CLI-first approach could reduce integration work. File discovery, spreadsheet updates, document generation, calendar operations and internal reporting are natural starting points.
Security and identity teams should review authentication patterns early and determine how tightly permissions, scopes and service-account usage can be constrained and monitored. AI platform teams, meanwhile, should compare direct CLI execution against MCP-based approaches in real workflows, focusing on reliability, prompt overhead and operational simplicity.
The broader trend is clear. As agentic software matures, the command line is becoming a common control plane for both developers and AI systems. Google Workspace’s new CLI does not change enterprise automation overnight. But it does make one of the most widely used productivity stacks easier to access through the interface that agent builders increasingly prefer.
The AI updates aren't slowing down. Literally two days after OpenAI launched a new underlying AI model for ChatGPT called GPT-5.3 Instant, the company has unveiled another, even more massive upgrade: GPT-5.4.
Actually, GPT-5.4 comes in two varieties: GPT-5.4 Thinking and GPT-5.4 Pro, the latter designed for the most complex tasks.
Both will be available in OpenAI's paid application programming interface (API) and Codex software development application, while GPT-5.4 Thinking will be available to all paid subscribers of ChatGPT (Plus, the $20-per-month plan, and up) and Pro will be reserved for ChatGPT Pro ($200 monthly) and Enterprise plan users.
ChatGPT Free users will also get a taste of GPT-5.4, but only when their queries are auto-routed to the model, according to an OpenAI spokesperson.
The big headlines on this release are efficiency, with OpenAI reporting that GPT-5.4 uses far fewer tokens (47% fewer on some tasks) than its predecessors, and, arguably even more impressively, a new "native" Computer Use mode, available through the API and Codex, that lets GPT-5.4 navigate a user's computer like a human and work across applications.
The company is also releasing a new suite of ChatGPT integrations allowing GPT-5.4 to be plugged directly into users' Microsoft Excel and Google Sheets spreadsheets and cells, enabling granular analysis and automated task completion that should speed up work across the enterprise, but may make fears of white collar layoffs even more pronounced on the heels of similar offerings from Anthropic's Claude and its new Cowork application.
OpenAI says GPT-5.4 supports up to 1 million tokens of context in the API and Codex, enabling agents to plan, execute and verify tasks across long horizons. However, the company charges double the cost per 1 million tokens once the input exceeds 272,000 tokens.
The most consequential capability OpenAI highlights is that GPT-5.4 is its first general-purpose model released with native, state-of-the-art computer-use capabilities in Codex and the API, enabling agents to operate computers and carry out multi-step workflows across applications.
OpenAI says the model can both write code to operate computers via libraries like Playwright and issue mouse and keyboard commands in response to screenshots. OpenAI also claims a jump in agentic web browsing.
Benchmark results are presented as evidence that this is not merely a UI wrapper.
On BrowseComp, which measures how well AI agents can persistently browse the web to find hard-to-locate information, OpenAI reports GPT-5.4 improving by 17% absolute over GPT-5.2, and GPT-5.4 Pro reaching 89.3%, described as a new state of the art.
On OSWorld-Verified, which measures desktop navigation using screenshots plus keyboard and mouse actions, OpenAI reports GPT-5.4 at 75.0% success, compared to 47.3% for GPT-5.2, and notes reported human performance at 72.4%.
On WebArena-Verified, GPT-5.4 reaches 67.3% success using both DOM- and screenshot-driven interaction, compared to 65.4% for GPT-5.2. On Online-Mind2Web, OpenAI reports 92.8% success using screenshot-based observations alone.
OpenAI also links computer use to improvements in vision and document handling. On MMMU-Pro, GPT-5.4 reaches 81.2% success without tool use, compared with 79.5% for GPT-5.2, and OpenAI says it achieves that result using a fraction of the “thinking tokens.”
On OmniDocBench, GPT-5.4’s average error is reported at 0.109, improved from 0.140 for GPT-5.2. The post also describes expanded support for high-fidelity image inputs, including an “original” detail level up to 10.24M pixels.
OpenAI positions GPT-5.4 as built for longer, multi-step workflows—work that increasingly looks like an agent keeping state across many actions rather than a chatbot responding once.
As tool ecosystems get larger, OpenAI argues that the naive approach—dumping every tool definition into the prompt—creates a tax paid on every request: cost, latency, and context pollution.
GPT-5.4 introduces tool search in the API as a structural fix. Instead of receiving all tool definitions upfront, the model receives a lightweight list of tools plus a search capability, and it retrieves full tool definitions only when they’re actually needed.
OpenAI describes the efficiency win with a concrete comparison: on 250 tasks from Scale’s MCP Atlas benchmark, running with 36 MCP servers enabled, the tool-search configuration reduced total token usage by 47% while achieving the same accuracy as a configuration that exposed all MCP functions directly in context.
That 47% figure is specifically about the tool-search setup in that evaluation—not a blanket claim that GPT-5.4 uses 47% fewer tokens for every kind of task.
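The tool-search pattern itself is straightforward to sketch. The following is a conceptual illustration, not OpenAI's actual API: the registry, function names and definitions are invented to show why an index-plus-retrieval design saves context compared with dumping every definition into the prompt.

```python
# Conceptual sketch of the tool-search pattern (not OpenAI's real API).
# The model's prompt carries only a lightweight index; full definitions
# are fetched on demand when the model actually needs a tool.

FULL_DEFINITIONS = {  # imagine hundreds of these, each hundreds of tokens long
    "create_invoice": {"description": "Create an invoice", "parameters": {}},
    "send_email": {"description": "Send an email", "parameters": {}},
    "query_crm": {"description": "Query CRM records", "parameters": {}},
}

def lightweight_index():
    """What goes into context up front: names only, a few tokens per tool."""
    return sorted(FULL_DEFINITIONS)

def search_tools(query):
    """Invoked by the model only when it needs a full definition."""
    return {name: spec for name, spec in FULL_DEFINITIONS.items() if query in name}

print(lightweight_index())        # cheap index, always in context
print(search_tools("invoice"))    # one full definition, fetched when needed
```

The savings scale with the size of the tool catalog: with dozens of MCP servers attached, most definitions are never needed on any given request, so keeping them out of the prompt avoids paying for them on every call.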
OpenAI’s coding pitch is that GPT-5.4 combines the coding strengths of GPT-5.3-Codex with stronger tool and computer-use capabilities that matter when tasks aren’t single-shot.
GPT-5.4 matches or outperforms GPT-5.3-Codex on SWE-Bench Pro while delivering lower latency across reasoning effort levels.
Codex also gets workflow-level knobs. OpenAI says /fast mode delivers up to 1.5× faster performance across supported models, including GPT-5.4, describing it as the same model and intelligence “just faster.”
And it describes releasing an experimental Codex skill, “Playwright (Interactive)”, meant to demonstrate how coding and computer use can work in tandem—visually debugging web and Electron apps and testing an app as it’s being built.
Alongside GPT-5.4, OpenAI is announcing a suite of secure AI products in ChatGPT built for enterprises and financial institutions, powered by GPT-5.4 for advanced financial reasoning and Excel-based modeling.
The centerpiece is ChatGPT for Excel and Google Sheets (beta), which OpenAI describes as ChatGPT embedded directly in spreadsheets to build, analyze, and update complex financial models using the formulas and structures teams already rely on.
The suite also includes new ChatGPT app integrations intended to unify market, company, and internal data into a single workflow, naming FactSet, MSCI, Third Bridge, and Moody’s.
And it introduces reusable “Skills” for recurring finance work such as earnings previews, comparables analysis, DCF analysis, and investment memo drafting.
OpenAI anchors the finance push with an internal benchmark claim: model performance increased from 43.7% with GPT-5 to 88.0% with GPT-5.4 Thinking on an OpenAI internal investment banking benchmark.
OpenAI leans on benchmarks intended to resemble real office deliverables, not just puzzle-solving. On GDPval, an evaluation spanning “well-specified knowledge work” across 44 occupations, OpenAI reports that GPT-5.4 matches or exceeds industry professionals in 83.0% of comparisons, compared to 71.0% for GPT-5.2.
The company also highlights specific improvements in the kinds of artifacts that tend to expose model weaknesses: structured tables, formulas, narrative coherence, and design quality.
In an internal benchmark of spreadsheet modeling tasks modeled after what a junior investment banking analyst might do, GPT-5.4 reaches a mean score of 87.5%, compared to 68.4% for GPT-5.2.
And on a set of presentation evaluation prompts, OpenAI says human raters preferred GPT-5.4’s presentations 68.0% of the time over GPT-5.2’s, citing stronger aesthetics, greater visual variety, and more effective use of image generation.
OpenAI describes GPT-5.4 as its most factual model yet and connects that claim to a practical dataset: de-identified prompts where users previously flagged factual errors. On that set, OpenAI reports GPT-5.4’s individual claims are 33% less likely to be false and its full responses are 18% less likely to contain any errors compared to GPT-5.2.
In statements provided to VentureBeat by OpenAI and attributed to early GPT-5.4 testers, Daniel Swiecki of Walleye Capital says that on internal finance and Excel evaluations, GPT-5.4 improved accuracy by 30 percentage points, which he links to expanded automation for model updates and scenario analysis.
Brendan Foody, CEO of Mercor, calls GPT-5.4 the best model the company has tried and says it’s now top of Mercor’s APEX-Agents benchmark for professional services work, emphasizing long-horizon deliverables like slide decks, financial models, and legal analysis.
In the API, OpenAI says GPT-5.4 Thinking is available as gpt-5.4 and GPT-5.4 Pro as gpt-5.4-pro. Pricing is as follows:
GPT-5.4: $2.50 / 1M input tokens; $15 / 1M output tokens
GPT-5.4 Pro: $30 / 1M input tokens; $180 / 1M output tokens
Batch + Flex: half-rate; Priority processing: 2× rate
This makes GPT-5.4 among the more expensive models in the field to run via API, as seen in the table below.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Total |
| --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 |
| GLM-5 | $1.00 | $3.20 | $4.20 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 |
| Qwen3-Max (2026-01-23) | $1.20 | $6.00 | $7.20 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| GPT-5.2 | $1.75 | $14.00 | $15.75 |
| GPT-5.4 | $2.50 | $15.00 | $17.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 |
| GPT-5.4 Pro | $30.00 | $180.00 | $210.00 |
Another important note: with GPT-5.4, requests that exceed 272,000 input tokens are billed at 2X the normal rate, reflecting the ability to send prompts larger than earlier models supported.
In Codex, compaction defaults to 272K tokens, and the higher long-context pricing applies only when the input exceeds 272K. In other words, developers can keep sending prompts at or under that size without triggering the higher rate, or opt into larger prompts by raising the compaction limit, with only those larger requests billed at the higher rate.
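The pricing mechanics can be worked through as simple arithmetic. The sketch below uses the published GPT-5.4 rates and assumes, as the pricing note implies, that the 2× long-context rate applies to a request's input tokens once the request exceeds 272,000 input tokens; OpenAI's documentation should be consulted for the exact billing boundary.

```python
# Back-of-envelope GPT-5.4 cost estimate from the published per-token rates.
# Assumption (hedged): the 2x surcharge applies to input tokens for any
# request whose input exceeds the 272K threshold.

INPUT_RATE = 2.50 / 1_000_000    # USD per input token
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token
LONG_CONTEXT_THRESHOLD = 272_000

def estimate_cost(input_tokens, output_tokens):
    multiplier = 2 if input_tokens > LONG_CONTEXT_THRESHOLD else 1
    return input_tokens * INPUT_RATE * multiplier + output_tokens * OUTPUT_RATE

print(f"${estimate_cost(100_000, 10_000):.2f}")  # under the threshold: $0.40
print(f"${estimate_cost(500_000, 10_000):.2f}")  # over it, input at 2x: $2.65
```

The practical takeaway is that crossing the 272K line roughly doubles the input side of the bill, which is why Codex's default compaction at exactly that boundary matters for cost control.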
An OpenAI spokesperson said that in the API the maximum output is 128,000 tokens, the same as previous models.
Finally, on why GPT-5.4 is priced higher at baseline, the spokesperson attributed it to three factors: higher capability on complex tasks (including coding, computer use, deep research, advanced document generation, and tool use), major research improvements from OpenAI’s roadmap, and more efficient reasoning that uses fewer reasoning tokens for comparable tasks—adding that OpenAI believes GPT-5.4 remains below comparable frontier models on pricing even with the increase.
Across the release and the follow-up clarifications, GPT-5.4 is positioned as a model meant to move beyond “answer generation” and into sustained professional workflows—ones that require tool orchestration, computer interaction, long context, and outputs that look like the artifacts people actually use at work.
OpenAI’s emphasis on token efficiency, tool search, native computer use, and reduced user-flagged factual errors all point in the same direction: making agentic systems more viable in production by lowering the cost of retries—whether that retry is a human re-prompting, an agent calling another tool, or a workflow re-running because the first pass didn’t stick.
Most enterprise RAG pipelines are optimized for one search behavior. They fail silently on the others. A model trained to synthesize cross-document reports handles constraint-driven entity search poorly. A model tuned for simple lookup tasks falls apart on multi-step reasoning over internal notes. Most teams find out when something breaks.
Databricks set out to fix that with KARL, short for Knowledge Agents via Reinforcement Learning. The company trained an agent across six distinct enterprise search behaviors simultaneously using a new reinforcement learning algorithm. The result, the company claims, is a model that matches Claude Opus 4.6 on a purpose-built benchmark at 33% lower cost per query and 47% lower latency, trained entirely on synthetic data the agent generated itself with no human labeling required. That comparison is based on KARLBench, which Databricks built to evaluate enterprise search behaviors.
"A lot of the big reinforcement learning wins that we've seen in the community in the past year have been on verifiable tasks where there is a right and a wrong answer," Jonathan Frankle, Chief AI Scientist at Databricks, told VentureBeat in an exclusive interview. "The tasks that we're working on for KARL, and that are just normal for most enterprises, are not strictly verifiable in that same way."
Those tasks include synthesizing intelligence across product manager meeting notes, reconstructing competitive deal outcomes from fragmented customer records, answering questions about account history where no single document has the full answer and generating battle cards from unstructured internal data. None of those has a single correct answer that a system can check automatically.
"Doing reinforcement learning in a world where you don't have a strict right and wrong answer, and figuring out how to guide the process and make sure reward hacking doesn't happen — that's really non-trivial," Frankle said. "Very little of what companies do day to day on knowledge tasks are verifiable."
Standard RAG breaks down on ambiguous, multi-step queries drawing on fragmented internal data that was never designed to be queried.
To evaluate KARL, Databricks built the KARLBench benchmark to measure performance across six enterprise search behaviors: constraint-driven entity search, cross-document report synthesis, long-document traversal with tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation and fact aggregation over internal company notes. That last task is PMBench, built from Databricks' own product manager meeting notes — fragmented, ambiguous and unstructured in ways that frontier models handle poorly.
Training on any single task and testing on the others produces poor results. The KARL paper shows that multi-task RL generalizes in ways single-task training does not. The team trained KARL on synthetic data for two of the six tasks and found it performed well on all four it had never seen.
To build a competitive battle card for a financial services customer, for example, the agent has to identify relevant accounts, filter for recency, reconstruct past competitive deals and infer outcomes — none of which is labeled anywhere in the data.
Frankle calls what KARL does "grounded reasoning": running a difficult reasoning chain while anchoring every step in retrieved facts. "You can think of this as RAG," he said, "but like RAG plus plus plus plus plus plus, all the way up to 200 vector database calls."
KARL's training is powered by OAPL, short for Optimal Advantage-based Policy Optimization with Lagged Inference policy. It's a new approach, developed jointly by researchers from Cornell, Databricks and Harvard and published in a separate paper the week before KARL.
Standard LLM reinforcement learning uses on-policy algorithms like GRPO (Group Relative Policy Optimization), which assume the model generating training data and the model being updated are in sync. In distributed training, they never are. Prior approaches corrected for this with importance sampling, introducing variance and instability. OAPL embraces the off-policy nature of distributed training instead, using a regression objective that stays stable with policy lags of more than 400 gradient steps, 100 times more off-policy than prior approaches handled. In code generation experiments, it matched a GRPO-trained model using roughly three times fewer training samples.
OAPL's sample efficiency is what keeps the training budget accessible. Reusing previously collected rollouts rather than requiring fresh on-policy data for every update meant the full KARL training run stayed within a few thousand GPU hours. That is the difference between a research project and something an enterprise team can realistically attempt.
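The bookkeeping behind that sample-efficiency claim is easy to illustrate. The toy below models only the accounting, not OAPL's actual objective: if each batch of rollouts can safely be reused for k gradient steps (as OAPL's tolerance for lagged rollouts allows), the number of freshly generated rollouts drops by a factor of k. The function name and parameters are invented for the illustration.

```python
# Toy accounting of off-policy rollout reuse (not the OAPL algorithm itself).
# On-policy methods need fresh rollouts for every gradient step; a method that
# tolerates lagged rollouts can amortize each batch across several steps.

def fresh_rollouts_needed(total_updates, batch_size, reuse_factor):
    """Rollouts that must be freshly generated to perform total_updates steps."""
    batches = -(-total_updates // reuse_factor)  # ceiling division
    return batches * batch_size

on_policy = fresh_rollouts_needed(900, batch_size=64, reuse_factor=1)
off_policy = fresh_rollouts_needed(900, batch_size=64, reuse_factor=3)
print(on_policy, off_policy, on_policy // off_policy)
# → 57600 19200 3  (3x fewer fresh samples at the same number of updates)
```

Since rollout generation, not the gradient step, dominates cost in agentic RL (each rollout here means a full multi-query search episode), that factor translates almost directly into GPU hours saved.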
There has been a lot of discussion in the industry in recent months about how RAG can be replaced with contextual memory, also sometimes referred to as agentic memory.
For Frankle, it's not an either/or question; he sees it as a layered stack. A vector database with millions of entries, far too large to fit in context, sits at the base. The LLM context window sits at the top. Between them, compression and caching layers are emerging that determine how much of what an agent has already learned it can carry forward.
For KARL, this is not abstract. Some KARLBench tasks required 200 sequential vector database queries, with the agent refining searches, verifying details and cross-referencing documents before committing to an answer, exhausting the context window many times over. Rather than training a separate summarization model, the team let KARL learn compression end-to-end through RL: when context grows too large, the agent compresses it and continues, with the only training signal being the reward at the end of the task. Removing that learned compression dropped accuracy on one benchmark from 57% to 39%.
"We just let the model figure out how to compress its own context," Frankle said. "And this worked phenomenally well."
Frankle was candid about the failure modes. KARL struggles most on questions with significant ambiguity, where multiple valid answers exist and the model can't determine whether the question is genuinely open-ended or just hard to answer. That judgment call is still an unsolved problem.
The model also exhibits what Frankle described as giving up early on some queries — stopping before producing a final answer. He pushed back on framing this as a failure, noting that the most expensive queries are typically the ones the model gets wrong anyway. Stopping is often the right call.
KARL was also trained and evaluated exclusively on vector search. Tasks requiring SQL queries, file search, or Python-based calculation are not yet in scope. Frankle said those capabilities are next on the roadmap, but they are not in the current system.
KARL surfaces three decisions worth revisiting for teams evaluating their retrieval infrastructure.
The first is pipeline architecture. If your RAG agent is optimized for one search behavior, the KARL results suggest it is failing on others. Multi-task training across diverse retrieval behaviors produces models that generalize. Narrow pipelines do not.
The second is why RL matters here — and it's not just a training detail. Databricks tested the alternative: distilling from expert models via supervised fine-tuning. That approach improved in-distribution performance but produced negligible gains on tasks the model had never seen. RL developed general search behaviors that transferred. For enterprise teams facing heterogeneous data and unpredictable query types, that distinction is the whole game.

The third is what RL efficiency actually means in practice. A model trained to search better completes tasks in fewer steps, stops earlier on queries it cannot answer, diversifies its search rather than repeating failed queries, and compresses its own context rather than running out of room. The argument for training purpose-built search agents rather than routing everything through general-purpose frontier APIs is not primarily about cost. It is about building a model that knows how to do the job.





























