Anthropic Just Shipped 10 More AI Agents. The Data Says Your Team Gets Slower After 4.
Anthropic shipped 10 new financial-services agents on Tuesday. By Wednesday, every managing director on the Street had a Slack DM from a junior analyst asking which one to install first. The honest answer, supported by three independent datasets nobody is quoting in those Slack threads, is none of them — not yet.
Here is the part that should worry the CFO who just signed the enterprise contract. The research on AI tool sprawl is not ambiguous, not preliminary and not a hot take. It is replicated, large-sample and pointing in one direction: somewhere between 3 and 4 AI tools, your team stops getting faster and starts getting slower.
What Anthropic actually shipped on May 5
The Anthropic announcement is not a chatbot update. It is 10 named, deployable agent templates aimed at the highest-margin labor on the Street. The lineup:

- Pitch Agent: builds pitchbooks from comps and precedents
- Meeting Prep Agent: client briefing packs
- Earnings Reviewer: earnings calls and model updates
- Model Builder: DCF, LBO and three-statement models in Excel
- Market Researcher: industry overviews
- Valuation Reviewer: GP packages to LP reporting
- GL Reconciler: finds general-ledger breaks and traces root cause
- Month-End Closer: accruals, roll-forwards, variance commentary
- Statement Auditor: audits LP statements
- KYC Screener: parses docs and runs rules
Each agent ships three ways: as a Claude Cowork plugin for desk users, a Claude Code plugin for engineering teams, and a Claude Managed Agents cookbook for IT to deploy at scale. Same day, Anthropic also launched eight new data connectors and a Moody's MCP app. The repo is open-source under Apache 2.0. JPMorgan, Goldman Sachs and Bridgewater are the launch customers. Jamie Dimon got a quote in the press release.
The pitch is irresistible: drop these into your existing stack and watch the analyst grind disappear. The reality, if you read the productivity literature, is that the analyst grind does not disappear. It mutates into something more dangerous: a cognitive workload your senior team will not notice they have until it shows up in slipped close dates and missed exposures.
Study 1: BCG and HBR — the 1,488-worker brain fry survey
In March 2026, BCG and Harvard Business Review published the largest workplace study to date on AI tool sprawl. The sample: 1,488 US workers across industries, controlling for role, seniority and tenure. The headline finding is uncomfortable for every vendor selling another agent.
Productivity rose modestly moving from one AI tool to two. It plateaued between two and three. It declined from four onward. Workers running four or more AI tools were measurably less productive than workers running two. Not equally productive. Less.
Then it gets weirder. 14% of high-tool-count workers reported what BCG named "AI brain fry" — a cluster of symptoms including mental fog, headaches, decision fatigue and slower task switching. The brain-fry rate among workers using one or two tools was negligible. The rate among workers using five-plus was over 20%.
The mechanism is straightforward when you read the qualitative interviews. Each new AI tool adds a context-switching cost: a new prompt style, a new permission scope, a new failure mode, a new place to check whether the agent already did the thing. The cognitive overhead of orchestrating four agents exceeds the time the agents save. Past a certain point, you are not delegating work. You are project-managing it.
Study 2: Nature Human Behaviour — 106 experiments, one ugly number
The October 2024 Nature Human Behaviour meta-analysis is the paper enterprise AI vendors hope you do not read. The team aggregated 106 controlled experiments covering 370 individual effect sizes on human-AI collaboration tasks. The result, expressed in standard meta-analysis notation: Hedges' g = -0.23.
Translation: human-AI teams underperformed the better of human-alone or AI-alone on decision-making tasks. Not by a hair. By roughly a quarter of a standard deviation, which in social-science terms sits between a small and a medium effect.
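For readers who do not live in meta-analysis notation, Hedges' g is simply a standardized mean difference with a small-sample correction. A minimal sketch of the definition, with group 1 standing in for the human-AI condition and group 2 for the better solo baseline, which is how the paper frames the comparison:

```latex
g = J \cdot \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\qquad
J \approx 1 - \frac{3}{4(n_1 + n_2) - 9}
```

A g of -0.23 therefore means the combined condition landed about 0.23 pooled standard deviations below the stronger solo baseline, averaged across the 370 effect sizes.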
The breakdown is the part nobody quotes. Decision-making tasks — deepfake classification, demand forecasting, medical diagnosis, fraud detection — consistently lose. Human plus AI is worse than the better solo performer. The only category where human-AI combinations gained was open-ended creative work: brainstorming, drafting, ideation. Everything you would actually pay an analyst to do? The combo loses.
Why? The authors point to two causes. First, humans defer to AI confidence even when the model is wrong, and the wrongness compounds in multi-step tasks. Second, AI tools designed to help on average create "average-quality" outputs, which means the high-skill humans who would have produced better work alone get dragged toward the mean.
This is the finding that should make every Wall Street CFO read the contract twice. The work Anthropic's new agents target — KYC screening, GL reconciliation, valuation review, statement audits — is decision-making work. It is the exact category Nature found loses with human-AI collaboration. Not gains less. Loses.
Study 3: METR — the developer perception gap
The third dataset is the most damaging because it controls for the variable everyone uses to defend AI tools: developer self-report.
METR ran a randomized controlled trial in early 2025 with experienced open-source developers working on real issues in codebases they already maintained. Each task was randomly assigned to allow or forbid AI tools, with the same evaluation criteria throughout. The result: with AI tools allowed, the developers were 19% slower. They shipped fewer pull requests, took longer per task, and produced more rework loops.
Then METR asked the developers how they thought they performed. The same group that was 19% slower reported feeling 20% faster. That is a 39-point perception gap. Not slightly off. Not within the margin of error. Inverted.
The implication is brutal: every survey, every internal "productivity" study, every CIO testimonial relying on user-reported velocity is measuring the perception gap, not the work. The CFO who asks "is your team faster with these new agents?" will get a yes from a team that is actually slower. They will not be lying. They will be wrong, and they will not know it.
What the CFO survey gets wrong
Gallup's Q1 2026 workforce survey covered 23,717 US employees. 50% reported using AI at work, up from 33% a year prior. Only 16% reported "extremely positive" impact. The other 84% rated the impact as marginal, neutral or negative. Yet enterprise AI spending is on track to grow another 60% this year.
The disconnect makes sense once you triangulate the three studies above. The people writing the checks are reading vendor case studies measuring perceived velocity. The people doing the work are quietly reporting brain fry. The middle managers are stuck in the middle, ordering more agents because the AI vendor's slide deck shows a 40% productivity uplift their own team has never seen.
The Anthropic problem in one sentence
Wall Street will not install Anthropic's 10 new agents in isolation. They will install them on top of Bloomberg Terminal, Excel, PitchBook, FactSet, Moody's MCP, S&P data feeds, internal compliance tooling and at least two existing chat-based assistants. That is roughly a 10-tool baseline before Anthropic's agents land. After deployment: a stack of around 20 tools, in a domain Nature already showed loses under human-AI collaboration, and far past the two-tool peak BCG already identified.
The exact prediction from the data: 14-20% of analysts will report brain fry within 90 days. Decision quality on KYC, valuations and reconciliations will degrade in ways that show up in audit findings, not user surveys. The senior reviewers signing off on agent-generated work product will catch the obvious errors and miss the subtle ones. Not because the agents are bad. Because the cognitive load of orchestrating them exceeds the load they save.
The two-tool rule
Here is the framework the BCG paper lands on, and the only piece of guidance that survives all three datasets.
Limit any one team to two AI tools. Pick them deliberately. One should handle the open-ended creative work where Nature found genuine gains: drafting, brainstorming, first-pass writing. The second should handle a single, narrow, decision-making task with a hard verification step at the end — a workflow where the human can check the answer in seconds, not minutes.
Anything past two tools should require an explicit business case showing the marginal task is decision-making (not creative), has a fast verification step (not slow), and replaces existing tool surface (not stacks on top of it). In practice, this means most teams should run Claude or ChatGPT for drafting, plus exactly one verticalized agent — and stop there.
The Anthropic announcement is interesting precisely because it gives buyers a way to consolidate. If a single vendor ships 10 finance-specific agents, the play is not to install all 10. It is to retire two existing tools, install one Anthropic agent that replaces both, and end the quarter with the same total tool count and a higher fraction of work flowing through agents that share context. That is the version of the AI rollout the data actually supports.
What the smart CFO does this week
Three moves the data supports, none of which the vendors are pitching.
First, count your team's current AI tools. Not officially licensed ones. Actually used. Most enterprise teams find a number between 5 and 9. That is the brain-fry zone. Cut it before adding anything.
Second, build the verification step into every decision-making AI workflow. The Nature paper's gain was on creative tasks because creative tasks have ambiguous quality criteria; the human reviewer brings new information. Decision tasks lost because the human review step rubber-stamped AI confidence. If your KYC screener flags a customer, you need a human checking the source documents, not approving a summary.
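To make that concrete, here is a minimal sketch of what such a verification gate could look like; every name in it (KycFlag, route_for_human_review, the field names) is hypothetical and not any vendor's API, but the design point is the one above: the reviewer is routed to the source documents, and approving the agent's summary alone is not a valid resolution.

```python
# Hypothetical sketch of a human verification gate for a KYC screening agent.
# None of these names come from Anthropic's agents; they only illustrate the pattern.
from dataclasses import dataclass, field


@dataclass
class KycFlag:
    customer_id: str
    agent_summary: str                                # what the screener concluded
    source_document_ids: list[str] = field(default_factory=list)


def route_for_human_review(flag: KycFlag) -> dict:
    """Build a review task that forces the checker back to primary sources."""
    if not flag.source_document_ids:
        # A flag with no underlying documents cannot be verified, only rubber-stamped.
        raise ValueError("refusing to route a flag that carries no source documents")
    return {
        "customer_id": flag.customer_id,
        "must_open": flag.source_document_ids,        # hard requirement, logged for audit
        "agent_summary": flag.agent_summary,          # shown only after the sources
        "resolution_requires": "per-document sign-off, not summary approval",
    }
```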
Third, instrument actual cycle time. Not user-reported velocity. Actual ticket-to-close, audit-finding-to-resolution, deal-to-pitchbook minutes. Compare a team using your full AI stack to a team using only two tools. The METR perception gap predicts the surveys will lie. The clocks will not.
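A minimal sketch of that measurement, assuming you can export open and close timestamps per ticket; the event schema and the two cohort labels are made up for illustration, not pulled from any ticketing system's API:

```python
# Hypothetical sketch: median open-to-close cycle time per cohort, measured
# from timestamps rather than from user-reported velocity.
from datetime import datetime
from statistics import median


def cycle_times_hours(events: list[dict]) -> dict[str, float]:
    """Each event carries cohort, opened_at and closed_at (ISO-8601 strings)."""
    by_cohort: dict[str, list[float]] = {}
    for e in events:
        opened = datetime.fromisoformat(e["opened_at"])
        closed = datetime.fromisoformat(e["closed_at"])
        by_cohort.setdefault(e["cohort"], []).append((closed - opened).total_seconds() / 3600)
    return {cohort: median(hours) for cohort, hours in by_cohort.items()}


# Compare the team running the full agent stack against the two-tool team.
print(cycle_times_hours([
    {"cohort": "full_stack", "opened_at": "2026-05-06T09:00", "closed_at": "2026-05-07T17:30"},
    {"cohort": "two_tools", "opened_at": "2026-05-06T09:00", "closed_at": "2026-05-07T11:00"},
]))
```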
Anthropic's new agents are well-built. The Apache-licensed repo is the cleanest reference implementation of finance-specific agent skills shipping anywhere right now. The problem is not the technology. The problem is the deployment pattern, which on every public dataset we have, makes teams slower past four tools.
The myth: more agents equals more productivity. The bust, supported by 1,488 workers, 106 experiments and a controlled developer trial: more agents past two equals more brain fry, worse decisions, and a perception gap that hides the damage. The CFO who installs all 10 of Anthropic's new templates will hear from her team that everything is going great. The audit logs will tell a different story by Q4.
Related Resources
- The infrastructure trend pulling the other direction: PageIndex ditches the entire vector RAG stack to consolidate retrieval into one reasoning step.
- Same reasoning-first approach exposed via MCP: PageIndex MCP — one server replaces a chunking, embedding and vector-DB pipeline.
- The skill bundle behind the Anthropic agents discussed in this article: Claude for Financial Services on Skila Repos.
- How enterprise teams are governing the resulting agent fleet: Microsoft Agent 365, the cross-cloud control plane for shadow AI.
- The forward-deployed services model behind Anthropic's enterprise push: Anthropic, Blackstone and Goldman's $1.5B JV.
Frequently Asked Questions
What is the AI productivity myth?
The AI productivity myth is the assumption that adding more AI tools automatically makes a team more productive. Three independent studies published between 2024 and 2026 — BCG/HBR (n=1,488), a Nature Human Behaviour meta-analysis of 106 experiments, and METR's randomized developer trial — all point the same way: productivity peaks around 2 AI tools and declines past 4. Heavy users report symptoms BCG named "AI brain fry": mental fog, headaches and slower decisions.
How many AI tools should my team use?
The BCG data points to a hard ceiling of two tools per team for measurable productivity gains. Pick one for open-ended creative work where human-AI combinations actually win, and one narrow decision-making tool with a fast human verification step. Past four tools, productivity declines and 14% of users report cognitive overload symptoms.
What did Anthropic announce on May 5, 2026?
Anthropic launched 10 financial-services agent templates — Pitch Agent, KYC Screener, GL Reconciler, Earnings Reviewer, Model Builder, Market Researcher, Valuation Reviewer, Month-End Closer, Statement Auditor and Meeting Prep Agent — alongside eight new data connectors and a Moody's MCP app. The repo is open-source under Apache 2.0. JPMorgan, Goldman Sachs and Bridgewater are the launch customers.
How does the METR developer study compare to vendor productivity claims?
The METR randomized trial measured experienced open-source developers and found AI tools made them 19% slower while the same developers reported feeling 20% faster — a 39-point perception gap. Vendor productivity claims rely on the same self-reported velocity METR proved is inverted. If you are evaluating an AI rollout, instrument actual cycle time rather than relying on user surveys.
Is the human-AI collaboration meta-analysis worth trusting?
Yes — it is the largest meta-analysis on the topic to date, published in Nature Human Behaviour. The team aggregated 106 controlled experiments covering 370 effect sizes and found a Hedges' g of -0.23 for human-AI combinations on decision-making tasks. Decision-heavy domains like fraud detection, medical diagnosis and demand forecasting consistently lost. Only open-ended creative tasks gained from human-AI collaboration.
Key Takeaways
- ✓Anthropic launched 10 new financial-services agent templates on May 5, 2026 — pitch builders, KYC screeners, GL reconcilers, model builders — even as three separate studies show that AI tool stacks past 4 actively reduce productivity.
- ✓BCG and HBR studied 1,488 US workers across industries: productivity rose moving from 1 to 2 AI tools, plateaued at 3, and DECLINED at 4 or more. 14% of heavy users report 'AI brain fry' — mental fog and slower decisions.
- ✓A Nature Human Behaviour meta-analysis of 106 experiments (370 effect sizes) found human-AI combinations UNDERPERFORM the better of human-or-AI-alone on decision-making tasks by Hedges' g = -0.23.
- ✓METR's randomized trial of experienced open-source developers: AI tools made them 19% slower. The same developers believed they were 20% faster — a 39-point perception gap.
- ✓Wall Street will stack Anthropic's new agents on top of Bloomberg Terminal, Excel, PitchBook, FactSet, Moody's and S&P. The data predicts more brain fry, not more deals closed.
Skila AI Editorial Team
The Skila AI editorial team researches and writes original content covering AI tools, model releases, open-source developments, and industry analysis. Our goal is to cut through the noise and give developers, product teams, and AI enthusiasts accurate, timely, and actionable information about the fast-moving AI ecosystem.