The Lyceum: Agentic AI Weekly — Mar 24, 2026
Week of March 24, 2026
The Big Picture
An AI agent ran 700 experiments in two days, found optimizations a human expert missed, and committed the results to git while he slept. Meanwhile, Databricks published data from 20,000 organizations showing that the companies actually shipping agents to production aren't the fastest movers — they're the ones that built governance frameworks first. The theme this week isn't "agents are coming." It's that agents are already working, and the bottleneck has shifted from capability to plumbing: memory, fault tolerance, audit trails, and the unsexy infrastructure that keeps autonomous systems from quietly breaking at 3 a.m.
This Week's Stories
Karpathy Let an AI Agent Run His Lab for Two Days. It Found What He Couldn't.
The most important AI story this week wasn't a model launch. It was a 630-line Python script and a GPU left running overnight.
Andrej Karpathy — the former Tesla AI director and OpenAI co-founder who's become something like the field's most-watched independent researcher — published an open-source framework called autoresearch and let an AI coding agent run it continuously for two days. The task: improve the training speed of a small AI model Karpathy believed was already well-optimized. The agent conducted 700 experiments — roughly one every four minutes — and found an 11% training speed gain in that two-day run, including a subtle normalization bug humans had missed. By the first morning, 50 experiments were done and the results were committed to git without a single human instruction in between.
The architecture rests on three primitives simple enough to fit on an index card: an editable asset (a single file the agent can modify), a scalar metric (a number the computer can evaluate without human judgment), and a time-boxed cycle (a fixed window for each experiment). Together, these constraints let the agent iterate at roughly 100 experiments per sleep cycle. Karpathy described the broader shift in a Business Insider interview: he's largely stopped hand-coding and now spends more time "expressing intent" to agents, treating himself as a manager rather than a programmer.
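The three primitives above can be sketched in a few lines. This is a minimal toy, not Karpathy's actual autoresearch code: the "metric" here just rewards a shorter file, and `propose_change` is a random stand-in for an LLM coding agent, but the loop structure — back up the asset, edit it, score it, keep only improvements within the time box — is the pattern.

```python
import random
import shutil
import time

def run_metric(asset_path: str) -> float:
    """Scalar metric the machine can judge without a human.
    Toy version: a shorter file scores higher. A real loop would
    measure training speed, loss, or wall-clock time instead."""
    with open(asset_path) as f:
        return -len(f.read())

def propose_change(asset_path: str) -> None:
    """Stand-in for the agent editing the asset: randomly drop
    or append a line. A real loop calls an LLM coding agent here."""
    with open(asset_path) as f:
        lines = f.readlines()
    if lines and random.random() < 0.5:
        lines.pop(random.randrange(len(lines)))
    else:
        lines.append("# experimental tweak\n")
    with open(asset_path, "w") as f:
        f.writelines(lines)

def autoresearch(asset_path: str, cycles: int, budget_s: float) -> float:
    """Run time-boxed experiment cycles against one editable asset,
    auto-keeping only changes that improve the scalar metric."""
    best = run_metric(asset_path)
    for _ in range(cycles):
        deadline = time.monotonic() + budget_s      # time-boxed cycle
        shutil.copy(asset_path, asset_path + ".bak")  # cheap rollback point
        propose_change(asset_path)                  # agent edits the asset
        score = run_metric(asset_path)              # machine-judged metric
        if score > best and time.monotonic() <= deadline:
            best = score                            # auto-keep the improvement
        else:
            shutil.copy(asset_path + ".bak", asset_path)  # roll back
    return best
```

Because every rejected change is rolled back, the asset always holds the best-scoring version found so far — which is also why the rollback mechanism, not the agent, is the part that makes overnight runs safe.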
The community response was immediate. Independent developers forked the pattern into loops like AGR (Autonomous Git Research), which auto-accept changes that pass predefined improvement thresholds — one demo ran 14 experiments and auto-kept 7. Shopify CEO Tobias Lütke ran a production instance and reported a 19% improvement in that overnight run of 37 experiments.
If this pattern works at scale, it changes what "researcher" means: less hands-on experimenter, more manager of autonomous experimentation loops. The failure mode is equally clear — without robust rollback, test coverage, and accountability mechanisms, autonomous changes that pass a narrow metric could introduce regressions humans don't catch until production breaks. The signal to watch: if a frontier AI lab publishes internal autoresearch results, the approach has moved from clever demo to mainstream R&D practice.
5 Million Developers Have a New Open-Source Coding Agent — and It's Picking a Standards Fight
The paid coding-agent market — Cursor, Claude Code, GitHub Copilot — has looked like a settled oligopoly. This week, Hacker News reminded everyone it isn't.
OpenCode is an open-source terminal-based coding agent that crossed 120,000 GitHub stars, 800 contributors, and 5 million monthly users. Its Hacker News thread pulled over 1,200 upvotes and turned into a sprawling debate about what developers actually want from a coding agent in 2026. The answer, apparently, is flexibility: OpenCode supports over 75 models — Claude, OpenAI, Gemini, and local models — and works with any editor supporting the Agent Client Protocol (ACP), an emerging standard for how code editors talk to coding agents. Compatible editors already include JetBrains IDEs, Zed, Neovim, and Emacs.
That last point matters more than the spec sheet suggests. A coding agent that isn't locked to one model provider is a hedge against every AI pricing and policy risk at once. And ACP is positioning itself as the USB standard for coding agents — a neutral interoperability layer that, if widely adopted, would make vendor lock-in structurally harder. Nobody's formally declared a standards war, but that's what this is.
A community-built plugin system called OpenPackage now lets developers install Claude Code plugins into OpenCode with a single command. Projects like Zeroshot are treating OpenCode as an "agent substrate" — a foundation for running clusters of agents rather than a single assistant. The honest reality: Hacker News commenters noted it's rougher around the edges than Cursor or Claude Code, and configuration can be complex. But the 5 million user figure suggests it's crossed the "developer toy" threshold. If JetBrains or VS Code formally announces ACP support, the interoperability fight will have its first major battle won. If ACP stalls, the coding-agent market stays fragmented and proprietary.
The First Real-World Report Card for AI Agents Is Here — and It's Humbling
For all the demos and benchmarks, we've lacked one thing: a large-scale evaluation where real money was on the line. Upwork just provided it.
Upwork's March 2026 Human-Agent Productivity Index ran top AI agents on over 300 real freelance projects — the kind of work people actually pay for. The headline result is blunt: agents working alone completed about 2.5% of tasks in that evaluation. That's not a typo. Two and a half percent.
But the story flips when humans enter the picture. Pairing an expert freelancer with an agent boosted completion rates by as much as 70% in the study. The implication is immediate and practical: for most real knowledge work today, the optimal configuration is human-plus-agent, not agent-alone. Agents are scalable, tireless collaborators — not replacements.
If this holds across larger studies, it reshapes how companies should budget for AI: not as headcount reduction, but as capability amplification. The failure scenario is that executives read the "agent" marketing and skip the "human" part, deploying autonomous systems on tasks that still need judgment. The signal to watch is whether enterprise procurement teams start requiring human-agent pair benchmarks — not just solo agent scores — before signing contracts.
The Production Numbers Nobody Is Quoting
Everyone argues about whether AI agents are ready for real work. Databricks just released data from 20,000+ organizations that answers the question with numbers instead of opinions.
A March 2026 industry roundup aggregating multiple reports puts numbers on it: 67% of Fortune 500 companies now have at least one AI agent in production, up from 34% in 2025. Customer service leads at 42% of deployments, followed by data analysis (28%) and coding assistance (19%). Per the same roundup, JPMorgan has expanded to 200+ specialized financial analysis agents, Shopify handles 60% of merchant support tickets autonomously, and financial operations teams report 30–50% acceleration in close processes where agents have reached production maturity.
But the finding your CFO should read is buried in the Databricks analysis: companies using AI governance tools get over 12 times more AI projects into production than those without them. That's not a rounding error — it's a structural multiplier that inverts the usual "move fast, govern later" logic. The organizations actually shipping agents aren't the ones moving fastest. They're the ones that built guardrails first.
If the governance-first pattern holds, the next 18 months of enterprise AI work isn't about better models or bigger budgets. It's about monitoring, audit trails, and approval workflows — the plumbing nobody wants to build. The observable signal: watch Q1 earnings calls from major banks and retailers this week. If JPMorgan or Walmart volunteers agent deployment ROI numbers publicly, it'll be the most concrete cost data the industry has seen.
Dapr Agents Hits 1.0 — and the Timing Is Not a Coincidence
Most AI agents today run on frameworks that weren't designed for agents. Dapr Agents v1.0, announced as generally available on March 23, changes that calculation.
Dapr — Distributed Application Runtime — is the Cloud Native Computing Foundation framework already used by thousands of companies to run cloud-native applications. The 1.0 release bakes native agent support into that proven infrastructure: secure multi-agent coordination, state management, and failure recovery, all built on top of a runtime that has already survived countless production incidents. This is the difference between asking an enterprise to adopt a brand-new framework and telling them the infrastructure they already trust now speaks agent.
The release landed the same day Karpathy's autoresearch story peaked, and it received less attention than it might have otherwise. That's a shame, because for enterprise IT teams, this may be the more consequential development. Bolting agent capabilities onto battle-tested cloud-native plumbing is a fundamentally different proposition than yet another research-grade Python framework.
If major cloud providers — AWS, Azure, Google Cloud — fast-track managed Dapr Agents integrations, this becomes the enterprise default for agent infrastructure. If they don't, it remains a strong option for teams already in the Dapr ecosystem but doesn't reshape the broader market. Watch for managed-service announcements in the next 30 days.
MCP Gets Its First Full Threat Map — and It's 38 Problems Long
The Model Context Protocol (MCP) — the open standard that lets AI agents call external services, increasingly the plumbing behind tool-using agents everywhere — just got its first comprehensive security audit. The results are sobering.
A new preprint catalogs 38 distinct threat categories mapped to both OWASP's AI top-10 and a separate "agentic applications" top-10. The authors argue MCP creates attack surfaces that are fundamentally different from classic web apps or standalone language models, especially around what they call "semantic" vulnerabilities: tool description poisoning (where an attacker manipulates how an agent understands what a tool does) and parasitic tool chaining (where one compromised tool redirects an agent's entire workflow). Prior empirical work cited in the paper shows higher attack success rates for MCP-based agents versus bespoke integrations.
No mainstream vendor has publicly committed to the paper's framework yet, but the trajectory is clear: MCP has crossed from "cool plumbing" to "critical infrastructure that needs its own security industry." If you start seeing "MCP-38-aware" in enterprise RFPs, this is the paper they'll be pointing at. If the framework is ignored, MCP becomes the next Log4j — ubiquitous infrastructure with known vulnerabilities that nobody patches until something breaks publicly.
Cognitive Science Heavyweights Publish a Blueprint for Agents That Actually Learn
Emmanuel Dupoux, Yann LeCun, and Jitendra Malik — three researchers who've repeatedly set the field's agenda — published a position paper with a blunt title: "Why AI systems don't learn and what to do about it."
Their argument: today's large language models and agents mostly don't learn in the way humans or animals do. They're trained once, then frozen, with no ongoing adaptation driven by curiosity or self-directed behavior. The workarounds — saving information to databases, retrieval-augmented generation — are engineering tricks, not genuine learning. The paper proposes a three-part architecture: one system that learns from passive observation, another that learns from active behavior, and a meta-controller that decides when to switch between the two based on internal goals and uncertainty.
This is a roadmap, not a product — there's no implementation yet. But it lands alongside a trending preprint on catastrophic forgetting (the problem where teaching a neural network something new overwrites what it knew before), which underscores the same gap from the engineering side. Together, they frame the core limitation: an agent that has to be retrained from scratch every time it encounters a new problem type isn't truly autonomous. The signal to watch is whether a major lab announces a prototype explicitly citing this architecture. Until then, the honest answer is that long-running agents "remember" through external databases, not internal growth.
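Since the paper ships no implementation, the three-part proposal is easiest to grasp as a sketch. Everything below is our illustrative reading, not the authors' code: the "passive learner" just counts what it has seen, the "active learner" probes an environment for more examples, and the meta-controller switches between them when uncertainty crosses a threshold.

```python
class PassiveLearner:
    """Learns from passive observation. Toy stand-in: counts tokens
    seen; rare tokens register as high uncertainty."""
    def __init__(self):
        self.counts = {}

    def observe(self, token):
        self.counts[token] = self.counts.get(token, 0) + 1

    def uncertainty(self, token):
        return 1.0 / (1 + self.counts.get(token, 0))

class ActiveLearner:
    """Learns from active behavior: queries the environment for more
    examples of whatever the system is uncertain about."""
    def probe(self, env, token, passive):
        for _ in range(3):                  # self-directed data gathering
            passive.observe(env(token))

class MetaController:
    """Decides when to switch from passive to active learning based on
    internal uncertainty — our mapping of the paper's third component."""
    def __init__(self, threshold=0.4):
        self.passive = PassiveLearner()
        self.active = ActiveLearner()
        self.threshold = threshold

    def step(self, env, token):
        self.passive.observe(token)
        if self.passive.uncertainty(token) > self.threshold:
            # Curiosity kicks in: actively gather data on the novel input.
            self.active.probe(env, token, self.passive)
```

The point of the sketch is the control flow, not the learners: novelty drives behavior, and behavior feeds back into the same internal state — the loop that frozen, retrain-from-scratch models lack.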
⚡ What Most People Missed
- IQVIA has deployed 150 AI agents across 19 of the top 20 pharmaceutical companies, using NVIDIA's Nemotron models and Agent Toolkit, per NVIDIA's March 2026 announcement. Pharma is one of the most regulated industries on earth. If agents are running at this scale inside it, the "agents aren't ready for regulated industries" argument just took a serious hit.
- Distributed autoresearch swarms raise new resource and security trade-offs. Within days of Karpathy's release, community forks coordinated agents across machines; similar efforts show how distributed experimentation can unlock scale but also amplify vulnerabilities and quota problems. The consequence: teams that embrace distributed autoresearch will need tighter resource orchestration and threat modeling, not just more compute.
- An open-source memory layer called Signet claims roughly double the retrieval quality of standard RAG and aims to make agent "brains" portable across tools like Claude Code and OpenCode. If a major IDE vendor adopts it, agents that remember your project across weeks and sessions become a product feature rather than a research aspiration.
- Elixir is quietly becoming an agent orchestration language. Jido 2.0 shipped with built-in agent-to-agent handoffs, state persistence across crashes, and swarm coordination primitives — all running on the BEAM virtual machine that powers WhatsApp and telecom switches. A companion library called agent_obs adds OpenTelemetry tracing for every agent loop and tool call. When your agent framework's selling point is "it restarts itself when it crashes," you're building for production, not demos.
- API rate limits are a real failure mode for overnight agents. This week surfaced a concrete example: an xAI Grok HTTP 429 error halted an agent pipeline with the message "your team has either used all available credits or reached its monthly spending limit." Long-running agents need quota-aware designs — multi-provider routing, local model fallbacks, credit monitoring — or they silently fail at 3 a.m.
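The quota-aware routing idea from that last item can be sketched simply. This is an illustrative pattern, not any vendor's SDK: the provider names and callables are hypothetical stand-ins for real API clients, ordered from preferred hosted model down to a local fallback.

```python
import time

class QuotaExhausted(Exception):
    """Stands in for an HTTP 429 / 'monthly spending limit' response."""

def call_with_fallback(prompt, providers, backoff_s=1.0):
    """Route one request across providers in priority order.

    `providers` is a list of (name, callable) pairs — hypothetical
    stand-ins for real API clients. Quota errors rarely clear within
    seconds, so rather than hammering a spent provider we pause
    briefly and fall through to the next one, instead of letting an
    overnight run die silently at 3 a.m.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except QuotaExhausted as err:
            errors.append((name, str(err)))
            time.sleep(backoff_s)  # brief pause before the next provider
    raise RuntimeError(f"all providers exhausted: {errors}")
```

In practice the last entry in the list would be a local model that can't hit a spending limit, and the returned provider name would feed credit monitoring so a human sees the failover in the morning.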
📅 What to Watch
- If a frontier AI lab publishes internal autoresearch results — not a demo, but actual R&D output — it means autonomous experiment loops are moving from open-source novelty to how models get trained, which reshapes the economics of AI research staffing.
- If enterprise procurement teams start requiring human-agent pair benchmarks (not solo agent scores) before signing contracts, the Upwork data has changed how agents are evaluated industry-wide — and "agent replaces worker" pitches will stop working in boardrooms.
- If AWS, Azure, or Google Cloud announces a managed Dapr Agents service in the next 30 days, cloud-native agent infrastructure becomes an enterprise default rather than a framework choice, and the dozens of standalone agent orchestration startups face an existential distribution problem.
- If Anthropic or Google DeepMind counters OpenAI's BCG/McKinsey/Accenture/Capgemini consulting partnerships within 60 days, it confirms that consulting firms — not cloud providers — are the critical distribution layer for enterprise agents, which means the real competition is for partner mindshare, not developer mindshare.
- If a vendor ships an "MCP-38-aware" security product referencing this week's threat taxonomy paper, MCP security becomes a product category rather than an academic concern — and every company running MCP-based agents faces a new compliance checkbox.
From the Lyceum
The White House asked Congress to preempt state AI rules and route oversight through existing regulators; as of March 24, 2026, the request had been sent to Congress but no specific bill text, committee referral, or vote had been reported. Read → The White House Hands Congress an AI Rulebook — and Tells the States to Stand Down
The Closer
A Python script ran 700 experiments while its creator slept and found what he couldn't; a freelance marketplace proved that agents alone finish 2.5% of real jobs in Upwork's March 2026 study; and a telecom-grade runtime from the 1980s turned out to be the best foundation for AI swarms in 2026.
The most reliable agent in production this week was the one that crashed, restarted itself, and kept going — which is also a pretty good description of everyone reading this newsletter on a Monday morning.
Stay ungoverned at your own risk.
If someone you know is building, buying, or worrying about agents — forward this. They'll thank you before the next overnight run finishes.