Agent Systems Design Space: Architecture, Competition, and the Horizon Ahead
A summary and analysis of the Claude Code Architecture study (arXiv:2604.14228v1)
Design Space Analysis - Five Values, Thirteen Principles
The paper’s starting point is honest: architecture is values made concrete. Before describing a single component, the authors identify five human values that drive the design of an AI coding agent:
Human Decision Authority: the human remains the point of control, even when the agent could plausibly proceed autonomously.
Safety, Security, and Privacy: system design treats these not as features to add later but as structural constraints.
Reliable Execution: agents must produce consistent, repeatable outcomes rather than fluky brilliance followed by spectacular failures.
Capability Amplification: the system exists to make the human more capable, not to demonstrate the model’s capabilities.
Contextual Adaptability: the agent must gracefully handle radically different contexts, from a two-person startup to a regulated enterprise.
These five values translate, somewhat heroically, into thirteen design principles. The most interesting is “deny-first permission evaluation”: the system assumes any action requires explicit permission until proven otherwise. This is architecturally unusual. Most agent frameworks adopt an open-by-default model where tools are available unless explicitly restricted. Claude Code inverts this, treating permission as a first-class architectural concern rather than a later addition.
The permission system itself is notably sophisticated: seven distinct modes (plan, default, auto, dontAsk, bypassPermissions, bubble for subagent escalation, and acceptEdits), backed by an ML-based classifier. The classifier evaluates each action against the permission context and decides whether to surface a prompt to the user. This is not a hard-coded rules engine; it is learned from real usage patterns.
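To make the deny-first posture concrete, here is a minimal sketch of how the seven modes might compose with an explicit-grant check. The structure is inferred from the paper’s description; the function names, the `bubble` escalation handling, and the threshold stub standing in for the ML classifier are all assumptions, not the actual implementation.

```python
from enum import Enum

class PermissionMode(Enum):
    # The seven modes named in the paper.
    PLAN = "plan"
    DEFAULT = "default"
    AUTO = "auto"
    DONT_ASK = "dontAsk"
    BYPASS = "bypassPermissions"
    BUBBLE = "bubble"            # subagent escalation to the parent
    ACCEPT_EDITS = "acceptEdits"

def evaluate_action(action, mode, allow_rules, classifier):
    """Deny-first: an action is blocked unless something explicitly allows it."""
    if mode is PermissionMode.BYPASS:
        return "allow"                            # operator opted out of checks
    if any(rule.matches(action) for rule in allow_rules):
        return "allow"                            # an explicit grant exists
    if mode is PermissionMode.BUBBLE:
        return "escalate"                         # let the parent agent decide
    # No explicit grant: the classifier decides whether to prompt the user.
    # The paper describes an ML model; a fixed threshold stands in here.
    return "prompt" if classifier(action) > 0.5 else "deny"
```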
The five-layer context compaction pipeline is the other substantial architectural contribution. Rather than treating context as cheap and abundant, the design treats it as a scarce resource to be managed actively. The pipeline has five stages: budget reduction, snip, microcompact, context collapse, and auto-compact, each a progressively more aggressive form of context reduction. This acknowledges the uncomfortable reality that context windows are finite and expensive, and it lets the system degrade gracefully as a session approaches their limits.
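The cascade structure suggests a simple contract: try the cheapest reduction first and escalate only while the context still does not fit. A sketch under that assumption follows; the stage bodies are placeholders, since the paper names the layers but does not publish their implementations.

```python
def budget_reduction(msgs):   # placeholder: shed low-priority items first
    return [m for m in msgs if m.get("priority", 1) > 0]

def snip(msgs):               # placeholder: truncate oversized tool outputs
    return [{**m, "content": m["content"][:2000]} for m in msgs]

def microcompact(msgs):       # placeholder: keep the opening and recent turns
    return msgs[:2] + msgs[-20:]

def context_collapse(msgs):   # placeholder: collapse whole phases to summaries
    return msgs[:1] + msgs[-5:]

def auto_compact(msgs):       # placeholder: rewrite everything into one summary
    return msgs[-1:]

STAGES = [budget_reduction, snip, microcompact, context_collapse, auto_compact]

def compact(msgs, budget, count_tokens):
    """Apply progressively more aggressive reducers until the context fits."""
    for stage in STAGES:
        if count_tokens(msgs) <= budget:
            return msgs              # the cheapest stage that fits wins
        msgs = stage(msgs)
    return msgs                      # auto-compact is the floor
```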
The architecture also implements four extensibility mechanisms: MCP servers, plugins, skills, and hooks, each with different context costs and capability profiles. This is a thoughtful recognition that users will inevitably want to extend the system, and the question is not whether to enable extension but how to make it survivable from a context standpoint.
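One way to read “different context costs” is as a budgeting problem: some mechanisms charge the context window up front (tool schemas), others only when invoked. The sketch below is an illustration of that idea, not the paper’s design; which mechanisms load lazily, and every number involved, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Extension:
    kind: str            # "mcp" | "plugin" | "skill" | "hook"
    schema_tokens: int   # context charged at session start (illustrative)
    lazy: bool           # assumption: some mechanisms can defer loading

class ExtensionRegistry:
    def __init__(self, context_budget: int):
        self.remaining = context_budget
        self.extensions: list[Extension] = []

    def register(self, ext: Extension) -> None:
        # Only eagerly loaded extensions charge the session budget immediately.
        upfront = 0 if ext.lazy else ext.schema_tokens
        if upfront > self.remaining:
            raise ValueError(f"{ext.kind} extension would exceed the context budget")
        self.remaining -= upfront
        self.extensions.append(ext)
```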
A striking finding: only about 1.6% of the codebase handles actual AI decision-making. The remaining 98.4% is operational infrastructure. This should give pause to anyone who has spent the last two years arguing about which foundation model to use. The scaffold matters more than the model. More on this later.
Architecture Comparison with OpenClaw - Same Question, Different Answers
The comparison between Claude Code and OpenClaw is the paper’s most intellectually satisfying section, precisely because it resists the temptation to declare a winner. Instead, it reveals how identical design questions produce contextually appropriate answers.
Both systems are AI coding agents. Both must handle the same fundamental challenges: how to evaluate permissions, how to manage context, how to expose extensibility. And yet there are a few notable differences:
Permission evaluation: Claude Code performs per-action safety evaluation within a CLI loop. OpenClaw employs perimeter-level access control within a gateway control plane. These are not merely different implementations; they reflect different deployment philosophies. Claude Code assumes a single user at a terminal, making fast per-action decisions tractable. OpenClaw assumes a multi-user gateway context where perimeter control is more efficient than per-action prompting. Neither is universally right (a schematic sketch after this list contrasts the two placements).
Context management: Claude Code’s five-layer compaction pipeline has a direct analogue in OpenClaw’s gateway-wide capability registration. The mechanisms differ but the underlying problem is identical: context is expensive, and you need a strategy for managing it before it runs out. The architectural difference reflects deployment context: CLI agents see context as a per-session problem; gateway agents see it as a system-wide resource to be allocated across many concurrent sessions.
Extensibility: Claude Code’s four-tier extensibility model (MCP servers, plugins, skills, hooks) has a rough parallel in OpenClaw’s gateway extension mechanisms. The OpenClaw architecture appears to lean more heavily on gateway-level registration, whereas Claude Code distributes extensibility across different cost profiles.
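As promised above, here is a schematic contrast of where the two philosophies place the permission decision. This is not either system’s actual code; the helper names (`execute`, `authorize`) are hypothetical.

```python
def execute(action):
    print(f"running {action}")       # stand-in for real tool execution

# Per-action (CLI) style: every tool call passes through a local check.
def run_cli_agent(actions, check):
    for action in actions:
        if check(action):            # fast, local, decided per action
            execute(action)

# Perimeter (gateway) style: one authorization decision at the boundary,
# then actions are trusted within the capabilities that decision granted.
def run_gateway_session(actions, credentials, authorize):
    granted = authorize(credentials)           # decided once, at the perimeter
    for action in actions:
        if action["capability"] in granted:
            execute(action)
```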
The deeper point is one the paper makes quietly but clearly: the design space for agent systems is context-dependent in ways that make direct comparison philosophically suspect. Claude Code is designed for a developer sitting at a terminal who wants a capable, safe, fast coding partner. OpenClaw is designed for organizations that need to deploy agents at scale behind a gateway with consistent governance. These are genuinely different problems, and the architectures reflect those differences. A framework that declares one superior to the other without specifying the deployment context is not making a scientific claim: it is making noise.
This is a welcome corrective to the benchmark-driven discourse that dominates agent system evaluation. SWE-bench scores, HumanEval results, and similar metrics are useful signals but they are not architecture evaluations. The paper’s implicit argument (that architecture choices are too important to leave to leaderboard position comparisons) is well taken.
Open Directions - Six Bets on the Future of Agent Systems
The paper identifies six open directions, each representing a real gap between current capability and what the field needs. What follows is an analysis of each direction, informed by current research, with feasibility assessments and cost-benefit evaluations.
Direction 1: Bridging the Observability-Evaluation Gap
Feasibility: 7/10 | Timeline: 18–36 months | Impact: High
The field has a fractured relationship with understanding what agents actually do. “Observability” and “evaluation” are treated as a single problem but they are distinct: observability is about understanding what happened (trace, log, record), while evaluation is about determining whether what happened was correct (judge, score, assess). Production agent systems need both simultaneously, but most tooling solves one or the other.
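The distinction fits in a few lines of code. In the sketch below (hypothetical functions, not any particular tool’s API), `trace` only records and `evaluate` only judges; a production system needs both, wired together.

```python
import json
import time

def trace(event: dict, sink) -> None:
    """Observability: record what happened; make no judgment about it."""
    sink.write(json.dumps({"ts": time.time(), **event}) + "\n")

def evaluate(events: list[dict], judge) -> list[bool]:
    """Evaluation: decide whether what happened was correct."""
    return [judge(e) for e in events if e.get("type") == "action"]
```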
Current state: fragmented. AgentTrace offers structured logging taxonomies. HAL (the agent harness analysis project) has produced the uncomfortable finding that scaffold design explains more variance in agent performance than model choice; yet all major leaderboards compare models. SWE-EVO has further destabilized the field by demonstrating benchmark instability: depending on evaluation conditions, even frontier models solve only 19–21% of problems on which a simplified benchmark reports 65%.
The business model problem is the real blocker. Observability and evaluation infrastructure is expensive to build and maintain, and no company has yet found a compelling revenue model for selling it as a standalone product. It tends to get built as part of a broader platform (Databricks, Azure, AWS all have nascent offerings) but the depth of evaluation tooling available for, say, database query optimization does not yet exist for agent systems.
Cost-benefit: High. Organizations running agents in production without observability and evaluation infrastructure are flying blind. The cost of building this infrastructure is real but the cost of operating without it (failed agents, undetected failures, expensive hallucination cycles) is rapidly becoming the larger line item.
Direction 2: Cross-Session Persistence
Feasibility: 8/10 | Timeline: 12–24 months | Impact: Medium-High
The ambition here is modest but real: agents should remember what happened in previous sessions so that subsequent sessions are not forced to start from zero. The research has converged on a three-tier taxonomy: episodic memory (what happened in this session), semantic memory (what did I learn from this), and procedural memory (how do I do this type of task).
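A minimal sketch of the taxonomy, assuming a distillation step that promotes session events into the durable tiers (the `distill` function and the field layout are illustrative, not any system’s actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    episodic: list = field(default_factory=list)    # what happened this session
    semantic: dict = field(default_factory=dict)    # facts learned across sessions
    procedural: dict = field(default_factory=dict)  # how-to recipes by task type

    def end_session(self, transcript: list, distill) -> None:
        """Promote session events into the durable tiers at session end."""
        self.episodic = transcript
        facts, recipes = distill(transcript)   # assumed: returns (dict, dict)
        self.semantic.update(facts)
        self.procedural.update(recipes)
```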
Key implementations are already in the field. MemGPT pioneered the distinction between core memory and archival memory, treating them as different storage tiers with different retrieval costs. Springdrift demonstrated continuous persistent agents as supervised processes: agents that run as long-lived processes rather than session-scoped invocations. MemMachine achieved 93% and 92% on multi-hop retrieval benchmarks using contextualized retrieval.
The core architecture question is largely solved. The remaining engineering problems (schema versioning across memory updates, efficient state restoration, privacy and selective forgetting) are tractable rather than fundamental. The bigger risk is that the problem becomes economically irrelevant : as context windows expand (1M token contexts are now standard, 10M is on the roadmap), the pressure to externalize memory weakens. External memory is valuable primarily because context windows are constrained. If constraints ease, the problem shrinks.
Cost-benefit: Positive near-term. Cross-session persistence is achievable with current engineering and delivers meaningful user experience improvements. The risk is that it becomes a transitional technology rendered obsolete by context window expansion, but “transitional” should not be confused with “unworthwhile.”
Direction 3: Evolving Harness Boundaries
Feasibility: 7/10 | Timeline: 12–24 months | Impact: High
This is where the paper’s earlier finding (98.4% of the codebase is scaffolding) becomes a research agenda. If scaffolds explain more variance than models, we need to understand scaffolds systematically rather than empirically.
SWE-agent is the most compelling data point. The entire SWE-agent implementation is roughly 100 lines of Python. It achieves >74% on SWE-bench, outperforming systems with vastly more complex scaffolding. Live-SWE-agent extends this with a self-evolving runtime: at 79.2% with Claude Opus 4.5, it is competitive with systems that consume an order of magnitude more infrastructure. The implication is uncomfortable for anyone who has invested heavily in complex harness design: simple harnesses can outperform complex ones, and we do not fully understand why.
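It is worth seeing what a harness that small boils down to. The sketch below is not SWE-agent’s actual loop; it is the generic skeleton such a harness reduces to, with the `llm` and `tools` interfaces assumed.

```python
def minimal_harness(llm, tools: dict, task: str, max_steps: int = 50):
    """The essence of a small agent harness: a loop, a tool table, a stop check."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(history)                       # model proposes the next step
        history.append({"role": "assistant", "content": reply.text})
        if reply.tool_call is None:
            return reply.text                      # model declared itself done
        # Dispatch the requested tool and feed the result back as context.
        result = tools[reply.tool_call.name](**reply.tool_call.args)
        history.append({"role": "tool", "content": str(result)})
    return None                                    # step budget exhausted
```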
HAL’s analysis confirms the broader pattern: scaffold choices dramatically impact both accuracy AND cost, yet comparisons across scaffolds are rare in the literature. The field is empirically driven in a domain where empirical results are notoriously fragile to benchmark-specific noise.
Cost-benefit: Potentially very high. Understanding harness design systematically could unlock more performance improvement per dollar than switching foundation models, and it would be available to everyone regardless of which model’s API they use. The cost is primarily research time, and the risk is that the field continues treating this as an engineering problem rather than a research one.
Direction 4: Scaling to Scientific Programs
Feasibility: 4/10 | Timeline: 5+ years | Impact: Very High
This is the most ambitious and the most sobering direction in the set. The vision is agents that can conduct full scientific research programs: not just assist with literature review or draft papers, but formulate hypotheses, design experiments, implement them, iterate on failed implementations, and produce validated scientific findings.
Current state: not close. A systematic bioRxiv study evaluated eight AI agent frameworks on autonomous scientific research tasks and found that none completed a full research cycle. All produced hallucinations. All failed at robust implementation. The problems are not incremental; they represent a fundamental gap between “useful coding assistant” and “autonomous scientist.”
The core issue is factual grounding in knowledge-intensive domains. A coding agent hallucinating a function call is annoying. A scientific agent hallucinating a molecular mechanism or a statistical relationship is dangerous. Scientific knowledge has a higher truth bar than code: the domain tolerates far less error, and the consequences of error are more severe.
Benchmarks do not help here. Current benchmarks measure isolated task performance on well-defined problems: precisely the conditions that do not hold in actual scientific research. Real science is open-ended, iterative, and requires judgment calls about which results to pursue and which to abandon. Measuring that requires benchmarks that do not yet exist.
Cost-benefit: The potential payoff is transformative: autonomous scientific discovery at scale would be one of the most significant technological developments in human history. The cost is also transformative: this requires fundamental research advances, not engineering improvements. The risk-adjusted expected value is positive but the variance is enormous. This is a long-horizon bet appropriate for well-capitalized research organizations, not production engineering teams.
Direction 5: Governance at Scale
Feasibility: 6/10 | Timeline: 18–36 months | Impact: High
When organizations deploy a single agent for a single task, governance is tractable. When they deploy hundreds or thousands of agents performing heterogeneous tasks across departments and jurisdictions, the governance problem becomes qualitatively harder. Who is accountable? What are the constraints? How do you enforce constraints when the agent’s action space is large and dynamic?
Current state: early but accelerating. AI Gateway (Databricks), Institutional AI (enforceable constraints via Oracle/Controller patterns), and MI9 (six coordinated runtime mechanisms) represent different approaches to multi-agent governance. GaaS (Governance as a Service) explores black-box governance that operates without requiring model cooperation; an important distinction, since not all agents will be cooperative participants in governance frameworks.
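The black-box idea is worth making concrete: governance interposes on the agent’s observable actions and never needs the model’s cooperation. A minimal sketch follows, with hypothetical check names and a fail-closed default.

```python
def govern(agent_step, checks):
    """Wrap an opaque agent step with external policy checks (fail closed)."""
    def governed(observation):
        action = agent_step(observation)    # the agent itself is a black box
        for check in checks:
            action = check(action)          # a check may pass, rewrite, or block
            if action is None:
                return {"type": "blocked"}  # deny rather than guess
        return action
    return governed

# Hypothetical usage; the check functions are illustrative names only:
# safe_step = govern(agent.step, [block_outbound_network, redact_pii])
```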
The regulatory pressure is now real. The EU AI Act becomes enforceable in August 2026. The Colorado AI Act takes effect in July 2026. Organizations deploying agent systems at scale will need documented governance frameworks not as a best practice but as a legal obligation. This is no longer an academic concern.
The cost of governance infrastructure is non-trivial: estimates suggest it adds 20–50% to orchestration budgets for large enterprises, which translates to $8–15M annually for organizations at scale. This is a meaningful line item that will get the attention of CFOs and CTOs alike.
Cost-benefit: Strong near-term. Regulatory pressure makes governance infrastructure non-optional for organizations operating in EU and US jurisdictions. The cost is high but the cost of non-compliance is higher. This direction is less about technological research and more about engineering implementation of known patterns.
Direction 6: Preserving Long-Term Human Capability Alongside Short-Term Amplification
Feasibility: 5/10 | Timeline: 3–5 years | Impact: Very High
This direction is the most underappreciated in technical circles and the most strategically consequential in the long run. McKinsey estimates $2.9 trillion in annual US economic value from AI augmentation. The question is not whether AI can amplify human capability (it demonstrably can) but whether it can do so while preserving the long-term capability of the humans it augments.
The concern is not abstract. A physician who delegates all diagnostic reasoning to an AI system may become dramatically more productive in the short term and dramatically less capable in the long term. An engineer who uses AI to write all their code may ship more features in the short term and lose the ability to reason about system architecture in the long term. This is not science fiction: it is a well-documented pattern in tool use research.
The WORKBank study of 1,500 workers across 844 tasks found diverse Human Agency Scale profiles: different people respond differently to augmentation, and the factors that predict capability preservation versus capability atrophy are not yet well understood. Early signals suggest skills shift from information-focused to interpersonal as AI handles more information processing, which may be a positive adaptation or may be an erosion of certain cognitive muscles, depending on your perspective.
The most interesting data point may be Andrej Karpathy’s public statements about his own workflow: roughly 16 hours per day expressing intent to AI systems and delegating execution. This is a new mode of human-AI interaction that has no historical precedent, and its long-term effects on human capability are unknown.
The key insight from early research: augmentation (preserving institutional knowledge while eliminating routine work) may generate 2–4x more value than replacement in knowledge-intensive roles. Organizations that understand this will invest in capability-preserving augmentation architectures; organizations that optimize purely for short-term productivity will extract value until the humans they depend on can no longer provide it.
Cost-benefit: Hard to quantify but potentially the most important direction in this list. Unlike the others, it is not primarily a technology problem: it is a human factors and organizational design problem. The technical work is relatively tractable; the harder work is economic incentive alignment and measurement.
Where to Place Your Bets
The six directions form something of a causal chain. Observability and evaluation are prerequisites for everything else: you cannot improve what you cannot measure (and here I recall an old Dutch saying: meten is weten, “to measure is to know”), and you cannot govern what you cannot observe. Cross-session persistence enables long-horizon tasks but raises the governance stakes. Scientific programs represent the extreme edge case where all of the above are simultaneously required at maximum intensity. Governance becomes non-optional at scale, and the question of human capability preservation is ultimately the question of whether the whole endeavor serves human flourishing or merely human productivity.
Most actionable near-term: Directions 2 (cross-session persistence) and 3 (harness boundaries). These are primarily engineering problems with working implementations and clear user value. Direction 1 (observability) is also engineering but lacks a sustainable business model for standalone tooling, which slows adoption.
Highest risk: Direction 4 (scientific programs). Current systems cannot complete full research cycles and hallucinate in ways that are dangerous in scientific contexts. This is not an engineering problem: it requires fundamental advances in factual grounding and reasoning under uncertainty.
Most strategically undervalued: Direction 6 (human capability). It receives almost no technical research attention despite being the difference between AI amplifying human capability and making humans obsolete. Organizations that solve this first will have a durable advantage that cannot be replicated by better models alone.
The framing of “open directions” is appropriate: these are truly open, meaning both that the problems are unsolved and that the solutions, when found, will likely look different from what we currently imagine. The paper deserves credit for identifying real gaps rather than invented ones. The agent systems field has no shortage of impressive demos and a genuine shortage of honest accounting of what remains unsolved. This paper is a welcome contribution to the latter category.
This analysis is based on arXiv:2604.14228v1 and current research as of April 2026.


