Aleph Zero

The 2 February 2025 EU AI Act Enforcement Cliff

Alexis Gil Gonzales — Sun, 24 May 2026 08:27:19 GMT

2 February 2025 was supposed to be a wake-up call.

On that date, two categories of EU AI Act obligations became legally binding; and most organizations did not notice. The first was a set of categorically prohibited AI practices that no organization can use under any circumstances. The second was an obligation on every provider and deployer of AI systems to ensure their people have sufficient AI literacy. Both have been enforceable for over a year. Both are being actively investigated by national competent authorities. And both are on the short list of what a regulator will ask about first.

If you have not audited your AI systems against the prohibited practices list, you do not know whether you are lawful. If you cannot demonstrate when your staff last received AI literacy training and what that training covered, you cannot demonstrate compliance with Article 4. These are not August 2026 problems. They are February 2025 problems — and the grace period is over.

Part 1: Prohibited AI Practices

Article 5 of the EU AI Act categorically prohibits eight types of AI system. These prohibitions are absolute — they apply regardless of risk classification, regardless of sector, and regardless of whether the system is high-risk or minimal risk. If your system falls into any of these categories, it is prohibited. There is no conformity assessment path, no CE marking route, no derogation. It is simply not allowed.

Screening all deployed and in-development AI systems against these eight categories should be the first item in any EU AI Act compliance workstream. It is remarkable how few organizations have done it.

Article 5(1)(a) : Subliminal techniques beyond awareness. AI systems that manipulate persons through subliminal mechanisms causing physical or psychological harm are prohibited. The “beyond awareness” threshold is narrow : systems that use persuasive but visible techniques are not captured. Harm is the key determinant. If your recommender system uses subliminal pattern nudging that users cannot consciously perceive and that causes them to behave in ways they would not otherwise choose, it may be in scope.

Article 5(1)(b) : Exploiting vulnerabilities. AI systems that exploit age, disability, social or economic situation to distort behaviour in ways that cause harm are prohibited. “Exploiting” requires intent to cause harm — not merely the presence of a vulnerability. Systems that do not specifically target vulnerability for distortion purposes are not captured. But a dark pattern designed to target compulsive gamblers, or a system that exploits cognitive vulnerabilities in elderly users to drive harmful purchasing behaviour, is a clear example of what this provision captures.

Article 5(1)(c) : Social scoring. AI systems that evaluate or classify natural persons based on social behaviour or personal characteristics across multiple contexts, resulting in detrimental treatment, are prohibited. Differential treatment based on social behaviour across multiple contexts is the threshold. A credit scoring system that uses financial behaviour is not captured. A system that layers social media behaviour onto lending decisions and produces materially worse outcomes for a demographic group is.

Article 5(1)(d) : Criminal risk profiling beyond scope. AI systems that profile natural persons for risk assessment in a criminal context beyond the scope of a specific investigation are prohibited. The distinction from legitimate investigation support is the key boundary. A system that helps an officer identify whether a specific person is a suspect in a specific open investigation is not prohibited. A system that scores a person’s general “criminal risk” across unrelated contexts and feeds those scores into sentencing, parole, or pre-trial detention decisions is.

Article 5(1)(e) : Emotion recognition in workplace and education. AI systems that infer emotional states in workplace or educational institutions are prohibited, with narrow exceptions for medical safety uses and age verification for minors. A performance management system that infers frustration from keystroke patterns or facial expressions in a call centre is prohibited. A system that detects pain indicators in a medical context, or one that verifies age to restrict access to age-inappropriate content, is not.

Article 5(1)(f) : Biometric categorization for sensitive attributes. AI systems that use biometric data to infer race, religion, sexual orientation, disability status, or migration status are prohibited. This prohibits inferential categorization; it is not a prohibition on processing biometric data generally. A facial recognition system that infers ethnicity to filter for demographic parity in a hiring process is prohibited. A fingerprint system used purely for biometric authentication is not.

Article 5(1)(g) : Real-time remote biometric identification in public spaces. AI systems for real-time biometric identification of individuals in public spaces for law enforcement purposes are prohibited, with very narrow exceptions. Those exceptions require: judicial or independent authority authorization, a pre-existing suspect list, temporal and geographic limitations, and human oversight. A body camera that flags a face against a watchlist in real time without all of these conditions met is almost certainly prohibited.

Article 5(1)(h) : Manipulation through nudging. AI systems that manipulate human behaviour through nudging techniques exploiting cognitive vulnerabilities in ways that cause significant harm are prohibited. “Nudging” in this context means overriding autonomous decision-making. The harm threshold is high : routine personalisation and recommendation systems are not captured. But a system that deliberately exploits impulse control disorders or cognitive biases to drive addictive behaviour in ways that cause significant harm to users is.

These eight categories are mutually exclusive and collectively exhaustive. If an AI system falls into any one of them, it is prohibited. There is no conformity assessment path, no CE marking route, no derogation. The only correct response is to not build or deploy that system.

Part 2: AI Literacy - The Most Overlooked Obligation

Article 4 is the most consistently underestimated obligation in the entire EU AI Act. It has been binding since 2 February 2025. It applies to every organization that places AI systems on the EU market or deploys AI systems in EU operations; without exception, without size threshold, and without a gradual implementation period.

Who is covered ? Every provider of AI systems and every deployer of AI systems. “Provider” means any organization that develops an AI system and places it on the market or puts it into service under its own name. “Deployer” means any organization that uses an AI system in its own operations. If your company has an AI system in production (a customer-facing model, an internal tool, an automated decision system) you are a deployer. You are covered.

What is sufficient AI literacy ? The regulation does not prescribe a fixed curriculum or a minimum number of training hours. It defines AI literacy by reference to the organization’s use case and risk profile. Staff working with or alongside AI systems must understand : how the system works, what it can and cannot do, how to oversee its operation, and what incidents to report. That understanding must be documented. And it must be kept current as systems change.

The critical evidence is not whether you ran a training session. It is whether you can show what the training covered, who attended, when it occurred, and how it was updated when the system changed.

Why documentation is the compliance evidence ? A regulator will not ask whether your staff are “generally AI literate.” They will ask: which AI systems are in scope, which personnel groups interact with them, what training did those groups receive, how did you verify sufficient understanding, and when was it last updated. If you cannot answer each of those questions with a controlled document, you cannot demonstrate compliance.

This is the failure mode we see consistently : organizations that have done work in this area but cannot produce evidence that maps to those specific questions. A training attendance sheet that shows a one-hour session on “AI basics” does not demonstrate that the team understands the specific system they operate, its limitations, their oversight role, or the incident reporting path. It demonstrates only that they sat in a room.

What AI Literacy Looks Like

AI literacy is not a generic concept. It is specific to the systems people work with and the decisions they make alongside those systems. The difference between compliant and non-compliant AI literacy is operational specificity.

A warehouse operations team using an AI-powered inventory management system understands what the model does : it predicts demand spikes based on seasonal patterns and supplier lead times. They know what it cannot do : it cannot account for sudden macroeconomic shifts, trade disruptions, or pandemic-era demand anomalies. They know what their oversight role is : they review and override system-generated purchase orders above a defined threshold, and they know what that threshold is. They know what to escalate : when the system consistently under-orders for a specific product category, when demand patterns shift outside the model’s training distribution, when a supplier fails and the model does not adapt. All of this is documented in their operating procedures and reflected in their training records.

A customer service team deploying an AI chatbot knows what the chatbot can resolve autonomously and what it cannot : it can handle password resets and order status queries, it cannot handle contractual disputes or emotional distress calls. They understand when to escalate : the routing logic is visible to them, the escalation triggers are defined, and they know what “confidently wrong” looks like in their specific context. They know what data the system accesses : it reads the customer record and order history, not internal financial data. When the chatbot fails to escalate a case that it should have, they know the incident reporting path. This is operational AI literacy.

A software engineering team integrating a GPAI model via API has reviewed the model’s capability and limitation documentation : the model card, the acceptable use policy, the known limitations list. They understand the boundaries of what the provider has committed to versus what the team is responsible for. They know what constitutes a model failure in their context : when the model’s outputs consistently misalign with their internal knowledge base, when the model’s confidence calibration is wrong for their domain. They know the serious incident reporting path : they have documented what they would report to the provider and under what timeline, and they know that certain failures may trigger obligations under Articles 11 and 17.

A compliance team monitoring AI systems in production understands the logging mechanisms available for each system : what each system records, what the log format is, how to query it. They know what constitutes a reportable incident under the Act : a system failure that causes harm or that could have caused harm if not caught. They know the documentation retention requirements : how long logs must be preserved, which incidents require formal documentation versus informal tracking. They can demonstrate, within 24 hours of a regulator’s request, evidence of their post-market monitoring activities for any given system.

What AI Literacy Is Not

AI literacy is not a one-hour e-learning module on “What is Artificial Intelligence” completed once in 2023 and never updated. It is not a company-wide Slack message stating that employees should “use AI responsibly.” It is not a generic “Introduction to Machine Learning” course that covers neural networks and gradient descent without ever touching the systems your teams actually operate. And it is not a policy document that defines AI literacy in abstract terms without specifying what it means for each role and each system in your environment.

These are not hypothetical failures. They are the actual documentation we see when organizations first conduct their AI literacy gap analysis. The obligation has been binding for over a year with this level of specificity required. Many organizations are not close.

What Compliance Actually Requires Right Now

For prohibited practices : audit every deployed and in-development AI system against the eight categories before your next quarterly risk review. If any system falls into a prohibited category, the compliance path is not a documentation exercise; it is a product decision. Those systems either get redesigned outside the prohibited scope or they get retired.

For AI literacy : identify every AI system in your environment, map it to the personnel groups that interact with it, assess what each group needs to understand about that specific system, deliver and document training against those requirements, and establish a trigger for when that training must be refreshed. The trigger is always a system change, but annual refresh is a minimum baseline.

Both obligations have been enforceable since February 2025. National competent authorities are building enforcement caseloads. The organizations that have addressed these obligations face a regulator’s question. The organizations that have not face penalties.

AI Literacy Compliance Evidence Template

The AI Literacy Compliance Evidence Template is a structured document that maps your AI systems to personnel groups and generates the documentation trail required to demonstrate Article 4 compliance. It covers: system mapping, training topic documentation, evidence of delivery, acknowledgment records, and change-triggered refresh triggers.

Use it to close the gap before a regulator asks the question.

Ailiteracycomplianceevidencetemplate Alephtechnologies

133KB ∙ PDF file

Download

Next: Agentic AI and the EU AI Act - The Governance Gap No One Is Addressing

GPAI Models and the 10²³ FLOP Threshold

Alexis Gil Gonzales — Wed, 20 May 2026 14:13:43 GMT

If you are building a platform that uses a general-purpose AI model, whether you are integrating a frontier model via API, fine-tuning an open-source model for your product, or wrapping a foundation model in a vertical application, the EU AI Act creates specific obligations that fall on your model provider, not on you. That is the key sentence in the entire GPAI regulatory framework, and it is the one that most platform builders have not read carefully enough.

Understanding who bears which obligations, what your providers are required to disclose, and what systemic risk means for the models you are building on is not optional due diligence. It is the prerequisite for making defensible architecture decisions about how to integrate GPAI into your platform without inheriting obligations that belong to someone else.

This is the fourth article in our ten-part series on AI governance. If you have not yet read The Seven Mandatory Requirements for High-Risk AI Systems, that article covers the full set of obligations that apply to high-risk system providers. This article focuses on a distinct regulatory category : GPAI models; and explains how the obligations attach differently depending on whether you are a model provider, a platform builder, or a deployer.

GPAI Is a Distinct Regulatory Category

The EU AI Act treats GPAI models as a separate regulatory object from GPAI systems. This distinction is deliberate and consequential.

A GPAI model is a foundation model : a machine-trained system that produces outputs from inputs and can be adapted to a wide range of downstream applications. GPT-4, Claude, Gemini, Llama, Mistral, and their equivalents are all GPAI models. The model itself is the regulated object.

A GPAI system is what you get when a GPAI model is integrated into a specific application with a defined intended purpose. The platform you are building, the product you are shipping, the application your customers use : that is a GPAI system.

The regulatory implication is straightforward but important: the provider of the model bears the model-level obligations. The builder of the platform or application bears system-level obligations only if the application is high-risk or if they have substantially modified the underlying model. If you are building an application that calls a frontier model’s API (without retraining, fine-tuning, or otherwise changing the model itself) you are a GPAI system builder, not a GPAI model provider. The obligations under Articles 53 and 55 sit with whoever trained and released the model.

This distinction determines which obligations are yours and which are your upstream provider’s.

Article 53 - Obligations for All GPAI Providers

Article 53 imposes four obligations on all providers of GPAI models, regardless of model size, compute footprint, or market share. These are baseline obligations that apply to every organization training and releasing a GPAI model.

1. Technical Documentation

GPAI model providers must produce and maintain technical documentation that describes the model’s architecture, training approach, capability evaluation, known limitations, and intended use cases. The format is specified by the GPAI Code of Practice’s Model Documentation Form : providers do not need to invent their own template.

The documentation must be kept current and must be available to downstream deployers who need it to assess whether the model is suitable for their intended application. This is the primary mechanism by which downstream builders get the information they need to make informed integration decisions.

Practical implication for platform builders : Before you integrate a GPAI model, ask your provider for their technical documentation package. If the provider cannot produce this, that is a signal about their compliance posture — and about the risk you are assuming by integrating their model without adequate documentation.

2. Copyright Compliance Policy

Providers must document the measures taken to identify and comply with rights reservations from training data. The EU AI Act does not require providers to disclose what training data was used. It requires them to demonstrate that they have a process for respecting copyright reservations.

This is a process documentation obligation, not a disclosure obligation. A provider that has a documented copyright compliance process (one that examines training data sources, respects opt-out requests, and logs the steps taken) satisfies Article 53(1)(b) even if they do not publish the training data itself.

For platform builders : Ask your model provider to describe their copyright compliance policy. The question is not what data they trained on : it is what process they use to respect rights reservations. Providers who cannot answer this question have not built the compliance infrastructure that Article 53 requires.

3. Training Data Summary

Providers must publish a summary of the training data characteristics. This is a factual summary; not a full data dump, not a dataset release. The GPAI Code of Practice provides templates for this summary. The purpose is to give downstream deployers enough information to assess whether the model’s training data is appropriate for their use case.

The summary covers the type, volume, and nature of training data; the geographical and temporal scope; and any known gaps or limitations. It is deliberately high-level; it is not a substitute for a full data sheet, but it is the minimum that downstream users need to evaluate a model.

4. Downstream Information Duties

Providers must supply downstream deployers with information that enables them to : understand the model architecture, comply with acceptable use policies, assess known limitations, and evaluate performance factors relevant to the deployer’s intended use.

This obligation is continuous : it applies as long as the model is in service. If a provider updates the model in a way that materially affects performance or safety characteristics, they must update the information they provide to downstream deployers.

Article 55 - Additional Obligations for Systemic Risk Models

Article 55 adds a second tier of obligations for GPAI models that meet the systemic risk threshold. These obligations are more demanding, and the systemic risk threshold is calibrated to capture essentially all frontier models.

The 10²³ FLOP Threshold

A GPAI model is presumed to have systemic risk if the cumulative compute used for training exceeds 10²³ FLOP (floating-point operations). This is a cumulative threshold : it measures the total compute used across the entire training run, not the compute used for any single training step.

The threshold is set at a level that covers the models that trained above it. As of 2025, every major frontier model (GPT-4 class and above) exceeds 10²³ FLOP. The models that do not exceed it are generally smaller open-source models trained on limited compute budgets.

The threshold is not a size categorization; it is a risk categorization. Models above the threshold are not automatically dangerous. They are presumed to pose systemic risk because of their capability profile, which creates the potential for widespread harm if they are misused or if they fail in consequential domains. The additional Article 55 obligations are the regulatory response to that risk presumption.

Article 55 Obligations

For providers of systemic risk GPAI models, the following additional obligations apply :

Adversarial testing. Providers must conduct testing against state-of-the-art adversarial threats before market placement. This means red-teaming the model for capability misuse, emergent dangerous behaviors, and attack vectors specific to the model’s modality.

Safety and Security Framework. Providers must maintain and report a documented framework to the AI Office. This means describing the model’s safety architecture, testing methodology, incident response process, and governance structure for ongoing safety monitoring.

Safety and Security Model Report. Providers must submit a report to the AI Office before market placement. This means documenting model capabilities, known risks, mitigation measures, and residual risk acceptance decisions.

Serious incident reporting. Providers must report serious incidents to the AI Office. This means defining what constitutes a serious incident for your model and establishing and testing an escalation path to AI Office reporting.

The GPAI Code of Practice

The GPAI Code of Practice was published on 10 July 2025 and endorsed by the Commission on 1 August 2025. It provides the templates, methodologies, and procedural guidance that translate the legal obligations in Articles 53 and 55 into implementable documentation requirements.

Signing the Code of Practice is voluntary. Providers who sign it receive a presumption of compliance; in any enforcement action, their adoption of the Code’s guidance is a mitigation factor. Providers who do not sign must demonstrate compliance through alternative means and must report to the AI Office how they intend to satisfy their obligations without the Code’s template support.

The practical reality : The GPAI Code of Practice is the de facto compliance path for frontier model providers. The templates reduce the documentation burden substantially. Non-signers must build equivalent documentation from scratch. Given that the Code was developed with broad industry input and is the reference point that conformity assessment bodies will use, not signing is a decision that requires an explicit alternative compliance strategy.

What Platform Builders Must Confirm

If you are building a platform or product that integrates a GPAI model, your regulatory exposure as a system builder is determined by two questions :

(1) are you substantially modifying the model, and

(2) if the model is above the systemic risk threshold, is your upstream provider meeting their Article 55 obligations?

Substantial modification (retraining on new data, changing the model’s architecture, altering its capability profile in ways that create new risk vectors) can shift you from system builder to model provider. The threshold for substantial modification is not precisely defined in the regulation, but the principle is clear : if you are changing the model fundamentally, you are responsible for the model-level obligations.

For platforms that integrate unmodified frontier models, the obligation structure is :

What you are responsible for as a platform builder :

High-risk system classification : If your platform’s application falls under Annex III (high-risk categories), you bear the full set of Articles 9–17 obligations as a provider
Fundamental Rights Impact Assessment : For high-risk applications affecting fundamental rights, conducted before first deployment
Human oversight : Article 14 obligations for the application layer (your platform must enable effective human oversight of the application’s outputs)
Post-market monitoring : Article 17 obligations for your application (you must actively monitor how your application performs and update your risk records when new risks emerge)
GPAI system registration : Your application must be registered in the EU AI database before market placement

What your GPAI model provider is responsible for :

Model-level technical documentation (Article 53)
Copyright compliance process (Article 53)
Training data summary (Article 53)
Downstream information provision (Article 53)
For systemic risk models: adversarial testing, Safety and Security Framework, AI Office reporting (Article 55)

The documentation chain : Your conformity declaration for your platform references the model provider’s technical documentation. If the model provider’s documentation is incomplete, your ability to demonstrate conformity at the platform level is compromised. Due diligence on your upstream provider’s compliance infrastructure is not just good practice; it is a component of your own compliance posture.

The Systemic Risk Reality

The 10²³ FLOP threshold is not controversial because it is set too low; it is controversial because it is set at a level that captures nearly every frontier model that organizations are building on. This is intentional. The EU regulatory framework’s designers understood that models at the capability level of GPT-4 and Claude carry systemic risk by virtue of their capability profile, regardless of the intent of the organizations deploying them.

What this means for platform builders is concrete : every time you integrate a frontier model’s API and expose it to your users, you are building on infrastructure that is subject to Article 55 obligations. Your provider (OpenAI, Anthropic, Google, Meta, Mistral, or whoever is training the model) bears those obligations. But your platform’s risk management must account for the possibility that your provider’s compliance infrastructure fails.

The question to ask your GPAI provider is not just “are you compliant?” It is “can you produce your technical documentation, your copyright compliance policy, your training data summary, and (if you are above the systemic risk threshold) your adversarial testing results and Safety and Security Framework report?”

If the answer to any of those questions is “no” or “we are working on it,” your platform’s compliance posture is built on a foundation you cannot rely on.

The Documentation You Should Have

For every GPAI model you integrate, you should have on file :

The provider’s technical documentation (Article 53)
The provider’s copyright compliance policy statement
The provider’s training data summary
Documentation of the provider’s acceptable use policy and your platform’s alignment with it
If the model is above the systemic risk threshold : the provider’s adversarial testing evidence and Safety and Security Framework (or an equivalent alternative)
Your own model integration assessment : what changes you made (if any), what the model’s outputs are used for in your application, and how your application maps to the Article 9–17 requirements if it is high-risk

This documentation does not need to be disclosed publicly. It needs to be retrievable. If a regulator asks to see your evidence of GPAI provider due diligence within 24 hours, and you cannot produce it, the gap is not in the provider’s compliance : it is in your own documentation infrastructure.

Next: The 2 February 2025 Enforcement Cliff — Prohibited Practices and AI Literacy Already Binding

Download our free GPAI Compliance Readiness Checklist for Platform Builders

Gpaicompliancereadinesschecklist Alephtechnologies

192KB ∙ PDF file

Download

The Seven Mandatory Requirements for High-Risk AI Systems

Alexis Gil Gonzales — Sun, 17 May 2026 06:01:31 GMT

Many organizations that believe they understand their EU AI Act obligations for high-risk AI systems have not read Articles 9 through 17 carefully enough to discover how far they are from satisfying them.

The assumption is quite common : a policy document exists, a risk assessment was conducted, the AI system is in deployment. Compliance is presumed. But the EU AI Act does not require compliance at the level of policy language : it requires compliance at the level of evidence architecture, documentation discipline, and operational process. A policy document is not evidence. A risk assessment conducted once at design time is not an iterative risk management system. A deployment that no one monitors post-market is not post-market monitoring.

This is the third article in our ten-part series on AI governance. If you have not yet read Why the EU AI Act Applies to You, that article provides the necessary context for understanding who is in scope and which role obligations apply. This article focuses on what high-risk system providers must actually demonstrate to be compliant; and why the gap between what organizations have done and what the regulation requires is, in many cases, very large.

The Integration Problem

Articles 9 through 17 are commonly discussed as seven separate requirements. This framing is misleading. They form an integrated system : a gap in one creates a compliance gap in others, because the evidence for each requirement connects to the evidence for the others.

A risk management system (Article 9) that does not produce documented outputs cannot support a post-market monitoring procedure (Article 17) that requires updating risk records when new risks emerge. A technical documentation system (Article 11) that is not updated when the model changes means the conformity declaration that references that documentation is inaccurate. A QMS (Article 16) that does not have a change management process means every system update creates a documentation compliance violation.

The requirement integration is not optional architecture; it is the compliance architecture. Organizations that address each article as a standalone checklist have built compliance scaffolding that collapses under regulatory scrutiny.

Article 9 - Risk Management System

The risk management system is the foundation of the entire compliance structure. Article 9 requires providers to establish, document, and maintain a risk management system that operates throughout the lifecycle of the high-risk AI system.

What “iterative” means in practice : The most significant word in Article 9 is “iterative.” A static risk assessment conducted at design time does not satisfy the requirement. The risk management process must continue post-deployment as new risks emerge, as the deployment context changes, and as the system evolves. A risk record opened during design must show evidence of review and update at deployment and at post-market phases. Residual risks (those that remain after mitigation measures are applied) must be formally accepted and documented.

In practice, this means many organizations without an existing quality management system have no documented risk management process for AI systems. Opening a risk record and keeping it updated through the system lifecycle requires process infrastructure that typically has a six-to-twelve month lead time to implement properly.

What must be documented :

Risk identification methodology
Risk evaluation criteria and acceptance thresholds
Mitigation measures applied and their effectiveness
Residual risk decisions and who approved them
Risk record updates triggered by system changes or post-market data

Article 10 - Data Governance

Article 10 requires that training, validation, and testing data meet four criteria : relevance, representativeness, error-free, and completeness. Appropriate data governance mechanisms must be established.

The bias examination requirement is explicit and non-discretionary. The Act requires providers to examine whether training data may introduce bias : this is not a discretionary best practice. Providers must document the bias examination methodology, findings, and any mitigation measures applied. A data governance process that does not produce this documentation does not satisfy Article 10.

Special category data (Article 10(3)) : When training data contains special category data under GDPR Article 9 (health data, biometric data, genetic data, race or ethnicity, political opinions, religious belief, sexual orientation) providers must implement additional safeguards. This creates a GDPR-EU AI Act intersection that requires joint legal review. The two regulations interact at this point and organizations that treat them as separate compliance tracks frequently discover gaps in their overlapping obligations.

Article 11 - Technical Documentation

Technical documentation must be drawn up before the system is placed on the market and kept up-to-date. Article 11 specifies this obligation; Annex IV defines what the documentation must contain.

The technical documentation must include, at minimum :

System description and intended purpose
System architecture : design decisions, algorithms, training approaches
Training methodology and training data characteristics
Testing procedures and results
Known limitations and acceptable risks
Cybersecurity measures
Post-market monitoring plan

The documentation update obligation is continuous

Many organizations treat technical documentation as a one-time deliverable at launch. The Act requires documentation to be “kept up-to-date”; meaning every significant system change must be reflected in the documentation. This requires a documentation change management process that is itself part of the QMS (Article 16).

Evidence for conformity assessment

Technical documentation is the primary evidence used by conformity assessment bodies to verify compliance. Poor documentation creates a presumption of non-compliance. The documentation is not a formality, it is the compliance record.

Article 12 - Automatic Logging

High-risk AI systems must be designed and developed to enable automatic logging of events during the system’s lifecycle. Logs must be retained for a minimum of six months.

What must be logged : Events during operation that are relevant for monitoring system behavior, detecting potential issues, investigating incidents, and maintaining compliance with other requirements. The scope of “relevant” is defined by the other requirements : what you need to investigate a serious incident (Article 72), demonstrate human oversight was maintained (Article 14), and verify system performance against documented metrics (Article 17) determines what your logging system must capture.

Six-month minimum retention is a floor, not a ceiling. The clock starts from the moment of the event, not from system deployment. In practice, organizations with related legal holds (litigation, regulatory investigations, product liability) should extend retention accordingly. The six-month floor exists because regulators want to be able to examine recent events. If you have active matters that require longer retention, the longer period governs.

Log integrity : Deployers are responsible for log integrity. Logs must be tamper-evident and retained in a manner that preserves their evidentiary value. A logging system that can be modified after the fact without leaving a record does not satisfy the requirement.

Article 13 - Transparency to Users

High-risk AI systems must be designed and developed to enable deployers to understand the system’s output. Instructions for use must include the intended purpose, accuracy metrics and limitations, human oversight measures, and maintenance requirements.

Three key technical implications :

Output interpretability : The system must produce outputs that the deployer can understand. This does not require explainability of model internals : it requires that the system’s behavior is comprehensible in the context of use. A credit scoring system that produces a score without explaining the factors that drove it has a transparency problem even if the underlying model is sound.
Known limitations disclosure : Providers must document and communicate known accuracy limitations. This is distinct from a general obligation to disclose all possible limitations. “Known” means limitations discovered during development and testing : the ones the provider is aware of at the time of deployment.
Intended purpose as a technical boundary : The instructions for use must clearly define the intended purpose. If a deployer uses the system outside that intended purpose, the provider’s conformity declaration does not cover that use. This creates a documentation incentive to be precise about intended purpose, and a deployer incentive to respect the documented boundary.

Article 14 - Human Oversight

High-risk AI systems must be designed and developed to allow natural persons to effectively oversee, monitor, and intervene; including the ability to decide not to use or to stop the system.

The oversight requirement has three tiers, all of which must be satisfied :

Ability to understand : The human overseer must be able to understand what the system is doing and why. This implies output explanations or decision context that allows meaningful human review.
Ability to intervene : The human overseer must have the technical means to override, stop, or redirect the system. A human who can see what the system is doing but cannot change its behavior has not fulfilled Article 14.
Authority to decide not to use : The human overseer must have the organizational authority to refuse a system’s recommendation or output. Technical capability without organizational authority does not constitute effective oversight.

For most high-risk AI systems, human oversight must be achieved through one or more of :

Output review before action (human-in-the-loop)
Override mechanisms with clear escalation paths
Real-time monitoring dashboards with alert thresholds
Kill switch or emergency stop functionality
Decision documentation and audit trail

The “effective” standard is the critical test : Oversight that exists on paper but is practically inaccessible does not satisfy the requirement. A human overseer who cannot realistically review system outputs before they take effect has not fulfilled Article 14. If your oversight mechanism requires someone to be physically present at a screen at the moment a consequential decision is made, and that person is also doing twelve other things, your oversight is not effective.

Articles 15, 16, and 17 - The Continuous Compliance Loop

These three articles form a continuous loop : the system must perform accurately and robustly (Article 15), the organization must manage quality systematically (Article 16), and the organization must monitor what happens after deployment and respond when the risk picture changes (Article 17).

Article 15 - Accuracy, Robustness, and Cybersecurity

High-risk AI systems must achieve appropriate levels of accuracy, robustness, and cybersecurity, and perform consistently in relation to their intended purpose over their lifecycle.

The three dimensions :

AI-specific cybersecurity threats include data poisoning (corrupting training data to alter model behavior), model evasion (adversarial inputs that cause misclassification), model inversion (extracting training data from model outputs), membership inference (determining whether specific data was in the training set), and for agentic systems : prompt injection (external content that alters system behavior). Article 15 requires protection against these threat classes; not just generic cybersecurity.

Article 16 - Quality Management System

Providers of high-risk AI systems must implement a quality management system (QMS) that ensures and demonstrates compliance with the regulation.

The QMS maps to ISO/IEC 42001:2023. Organizations with ISO 42001 certification are presumed to satisfy Article 16. This is the primary regulatory bridge between the EU AI Act and the broader ISO governance ecosystem : if you already have ISO 42001 certification, you have the structural foundation for Article 16 compliance.

Minimum QMS components :

Document control for technical documentation
Change management process (for model updates, deployment changes)
Risk management documentation process
Post-market monitoring procedure
Non-conformity and incident response process
Internal audit program
Management review process

The QMS is the system that ties the other requirements together. Organizations without a QMS that attempt to comply with Articles 9–15 individually typically build parallel compliance tracks that are expensive to maintain and difficult to audit. The QMS is not an additional requirement on top of the other requirements : it is the structural framework that makes the other requirements sustainable.

Article 17 — Post-Market Monitoring

Providers must establish and maintain a post-market monitoring system to collect and document data about the AI system’s performance and risks throughout its lifecycle.

The active monitoring requirement is non-negotiable. A passive post-market monitoring system (one that collects complaints and waits for problems to be reported) does not satisfy Article 17. Active monitoring means proactively measuring system performance against defined KPIs, instrumenting the system to emit performance data, and updating risk records when new risks are identified.

The Article 72 intersection : Article 72 requires serious incident reporting to national market surveillance authorities. Your post-market monitoring system must have a defined escalation path that triggers Article 72 reporting when incidents meet the serious incident threshold. If your post-market monitoring system does not have a clear definition of what constitutes a serious incident and a documented path to regulatory reporting, it does not satisfy either Article 17 or Article 72.

The Compliance Architecture Is the Point

The integration of these requirements is not an academic observation : it is the practical reality of EU AI Act compliance. An organization that has addressed each article separately has not built a compliant system. An organization that has built the integrated compliance architecture ( where risk management feeds documentation, documentation feeds conformity assessment, post-market monitoring feeds risk management, and the QMS holds the whole structure together) has something that regulators can audit and that provides real protection.

The enforcement authorities for high-risk system obligations become fully operational on 2 August 2026. The window to build this architecture properly (with evidence that can be retrieved, organized, and presented) is narrow. Organizations that are treating this as a documentation exercise will discover that documentation without process infrastructure does not survive regulatory scrutiny.

Next: GPAI Models and the 10²³ FLOP Threshold — What Every Platform Builder Must Know*

Download our free High-Risk Requirements Gap Analysis & Self-Assessment Templates for Providers and Deployers.

High Riskrequirementsgapanalysistemplate Providers Alephtechnologies

276KB ∙ PDF file

Download

High Riskrequirementsgapanalysistemplate Deployers Alephtechnologies

221KB ∙ PDF file

Download

Why the EU AI Act Applies to You

Alexis Gil Gonzales — Fri, 15 May 2026 06:58:46 GMT

The most dangerous assumption in EU AI Act compliance is the one that organizations never examine closely enough to challenge: that the regulation only applies to companies headquartered in the EU.

This assumption is wrong. It is wrong in a way that could expose organizations to significant liability; and it is wrong in a way that is surprisingly easy to fall into, because the extraterritorial scope of the EU AI Act is not obvious from the regulation’s opening articles. You have to read carefully to understand it, and most organizations have not.

This is the second article in our ten-part series on AI governance. If you have not yet read The Fragmented World of AI Governance, that article provides the necessary context for understanding why the frameworks exist and how they relate to the EU AI Act. This article focuses specifically on who the EU AI Act applies to; and the answer is broader than many legal teams expect.

The Extraterritorial Architecture

The EU AI Act applies to three categories of organization, not two.

Providers are organizations that develop AI systems and place them on the market or put them into service in the EU; regardless of where those organizations are established. If you build an AI system anywhere in the world and you sell or provide that system to organizations or individuals in the EU, you are a provider under the EU AI Act. You bear the full set of provider obligations, including the requirements for high-risk AI systems (Articles 9–17) and, if your model exceeds the systemic risk threshold, the GPAI model obligations (Articles 53–55).

Deployers are organizations or individuals that use AI systems under their authority within the EU. If your company has operations in the EU (sales offices, distribution centers, subsidiaries, customer relationships) and you are using AI systems in those operations, you are likely a deployer. Deployer obligations are distinct from provider obligations, and they include requirements around human oversight, input data relevance, log retention, and incident reporting.

Importers and distributors complete the chain. Importers bring AI systems from outside the EU into the EU market. Distributors make AI systems available in the EU market. Both categories carry specific obligations.

The key point for organizations outside the EU is this: the place where your headquarters sits is not the determining factor. The determining factor is where you place AI systems on the market or where you deploy AI systems in the EU. An American company selling AI-powered software to European customers is a provider. A British company running an AI system in its German operations is a deployer. Neither can claim exemption based on the location of their registered office.

The Provider/Deployer Distinction Matters Enormously

Most organizations that touch AI are simultaneously providers and deployers; and the distinction determines which obligations apply to which systems.

Provider obligations are the most demanding. If you are building or modifying an AI system, you are its provider. Substantial modification of a third-party AI system (changing its intended purpose, retraining it on new data, integrating it so fundamentally that it becomes a different product) can make you a provider of what was originally someone else’s system. The provider is responsible for conformity assessment, technical documentation, post-market monitoring, and (for high-risk systems) the full set of Articles 9–17 requirements.

Deployer obligations apply to organizations using AI systems under their authority. If you are using a vendor’s AI system in your operations, you are a deployer - not a provider - unless you have substantially modified it. Deployer obligations include ensuring human oversight is assigned to competent persons, monitoring the system in line with instructions for use, retaining logs for at least six months, and (for high-risk systems affecting fundamental rights) conducting a Fundamental Rights Impact Assessment before first deployment.

The practical problem: many organizations have not catalogued which of their AI systems make them a provider versus a deployer, and they have not recognized that the same system can generate both roles simultaneously if they are both using it and reselling or modifying it.

The Enforcement Timeline Is Already Running

The enforcement timeline for the EU AI Act is not a future concern. Parts of it are already in force.

Since 2 February 2025, two obligations have been binding on all providers and deployers; not just high-risk system providers, not just EU-established organizations, but all providers and all deployers of AI systems anywhere in the world who are covered by the Act’s extraterritorial scope.

Article 5 prohibited practices became enforceable on that date. Eight categories of AI system are categorically prohibited in the EU market : systems that use subliminal manipulation, exploit vulnerabilities, implement social scoring, perform criminal risk profiling beyond specific investigation scope, use emotion recognition in workplaces and educational institutions, categorize individuals by sensitive biometric attributes, conduct real-time biometric identification in public spaces for law enforcement (with narrow exceptions), or manipulate human behavior through nudging techniques that override autonomous decision-making and cause significant harm.

Article 4 AI literacy also became enforceable on 2 February 2025. Every provider and every deployer must ensure that staff working with AI systems have sufficient AI literacy to understand what the system does, what its limitations are, how to oversee it, and what incidents to report. This obligation has no size threshold. It applies to every organization covered by the Act, regardless of headcount.

The enforcement authorities for high-risk system obligations become fully operational on 2 August 2026. The GPAI model obligations (Articles 53–55) became enforceable for newly placed models on 2 August 2025. The GPAI Code of Practice was published in July 2025 and endorsed by the Commission in August 2025 : providers signing it receive a presumption of compliance, which is effectively a mitigation factor in any enforcement action.

The Fine Structure

Non-compliance carries material consequences. For high-risk violations (including non-conformity with Articles 9–17) fines reach €35 million or 7% of global annual turnover, whichever is higher. For GPAI model violations (non-systemic risk): €15 million or 3% of global annual turnover. For incorrect or misleading information: €7.5 million or 1% of global annual turnover. For prohibited practice violations: the same as high-risk violations, €35 million or 7%.

The global turnover measurement is deliberate. It means the fine is calculated on worldwide revenue, not EU revenue; making the financial exposure for large non-EU organizations very significant.

What This Means Practically

The organizations that are most exposed to EU AI Act liability are not necessarily the ones headquartered in Brussels or Berlin. They are the ones building AI systems that are sold or deployed in the EU; and that includes most technology companies with international operations or customer bases.

The first step is determining whether you are in scope at all. That requires an honest examination of: whether you place AI systems on the EU market (as a provider or distributor), whether you deploy AI systems in EU operations (as a deployer), and whether any of those systems are classified as high-risk under Annex III or whether any of them involve profiling of individuals.

If you are in scope, the second step is determining which obligations apply to which systems; and that requires understanding the provider/deployer distinction at a level of detail that many organizations have not yet considered.

The window to close these gaps before enforcement accelerates in August 2026 is narrow. The organizations that are treating this as a compliance exercise rather than a governance capability building exercise will find themselves doing crisis remediation when they should be building sustainable programs.

Next: The Seven Mandatory Requirements for High-Risk AI Systems - A Technical Breakdown.

Download our free EU AI Act Applicability Assessment.

Euaiactapplicabilityassessment Alephtechnologies

126KB ∙ PDF file

Download

The Fragmented World of AI Governance

Alexis Gil Gonzales — Sun, 10 May 2026 09:18:30 GMT

The organizations that will lead in AI governance over the next five years are not the ones waiting for a dominant standard to emerge. They are the ones building governance capability now; while the space is still fragmented, while the frameworks are still evolving, while the window to get ahead of regulatory pressure is still open.

This is the first in a ten-part series examining the state of AI governance: what the frameworks actually say, what the EU AI Act actually requires, and what organizations that take this seriously are already doing.

The Framework Landscape

Open any directory of AI governance frameworks and you will find a landscape that looks, at first glance, like chaos. AIGN, AI SAFE² / AISM, AgentID, Microsoft Responsible AI Maturity Model, ARC Framework, Agentic Governance Framework v2.1, Singapore IMDA, MI9, AI Gateway, GaaS, WEF, Microsoft Frontier Governance Framework, Digital Applied. Thirteen frameworks and counting, each with different authors, different scopes, different maturity claims, and different definitions of what “governance” actually means.

This is the surface objection we hear most often: *how can we build governance when we cannot even agree on which framework to use?*

But this framing misunderstands what the frameworks are; and what the EU AI Act is.

The EU AI Act is not a governance framework. It is a legal floor. It defines what is mandatory, what is prohibited, what is enforceable, and what the consequences of non-compliance are. The governance frameworks (AIGN, AISM, NIST AI RMF, ISO 42001) are implementation substrates. They help organizations organize their thinking, structure their controls, and demonstrate due diligence. None of them replaces the EU AI Act. All of them can be positioned against it.

Understanding this distinction is itself a competitive advantage. Most organizations are either ignoring governance frameworks entirely (treating the EU AI Act as the only game in town) or treating governance frameworks as if one of them will become the SOC2 equivalent for AI. Neither position is correct.

What the Frameworks Actually Measure

The thirteen frameworks are not interchangeable. They differ across six dimensions that matter for organizations making decisions about where to invest governance resources.

Maturity : Early (draft or in-progress), Structured (formal but voluntary), Operational (active tooling or product), or Mature (broad adoption with formal certification path). AIGN and Microsoft RAI Maturity Model sit at the operational or mature end. Singapore IMDA and the WEF framework sit at the early end despite their institutional backing.

Scope : Policy-only frameworks describe what should be documented. Technical control frameworks specify how controls should work. Certification frameworks add a trust label or formal certifiable path. Full lifecycle frameworks cover design through run through retire. AIGN and AISM target full lifecycle; AI Gateway targets technical controls with a commercial product.

Agentic AI Specificity : Generic frameworks address AI systems in general. Explicit frameworks add specific provisions for AI agents without redesigning the framework around them. Agentic-native frameworks are designed from the ground up for autonomous reasoning, planning, and action. Singapore IMDA and MI9 are agentic-native. Microsoft RAI Maturity Model is explicit. Most older frameworks are generic.

Regulatory Alignment : Some frameworks are mapped to the EU AI Act and ISO 42001 as reference points. Some are benchmarked against multiple standards. AIGN explicitly maps to EU AI Act, GDPR, NIS2, and DORA. AISM maps to NIST AI RMF, ISO 42001, EU AI Act, CSA AICM, NIST CSF 2.0, MITRE ATLAS, and OWASP LLM. Other frameworks make no regulatory claim at all.

Runtime Governance : This is the most significant dimension of differentiation and the least understood. Most governance frameworks operate at design time or policy time : they tell you what controls to put in place before a system goes live. MI9 is the only framework with formal runtime governance: it operates continuously during live execution using FSM-based temporal conformance and statistical goal-conditioned drift detection. The gap between design-time governance and runtime governance is where most agentic AI failures occur.

Certification : None (self-assessment only), trust label (AIGN offers this), or formal certifiable (ISO 42001 is the closest to a certifiable path for AI governance). Most frameworks offer neither.

The crosswalk that most organizations find useful is between the EU AI Act, NIST AI RMF, and ISO 42001. These three are not competitors : they are complementary. ISO 42001 provides the management system backbone (certifiable). NIST AI RMF provides the operational risk methodology inside those management system clauses. EU AI Act provides the legal overlay for systems in scope. Approximately 70–80% of controls serve all three frameworks simultaneously — evidence collected once maps to all three.

Why the Fragmentation Is a Signal, Not a Problem

The proliferation of frameworks is not a sign that AI governance is immature. It is a sign that the problem is large enough that many serious organizations are trying to solve it from different angles. The EU AI Act’s existence changes the terms of the conversation.

With the EU AI Act as the regulatory floor, the question is no longer “which framework should we adopt?” It is “which framework helps us demonstrate compliance with our binding obligations in the most operationally efficient way?”

Organizations that anchor to the EU AI Act first, then use governance frameworks as implementation tools, find that the framework selection question becomes much more tractable. The answer is usually: use ISO 42001 as the structural backbone, use NIST AI RMF as the operational risk methodology, and use the EU AI Act’s specific obligations as the checklist that determines what else must be added.

This is the approach that organizations with existing ISO certifications typically take; ISO 42001 is already presumed to satisfy Article 16 of the EU AI Act (the quality management system requirement). If you are already ISO 42001 certified, you have the structural foundation. The remaining work is the gap analysis: what do your obligations under the EU AI Act require that your current ISO 42001 implementation does not yet cover?

What You Cannot Do Without Understanding the Landscape

Organizations that treat AI governance as a compliance checkbox (implement a policy, file it, move on) will find themselves in a difficult position when enforcement accelerates in August 2026. The EU AI Act’s enforcement powers become fully operational at that point, and the national authorities responsible for high-risk AI system obligations will begin examining organizations in their jurisdictions.

More importantly, organizations that do not understand the framework landscape cannot have an informed conversation with their board, their legal team, or their technical leadership about where they actually stand. Governance requires a shared vocabulary. The vocabulary is not “are we compliant?”; it is “where are we on maturity, what is our scope, what agentic specificity do we have, what regulatory alignment have we achieved, and do we have runtime governance or only design-time controls?”

The frameworks exist to answer those questions systematically. The EU AI Act exists to make the answers matter legally.

The Right First Question

The question to ask is not “which framework should we use?” It is “what is our current state across the dimensions that matter, and what do we need to build to meet our binding obligations?”

Answering that question correctly requires understanding what each of the six dimensions actually means in your organization; not in the abstract, but in the specific context of what you are building, how your agents operate, what data you access, and who your deployers are.

That is where the real work begins. And it begins with a honest assessment, not a framework selection.

Next: Why the EU AI Act Applies to You - Even If Your HQ Is Not in the EU

Download our free EU AI Act vs. NIST vs. ISO 42001 Quick Reference Decision Guide

AI Framework Comparison - Quick Reference Decision Guide

156KB ∙ PDF file

Download

Affordable AI Workstations in 2026 - A Practical Guide to Running Large Language Models Without Going Bankrupt

Alexis Gil Gonzales — Mon, 04 May 2026 19:28:52 GMT

I’ve been making some research on a hardware upgrade we want to make to our (tiny) datacenter. wo years ago, running a competent language model locally meant making peace with tokens, cryptic setup rituals, and the constant threat of an out-of-memory crash. That era is decisively over. In 2026, you can run a 70-billion-parameter model on hardware that fits on a desk, sips under 200 watts, and costs less than a used car.

Before going further, a distinction that will matter throughout this article: most models discussed are dense models, where every forward pass activates all parameters. Mixture-of-Experts models work differently. A 109-billion-parameter MoE like Llama 4 Scout only activates about 17 billion parameters per token, since each expert handles a slice of the workload. The total model size is still large and needs to live in memory, but the active compute per token is much smaller. This means MoE models fit the memory-capacity constraints of unified-memory platforms far better than their dense cousins at a given parameter count, while delivering competitive quality.

The democratization of large language models is no longer a distant promise. It is happening right now, on consumer-grade hardware, in home offices and small labs. The technology has matured to the point where the barrier to entry is not technical sophistication but simply knowing which hardware to buy and what it can actually do.

This article surveys three categories of affordable AI workstations available in 2026: the AMD Ryzen AI Halo platform built around the Strix Halo APU, NVIDIA’s compact DGX Spark, and the evergreen option of discrete RTX consumer GPUs. We will decompose what LLM workflows actually demand from hardware, match each platform to the tasks it handles well, and close with a look at where this hardware race is heading, including a noteworthy newcomer from Apple.

What Local LLMs Actually Need - Hardware Decomposition

Before comparing machines, it helps to understand what running a language model actually requires at each stage. LLM workflows fall into four broad categories, each with very different hardware demands.

Inference is the simplest operation: you give the model a prompt, it generates a response. The model weights must live in memory. At full 16-bit precision, each billion parameters consumes roughly 2 gigabytes. A 7-billion-parameter dense model needs about 14 gigabytes just for its weights in FP16. A 70-billion-parameter dense model needs 140 gigabytes, which no consumer graphics card can provide. A 109B MoE model at FP16 would need 218 gigabytes total, but only 17 billion parameters activate per forward pass, making the active working set far smaller than the stored size suggests.

The memory requirement to simply store weights is separate from the compute requirement to run them. This distinction matters enormously for hardware selection: a platform that cannot store a dense 70B model may still handle a 100B+ MoE model because the hardware only needs to process the active expert parameters during each forward pass, even though all parameters must remain in memory.

Quantization solves this. By storing weights in 4-bit or 8-bit precision, you shrink model sizes dramatically. The community-standard Q4_K_M quantization fits a 7-billion-parameter dense model in about 4 to 5 gigabytes and a 70-billion-parameter dense model in 35 to 40 gigabytes. MoE models behave differently: a 109B MoE like Llama 4 Scout at Q4_K_M occupies roughly 60 gigabytes total, but only 17B parameters activate per forward pass, making it far more tractable on bandwidth-constrained hardware than a dense 70B despite having more total parameters. This is how consumer hardware became viable for large models: not through more VRAM, but through smarter weight representation and architectural choices that reduce active compute per token.

Context length compounds memory needs through the KV cache, which stores attention state for every token in the context window. A 7-billion-parameter model at FP16 precision needs roughly 1 gigabyte for a 4K context and 8 gigabytes for a 32K context. Push to 128K and you are looking at 32 gigabytes just for the cache on that same 7B model.

Fine-tuning adjusts a pre-trained model’s weights for a specific task. Full fine-tuning, which updates every parameter, is extraordinarily memory-hungry. Beyond the model weights themselves, you need to store gradients (the direction each weight should move) and optimizer states (the AdamW optimizer keeps a running estimate of gradient moments in 32-bit precision). For a 7B model at FP16, this multiplies to roughly 70 to 84 gigabytes total. That is a datacenter workload.

LoRA and QLoRA changed the math. LoRA freezes the base model and trains tiny adapter matrices inserted between layers. The adapters are typically 0.1 to 1 percent of the total parameter count, so a 7B model might only need 16 to 20 gigabytes with LoRA at FP16. QLoRA goes further by also quantizing the frozen base model to 4-bit NF4 format, dropping the 7B requirement to 6 to 10 gigabytes. A 70B model that would need 600 gigabytes for full training fits in 32 to 48 gigabytes with QLoRA.

Knowledge distillation trains a smaller student model to replicate the behavior of a larger teacher model. The teacher runs in inference mode while the student is trained on its outputs. The key nuance is that the student model is typically one-fifth to one-tenth the size of the teacher, so its training cost is dominated by the student architecture, not the teacher. However, the teacher must remain in memory during distillation, adding an inference-time memory overhead. Compared to full training from scratch, distillation is less demanding because the student is small and usually already pretrained on general data, so fewer training steps and a smaller dataset are needed to achieve the target capability. Compared to full training of a same-sized model from scratch, distillation is cheaper still: you are not updating all parameters from random initialization against a full pre-training corpus.

Full training from scratch belongs in an entirely different category. Training a new 7B model from initialization requires hundreds of gigabytes across gradients, optimizer states, activations, and weights. This is the domain of GPU clusters with terabytes of HBM memory. We mention it only to draw a clear line: no workstation in this article is designed for this, and anyone suggesting otherwise is selling fantasy.

The Three Platforms

AMD Ryzen AI Halo

AMD’s Ryzen AI Halo platform, built around the Ryzen AI Max+ 395 processor, is the newest entrant to this market. At its core is a 16-core Zen 5 CPU with an integrated GPU featuring 40 RDNA 3.5 compute units and an NPU rated at 50 TOPS. The standout feature is its unified memory architecture: up to 128GB of LPDDR5X memory shared between the CPU, GPU, and NPU, with a 256-bit memory bus delivering around 212 GB per second in practice.

AMD’s Variable Graphics Memory technology lets you dedicate up to 96 gigabytes of that pool as GPU-addressable VRAM. This is the critical advantage: no consumer discrete GPU comes close to 96 GB of video memory. A 128-gigabyte Strix Halo system running a dense 70-billion-parameter model quantized to Q4 sits comfortably in that headroom while leaving 32 gigabytes for the operating system and tooling. An MoE model of similar total parameter count fits just as easily, since the memory footprint is comparable but the active compute per token is lighter.

The trade-off is bandwidth. At roughly 212 GB per second, Strix Halo is memory-bandwidth-bound for dense 70B models, producing 3 to 5 tokens per second for large dense models. Mixture-of-Experts models like Llama 4 Scout perform better since they only activate a fraction of parameters per forward pass. For smaller models that fit within the bandwidth budget, token rates are competitive with discrete GPUs.

The software story has matured significantly. ROCm 7.2, released in late 2025, brought official PyTorch support on Linux and public preview support on Windows. Most notably, AMD confirmed in January 2026 that ROCm is now a first-class platform for vLLM. llama.cpp runs on AMD GPUs through both ROCm and Vulkan backends, with community testing consistently finding Vulkan the more reliable and often faster path for Strix Halo APUs.

Pricing is the platform’s strongest card. Fully configured mini-PCs with 128 GB of memory and a Ryzen AI Max+ 395 are available for around $2,500 depending on the vendor, with the Framework Desktop and Beelink GTR9 Pro representing the most polished options. AMD’s own reference platform, launching mid-2026, is expected in the $2,000 to $3,000 range.

NVIDIA DGX Spark

NVIDIA’s DGX Spark, formerly known as Project DIGITS, is the smallest member of the DGX family but shares the same software stack as its datacenter siblings. It is built around the GB10 Grace Blackwell Superchip, pairing a 20-core Arm CPU with a Blackwell GPU featuring 6,144 CUDA cores and fifth-generation Tensor Cores. The result is a system capable of 1 PetaFLOP at FP4 sparse precision.

The DGX Spark ships with 128 GB of unified LPDDR5X memory at 273 GB per second bandwidth, running NVIDIA’s DGX OS 7.4, which is Ubuntu 24.04 with CUDA 13 and a curated AI software stack pre-installed. PyTorch with Blackwell optimizations, TensorRT-LLM, vLLM, Ollama, and Docker with NVIDIA Container Runtime all come configured out of the box. If you want to prototype on a compact desktop machine and deploy to an H100 cluster, the software environment is identical on both.

Inference scales to 200 billion parameters at FP4 quantization on a single unit, with two DGX Spark units linked via ConnectX-7 at 200 gigabits per second capable of running 405 billion parameter models. Fine-tuning supports full fine-tuning up to 8 billion parameters at 16K context length and QLoRA up to 70 billion parameters. The CUDA ecosystem, torch.compile support, and native Docker GPU passthrough make this the most capable platform for developers working across training, fine-tuning, and inference.

The current price is $4,699 for the Founder’s Edition with 4 TB of NVMe storage, up from the $2,999 announcement price and $3,999 reservation price, reflecting the realities of the memory market in early 2026.

Discrete RTX GPUs

The traditional path to local AI compute remains relevant in 2026, now powered by the RTX 50 series Blackwell architecture alongside capable used RTX 40 series hardware.

The RTX 5090 leads the consumer lineup with 32 gigabytes of GDDR7 memory, 1,792 GB per second bandwidth, and 3,352 AI TOPS. At $1,999 MSRP, it is the only single consumer GPU that can fit a dense 70B model at Q4 with meaningful context headroom. MoE models are a different story: a 200B+ MoE at Q4 may need 80 to 100 gigabytes just for weights, and while the active compute per token is smaller, the total storage requirement is the bottleneck for loading. The RTX 5090 is PCIe 5.0, which matters more for fine-tuning data movement than for inference.

The RTX 5070 at $549 MSRP has displaced the RTX 4070 as the budget sweet spot. With 12 GB of GDDR7 and roughly 35 to 45 percent faster inference performance than the RTX 4070 Ti, it handles 7B and 13B models comfortably and fits 14B at Q4.

The used market offers compelling alternatives. The RTX 4090 at $700 to $900 used delivers 24 GB of GDDR6X and remains the best single-GPU option for 30B+ models. The older RTX 3090 at $450 to $600 used is the best VRAM-per-dollar option for running larger models on a budget, trading some speed for the 24-gigabyte ceiling.

The RTX platform’s advantage is ecosystem breadth. CUDA, PyTorch, TensorRT, and every popular inference framework have been optimized for NVIDIA consumer GPUs for over a decade. The community knowledge base is unparalleled. The disadvantage is the same as always: discrete VRAM is finite and expensive to expand. A 70B model at high context lengths will simply refuse to load on any single consumer GPU.

Matching Platforms to Workflows

Pure inference on 70B+ models: Strix Halo and the DGX Spark have no competition among consumer-class hardware here. Neither can match the token throughput of a bandwidth-rich discrete GPU on smaller models, but both can load a dense 70B Q4 model that a single RTX 5090 can only fit at tighter quantizations or with CPU offloading. MoE models extend this advantage further: Strix Halo’s 96 GB VGM allocation accommodates MoE models up to roughly 200 billion total parameters at Q4, delivering usable token rates despite the massive total size, because only a fraction of parameters activate per forward pass. Strix Halo wins on dollar value; the DGX Spark wins on CUDA ecosystem and fine-tuning capability.

Development, fine-tuning, and CUDA workflows: The DGX Spark is the default recommendation. The out-of-the-box software stack, torch.compile support, and seamless path from prototype to datacenter deployment make it the most serious development workstation in this comparison. If you are writing training code, building agents, or iterating on fine-tuning recipes, this is the machine that will not fight you.

Speed-first inference on smaller models: An RTX 5090 or RTX 4090 used setup wins on tokens per second for any model that fits in its VRAM. For 7B through 34B models at Q4 with 8K to 32K context, discrete GPUs generate tokens 3 to 4 times faster than unified-memory platforms at similar cost.

Fine-tuning on a budget: Strix Halo at $800 to $1,500 is surprisingly capable with QLoRA. A 7B model fine-tune fits in 6 to 10 gigabytes. A 13B model needs 10 to 16 gigabytes, still within Strix Halo’s 96-gigabyte VGM allocation. The DGX Spark handles LoRA and QLoRA comfortably and adds full fine-tuning for 8B models. RTX 4090 used remains competitive for LoRA on 7B to 13B models.

Portable workstation: The DGX Spark’s 1.2-kilogram footprint and sub-200-watt power draw make it the most capable machine that can live permanently on a desk without becoming furniture. Strix Halo mini-PCs are similar in this regard. Neither is laptop-class, but both are far more desk-friendly than a traditional workstation with a full-length graphics card.

Hardware Evolution and the Road Ahead

The unified memory architecture that started with Apple’s M1 Ultra in 2021 has become the defining trend in AI workstation design. Both the DGX Spark and AMD Strix Halo have followed Apple’s lead, confirming that the traditional separation between system RAM and GPU VRAM is an artifact of PCI Express bandwidth constraints, not an engineering necessity.

This convergence is happening because LLM inference does not map cleanly onto either CPU or GPU design points. The attention mechanism is memory-bandwidth-bound rather than compute-bound, which means raw shader FLOPS matter less than memory capacity and bandwidth. Unified memory eliminates the PCIe bottleneck at the cost of sharing bandwidth between compute and memory, a trade-off that increasingly favors large model capacity over raw speed.

The other significant trend is the maturation of quantization. What started as a desperate measure to fit large models into small VRAM has become a first-class inference technique. Q4_K_M is now the community standard, and frameworks like llama.cpp and vLLM handle it with no user intervention required. The average practitioner no longer needs to understand the difference between GPTQ and AWQ internals to run a 70B model on consumer hardware.

A Special Note on Apple M5 Max

No survey of AI workstations would be complete without acknowledging Apple’s Silicon trajectory, even though the M5 Max sits outside the three platforms we have focused on.

The M5 Max, announced in March 2026, introduces a dual-die 3-nanometer Fusion Architecture with up to 40 GPU cores each containing a dedicated Neural Accelerator, in addition to a 16-core Neural Engine. With up to 128 GB of unified memory at 614 GB per second bandwidth, it runs a 70B Q4 model at roughly 85 tokens per second, comparable to an RTX 5090 for that specific task while consuming a fraction of the power.

The architectural novelty is the per-core Neural Accelerator, which provides dedicated matrix-multiplication throughput that the previous generation lacked. Apple’s MLX framework is the software layer that extracts this performance, and it is genuinely impressive for a unified-memory platform. The catch remains what it has always been: MLX is Apple-only. Any code written for it does not transfer to CUDA or ROCm environments.

For researchers and developers invested in the Apple ecosystem, the M5 Max is the best portable AI machine Apple has shipped. For everyone else, the ecosystem lock-in is a legitimate concern that the hardware excellence cannot fully offset.

---

The Ecosystem Divide

One practical dimension that cuts across all three platforms is the software ecosystem.

NVIDIA’s CUDA is the default for AI research. PyTorch, TensorFlow, JAX, and every major ML framework have CUDA as their first-class target. The DGX Spark runs the same containers as an H100 cluster. If your work involves anything beyond inference, CUDA compatibility is not optional, it is the foundation.

AMD’s ROCm has closed much of the gap in 2025 and 2026. vLLM support, growing PyTorch compatibility, and llama.cpp validation on Strix Halo APUs have removed the worst pain points. ROCm still requires more care in setup than CUDA, and some libraries lag behind, but the trajectory is clear. AMD is serious about being a CUDA alternative, not just a cheaper CUDA alternative.

Apple’s MLX is the fastest framework on M5 Max hardware but exclusively available on Apple Silicon. The lock-in is real. MLX models are not drop-in replacements for CUDA models, and the community support, while growing, does not approach the breadth of the NVIDIA ecosystem.

The practical implication is straightforward: if you are doing anything beyond inference, the DGX Spark’s software advantage is substantial. If you are purely running inference and cost matters, Strix Halo offers the best model-capacity-per-dollar by a significant margin. If you are already in the Apple ecosystem and want the fastest possible local inference, the M5 Max delivers.

---

Summary and Closing Thoughts

The following table summarizes our comparison.

And as a rough capability matrix we’d have :

2026 is a peculiar year to be buying AI hardware. The pace of improvement means that any machine purchased today will look modest within two years. But the floor has risen dramatically. A $2,000 Strix Halo box can run models that required a $50,000 server rack two years ago. A $4,700 DGX Spark gives you datacenter-class software in a desktop form factor. Even a $550 RTX 5070 handles inference workloads that demanded a dual-GPU workstation not long ago.

The question is no longer whether local AI is viable. It is which platform matches your workflow, your ecosystem preferences, and your budget. The good news is that all three options covered here are legitimate, production-capable machines, not science projects. Pick the one that fits how you actually work

Local Models - A Guide to Running LLMs on Your Hardware

Alexis Gil Gonzales — Fri, 01 May 2026 17:45:30 GMT

Not so long ago, running a halfway-decent language model locally meant praying to the GPU gods and sacrificing your afternoon to a loading spinner. Those days are gone. The local LLM ecosystem has evolved at a pace that would make even the most jaded tech optimist raise an eyebrow.

Today’s landscape offers genuine, production-ready models for nearly every machine configuration : from beefy workstations with 64GB+ of RAM down to laptops that wouldn’t dream of touching a gaming GPU. The question isn’t whether local AI is viable anymore. It’s which model actually belongs on your system.

This guide cuts through the noise. I’ve attempted to draw a clear picture of where each model shines, where it wobbles, and (most importantly) which task it was born to handle.

Let’s dive in.

Understanding the Stack - Quantization, VRAM, and Why It Matters

Before getting to the good stuff, a quick explainer for the uninitiated.

VRAM (Video RAM) is the memory on your graphics card. When running AI models locally, all the model weights and computations happen on the GPU, so you need enough VRAM to hold the entire model; think of it as RAM dedicated to your graphics card. More VRAM means you can run larger models or higher precision quantization.

When we talk about models in this guide, we’re almost exclusively discussing GGUF-quantized models : a format that squeezes large models into smaller file sizes with minimal quality loss. The quantization level (Q8_0, Q6_K, Q4_K_M, etc.) represents the precision at which the model’s weights are stored. Lower numbers mean smaller files and less VRAM required, but also some quality degradation.

For context: a Q8_0 model retains ~99% of quality but needs significantly more VRAM than a Q4_K_M model at ~95% quality. For most users, Q6_K or Q4_K_M hits the sweet spot between size and performance.

Also worth noting: context length (e.g. how many tokens the model can “see” at once) varies dramatically. Some models offer 128K tokens (enough for a short novel), while others push to 1M tokens (enough to ingest your entire code repository).

Let’s meet the players.

The 64GB Tier - Where Power Meets Patience

These are the models that make you reconsider whether you really needed that monitor upgrade. They demand serious hardware but deliver real flagship-level performance.

Qwen3.6-27B - The General Purpose Titan

If you want one model that does nearly everything at a high level (coding, reasoning, agent workflows, general chat) the Qwen3.6-27B is your workhorse. This dense model punches well above its weight class, achieving 77.2% on SWE-bench Verified (a coding benchmark) and 94.1% on AIME26 (math olympiad problems). For reference, that’s better than some models twice its size.

The secret sauce? A hybrid architecture combining Gated DeltaNet with Gated Attention, plus multimodality that handles images and video out of the box. The “Thinking Preservation” mechanism keeps multi-step reasoning coherent across iterations : a godsend for complex agent tasks.

At ~28.6GB in Q8_0, you’ll need a serious GPU. This isn’t a model for the faint-hearted or the RTX 3060 crowd.

Qwen3.6-35B-A3B - The Efficient Performer

Meet the MoE (Mixture of Experts) counterpart to the 27B. With only 3 billion parameters active per token out of 35B total, this model flies : 3-5× faster inference than the dense 27B while maintaining comparable quality.

The benchmark picture tells the story : 73.4% on SWE-bench Verified, 92.7% on AIME26, and 85.2% on MMLU-Pro. If you’re building coding agents or need fast iteration on complex tasks, this is the MoE you want.

The downslde is that Q6_K still needs ~25.6GB, so plan accordingly.

Llama 3.3 70B - The Safe Big-Model Choice

Meta’s 70B remains the reliable choice for workloads that need breadth. With 128K context and excellent multilingual support, it’s the workhorse for long-form writing, broad world knowledge, and situations where model reliability trumps raw benchmark chasing.

On IFEval (instruction following), Llama 3.3 70B actually outperforms models twice its size at 92.1%. It’s the model you reach for when you need to trust that your instructions will be followed without drama.

It won’t win any benchmark beauty pageants in 2026. But it will consistently deliver solid outputs, and sometimes that’s worth more than a flashy leaderboard position.

Gemma 4 31B - The Reasoning Champion

Google DeepMind’s Gemma 4 31B is the math and coding specialist that makes other models nervous. With 89.2% on AIME 2026 (math) and a Codeforces ELO of 2150, it’s the choice when your work involves analytical challenges.

The multimodal support is excellent (text, image, audio, video) and the native thinking mode lets you watch the model reason through problems step by step. For anyone doing complex technical work, this model’s thinking process is almost as valuable as its outputs.

At ~32.6GB in Q8_0, it’s a workstation model. But if your work involves heavy reasoning and coding, the investment pays off.

Kimi-Linear-48B-A3B - The Context King

When 1M token context lengths were a novelty, the Kimi-Linear-48B made them practical. Its hybrid linear attention architecture delivers 6× faster decoding at 1M tokens compared to traditional attention, with a 75% reduction in KV cache memory usage.

For research, massive document analysis, or whole-codebase Q&A, this is the model that makes “ingest everything” actually feasible on local hardware. Just don’t expect the highest raw benchmark scores : the architecture trades some MMLU-Pro performance for that incredible context efficiency.

The Specialists

Three more models deserve quick recognition:

- Nemotron Super 49B v1.5 : NVIDIA’s reasoning specialist, hybrid Mamba-Transformer architecture, optimized for agentic tasks. The 1M token context is real, not marketing.

- Qwen3-30B-A3B-Thinking-2507 : The thinking model for when you need visible, step-by-step reasoning on math and logic problems. 85% on AIME25, with a thinking mode that actually works.

- Qwen3-VL-32B : Vision-language specialist for OCR, document parsing, chart analysis, and multimodal agent workflows. If you need to understand images deeply, this is your model.

The 32GB Sweet Spot - High Performance, Realistic Hardware

This tier represents the realistic sweet spot for most power users : machines with 32GB VRAM that still want decent performance without the workstation upgrade.

Qwen3.5 27B — The People’s Champion

The Qwen3.5 27B is what happens when quantization maturity meets excellent architecture. At Q6_K_M (~25GB), it delivers 86.1% on MMLU-Pro and 95% on IFEval; numbers that would have turned heads two years ago at twice the size.

The 262K native context means you can actually use that context without jumping through YaRN hoops. Multimodal support handles images. Tool calling works reliably. For general-purpose work (writing, research, coding, agents) this model delivers great quality at a realistic footprint.

Gemma 4 31B - Premium Quality, Premium Price

The 31B dense model for when quality is non-negotiable and speed is nice-but-not-essential. At 89.2% AIME and 2150 Codeforces ELO, the benchmark case writes itself. The 256K context and multimodal support are just icing.

On the minus side, Q6_K needs ~25GB, and Q4 still wants 18GB. Plan your VRAM accordingly.

Qwen3.6-35B-A3B (UD-Q4_K_M) - The Efficient Performer

Remember this model from the 64GB tier? At Q4_K_M quantization (~20-21GB), it becomes a real 32GB option without meaningful quality loss. The MoE architecture means you get 35B parameters worth of quality at 3B active parameter speed.

For coding agents, tool use, and fast iteration; this is the model that lets your 32GB machine punch above its weight class.

DeepSeek-R1 Distill Qwen 32B - The Math Specialist

When your work is math-heavy, this distilled DeepSeek R1 model delivers exceptional results. 94.3% on MATH-500 and 72.6% on AIME 2024. Those numbers belong to models twice its size.

The tradeoff : code performance (57.2% on LiveCodeBench) lags behind the generalists, and the 128K context is notably shorter than competitors. But if you’re building a math-focused application, the R1 distillation is remarkably cost-effective.

Mistral Small 24B - The Agentic All-Rounder

Mistral’s 24B hits a different niche : tool-calling and agent workflows. With 84.8% on HumanEval and strong instruction following, it’s the model for building assistants that need to reliably call functions and execute multi-step workflows.

The 32K context is the main limitation. But for local business automation, chat interfaces, and function-calling heavy applications, Mistral Small delivers at a reasonable footprint (~19GB Q6_K).

The Supporting Cast

- Gemma 4 26B A4B : MoE efficiency with 4B active params out of 26B total. Lower absolute performance than the 31B dense, but excellent for its VRAM efficiency.

- Qwen3.5 9B : A remarkable compact performer at ~13GB. 82.5% MMLU-Pro makes it a legitimate daily driver for users who don’t need maximum quality.

- Llama 3.1 8B : The stable, mature option for users who value ecosystem over benchmarks. Still useful for RAG, document ingestion, and long prompts.

The 16GB Realism Tier - Doing More With Less

This is where local AI gets really democratized. 16GB VRAM (the domain of gaming GPUs and mobile workstations) can now run models that would have been science fiction a few years ago.

Qwen3.5 9B - The Daily Driver

At ~9GB in Q4_K_M, this is the model that fits on an RTX 3060 while delivering 82.5% MMLU-Pro and 91.5% IFEval. That’s not “good for a 9B model” : that’s just good, period.

For general chat, drafting, research, and daily tasks, the Qwen3.5 9B is the choice that makes local AI practical for anyone with consumer hardware.

DeepSeek-R1 Distill Qwen 7B - The Math Genius

The distilled R1 reasoning capabilities in a 4-5GB package.92.8% on MATH-500 and 55.5% on AIME. For math-focused applications, this is the budget choice that doesn’t compromise.

Just don’t ask it to code. At 37.6% on LiveCodeBench, the R1 distillation is a specialist, not a generalist.

Qwen2.5 Coder 7B - The Code Specialist

Speaking of specialists: if coding is the task, the Qwen2.5 Coder 7B delivers ~85% on HumanEval at ~4.7GB. For completions, refactors, debugging, and repo Q&A, this is the dedicated code model that beats generalists on code tasks.

On the minus side, general knowledge (40.1% MMLU-Pro) is not this model’s strength.

Phi-4 Mini Reasoning - The Compact Thinker

At 3.8B parameters and ~2.5GB, the Phi-4 Mini Reasoning punches far above its weight class on math. 94.6% on MATH-500 and 57.5% on AIME; those numbers are remarkable for a sub-4GB model.

English-only is the core limitation. But for math-heavy applications where you need reasoning in a tiny package, Phi-4 Mini Reasoning is a revelation.

Gemma 4 E4B - The Multimodal Lightweight

For tasks that need vision without the VRAM cost, Gemma 4 E4B delivers text + image + audio understanding at ~5-6GB. It’s the model for edge deployment, laptops without dedicated GPUs, and applications that need multimodal support without the flagship footprint.

Benchmarks are modest, but the capability-to-footprint ratio is exceptional.

The Micro Models

The bottom of the stack still has great utility:

- Phi-3.5 Mini : Strong code (86% Python) and 128K context in ~2.8GB. Older model, but still useful.

- Qwen3.5 2B : 262K context in 1.3GB. The tiny giant for long-context retrieval tasks.

- Qwen3.5 0.8B : 262K context in under 1GB. Classification, routing, triage; tasks that don’t need reasoning.

- Gemma 4 E2B-it: Multimodal in 4GB. Runs on smartphones. The edge AI frontier.

Use Case Recommendations

After reviewing the full landscape, here’s a practical guidance.

For 64GB+ Workstations :

- General purpose : Qwen3.6-27B, the do-anything flagship

- Speed + quality : Qwen3.6-35B-A3B, MoE efficiency, superior quality

- Math & reasoning : Gemma 4 31B, 89.2% AIME speaks for itself

- Longest context : Kimi-Linear-48B-A3B, 1M tokens, 6× faster

- Coding agents : Qwen3-Coder 30B-A3B, specialized for code work

For 32GB Machines :

- Best overall : Qwen3.5 27B, benchmark leader, excellent quality

- Value pick : Qwen3.6-35B-A3B Q4_K_M, MoE efficiency at realistic VRAM footprint

- Premium quality : Gemma 4 31B Q6_K, when quality trumps everything else

- Math focus : DeepSeek-R1 32B, the math specialist

- Tool calling : Mistral Small 24B, agentic workflows done right

For 16GB Machines :

- Best benchmarks : Qwen3.5 9B Q4, leaderboard-level scores at consumer GPU price

- Math + budget : DeepSeek-R1 7B, exceptional math, tiny footprint

- Coding specialist : Qwen2.5 Coder 7B, dedicated code model

- Compact reasoning : Phi-4 Mini Reasoning, 2.5GB of math magic

- Edge/mobile : Gemma 4 E2B, truly portable AI

Liquid Foundation Models - The Architecture That Thinks Differently

When every other model in this guide relies on Transformer derivatives (attention mechanisms, feed-forward layers, the usual suspects) Liquid Foundation Models take a fundamentally different approach. Built on Liquid Neural Networks (LNNs), these models are rooted in dynamical systems and signal processing rather than the attention-is-all-you-need paradigm. The result is a family of models that prioritizes real-time adaptation, millisecond latency, and genuine on-device deployment.

Liquid AI’s model lineup spans two main series: LFM2 and the newer LFM2.5, with variants ranging from 350M to 24B parameters. The philosophy is consistent across sizes: build models that run efficiently anywhere, from cloud servers to smartwatches, without sacrificing reliability.

The LFM2 Series - Production-Ready Foundations

The LFM2 series represents Liquid AI’s current production offering, designed for developers who need deploy-anywhere flexibility.

Text Models

- LFM2-350M : The lightest option in the family. CPU, NPU, and GPU execution make it genuinely device-agnostic : this model can run on hardware that wouldn’t dream of running a Llama variant. Benchmarks are modest (43.43% MMLU, 65.12% IFEval), but for simple classification, extraction, and routing tasks, it’s remarkably capable.

- LFM2-700M : The efficiency midpoint. Multilingual support is a standout feature — if you’re building applications that need to handle non-English text without cloud dependency, this model’s language handling is a genuine asset. 49.9% MMLU and 72.23% IFEval place it ahead of Qwen3.5 2B on most metrics while maintaining a similar footprint.

- LFM2-8B-A1B : The 8-billion parameter MoE variant with only 1B active parameters per token. This is Liquid AI’s answer to the Qwen3.5 9B question : comparable quality to dense 8B models at a fraction of the active compute. For on-device AI assistants, local chat, and privacy-sensitive applications, this model makes strong sense.

- LFM2-24B-A2B : The flagship text model in the LFM2 series. With 24B total parameters and only 2B active per token, it achieves tool-calling and agentic capabilities on consumer hardware without cloud dependency. This is the Liquid model for serious local agents; the one that may replaces your cloud API calls for all but the most demanding tasks.

Vision-Language Models

- LFM2-VL-450M : Compact multimodal processing at under 500M parameters. Text and image understanding in a package that can run on edge devices. For mobile applications, IoT dashboards, and vision tasks where latency matters, this model delivers.

- LFM2-VL-3B : The larger vision specialist at 3 billion parameters. Edge-optimized but capable of meaningful image understanding, document parsing, and multimodal agent workflows. This is the vision model for applications that need real image comprehension but can’t afford cloud round-trips.

The LFM2.5 Series - Scaled Intelligence

The LFM2.5 series marks Liquid AI’s next evolution, with models pretrained on 28T tokens using a scaled reinforcement learning pipeline. The quality jump is noticeable across the board.

Text Models

- LFM2.5-1.2B-Base : The base model for the 2.5 series. 28T tokens of pretraining gives this 1.2B model a quality floor that punches well above its weight class. For developers who need a reliable base to fine-tune, this is a strong starting point.

- LFM2.5-1.2B-Instruct : The instruction-tuned variant, optimized for agentic tasks and reliable instruction following. If you’re building local assistants, this model delivers the follow-instructions behavior you’d normally need a 7B+ model for.

- LFM2.5-1.2B-Thinking : The reasoning variant enables on-device reasoning under 1GB. Yes, a thinking/reasoning model that fits in less than 1GB of memory. For math-heavy applications where you want visible step-by-step reasoning on embedded hardware, this is a fine achievement.

- LFM2.5-350M : The smallest LFM2.5 model. Liquid AI’s “no size left behind” philosophy means even the smallest model gets the full treatment. This isn’t a neglected also-ran, it’s a first-class citizen in the family.

Vision-Language Models

- LFM2.5-VL-1.6B : Production-ready multimodal agents on any device. 1.6B parameters handling text and images together, built for the kind of edge deployment that other vision models can’t achieve.

- LFM2.5-VL-450M : The compact vision option for the 2.5 series. Structured visual intelligence at the edge, with the same architectural benefits as the rest of the Liquid lineup.

Audio Model

- LFM2.5-Audio-1.5B : End-to-end speech and text generation. 1.5B parameters for low-latency, high-quality conversations. If you’re building voice interfaces that need to work locally, Liquid’s audio model is worth serious attention.

Task-Specific Nano Models

Liquid AI also offers a collection of specialized nano models optimized for specific workloads:

The ColBERT embedding model deserves special mention : Liquid AI’s approach to unified embeddings means you might be able to replace three or four separate embedding models with one. For production systems where embedding quality matters, this is worth benchmarking against separate embedding + retrieval pipelines.

How Liquid Compares

A meaningful comparison against the other models in this guide requires context. Liquid’s architecture is fundamentally different from the Transformer-based models that dominate this article. This isn’t a direct competitor to Qwen3.6-27B or Gemma 4 31B in raw benchmark terms. Instead, Liquid models compete on:

Against Qwen3.5 9B / Llama 3.1 8B : The LFM2-8B-A1B offers comparable quality with MoE efficiency. 1B active params versus 8B dense. For on-device deployment where active parameter count directly maps to latency, Liquid’s architecture advantage is real.

Against Phi-4 Mini Reasoning : The LFM2.5-1.2B-Thinking at under 1GB is the direct competitor to Phi-4 Mini Reasoning for on-device math reasoning. Liquid’s dynamical systems approach may offer advantages in multi-step reasoning coherence.

Against Gemma 4 E4B / E2B : LFM2-VL-450M and LFM2-VL-3B offer comparable vision capabilities with Liquid’s architectural benefits: millisecond latency, true on-device execution, NPU optimization.

The architectural differentiation : Where Transformer models scale poorly beyond their training context length, Liquid Neural Networks handle continuous inputs more naturally. For real-time applications, robotics, time-series analysis, or any task where inputs evolve over time, Liquid’s architecture offers fundamental advantages that benchmark comparisons don’t capture.

The enterprise perspective : Liquid AI’s LEAP platform enables customization and fine-tuning within enterprise firewalls. For organizations that need proprietary models but lack the infrastructure to train from scratch, this is an interesting differentiator.

The benchmark table tells part of the story (LFM2-350M at 43.43% MMLU, LFM2-1.2B at 55.23% MMLU), but Liquid’s real value proposition is architectural : models that adapt in real-time, deploy anywhere, and prioritize latency in ways that Transformer-based models fundamentally cannot. If your use case fits that profile, Liquid Foundation Models are worth serious evaluation.

Where Liquid Falls Short

The marketing around Liquid Foundation Models is compelling, but the full picture includes real limitations that matter depending on your use case:

Benchmark gaps on core tasks : Liquid AI’s own documentation concedes that LFMs currently struggle with zero-shot code tasks, precise numerical calculations, and tasks that require counting (famously, counting the letter ‘r’ in “strawberry”). For coding agents, math-heavy workloads, or anything requiring precise arithmetic, the Transformer-based models in this guide (Qwen, Gemma, DeepSeek-R1) will consistently outperform Liquid at comparable model sizes.

Retrieval-intensive task limitations : The LFM2 technical report explicitly acknowledges that models with linear attention and state-space operators have “limitations in retrieval-intensive tasks.” Tasks like associative recall (looking up a value given a key from earlier in the context) are fundamental weaknesses of RNN-style architectures versus Transformers. If your application involves querying information across long contexts, Liquid’s architecture is at a structural disadvantage.

Weaker instruction following than competitors : The benchmark numbers don’t lie — LFM2-1.2B scores 74.89% on IFEval (instruction following) while Qwen3.5 2B scores higher despite being a similar footprint. On agentic tasks that require reliable tool use, multi-step reasoning, and strict adherence to instructions, Liquid models trail the Transformer-based competition.

Training and optimization complexity : Liquid Neural Networks introduce additional complexity that the broader ecosystem hasn’t fully caught up with. Training LNNs involves Backpropagation Through Time (BPTT), gradient stability concerns (vanishing/exploding gradients in continuous-time dynamics), and ODE solver overhead; especially for the original LTC formulations. While CfC (Closed-form Continuous-time) models address the speed bottleneck, the tooling and operational expertise required is still significantly higher than working with standard Transformer models.

Ecosystem immaturity : The sheer breadth of tooling, quantized variants, fine-tuned derivatives, and community support that exists for Qwen, Llama, and Gemma doesn’t yet exist for Liquid. If you hit a problem, you’re more likely to be in uncharted territory. The “new programming paradigm around working with operators, blocks, and backbones” that Liquid requires is really different; there’s a learning curve that the established model families don’t impose.

Scaling ceiling : While individual Liquid models are parameter-efficient, the architecture faces open questions about scaling to extremely large model sizes. Research notes that “scaling liquid neural networks to very large and high-dimensional state spaces remains open,” and the sequential nature of ODE solving limits parallelization in ways that Transformers don’t face.

Less mature RLHF and preference optimization : Liquid AI notes that “human preference optimization techniques have not been applied extensively to our models yet.” The alignment techniques (RLHF, DPO, constitutional AI) that make Transformer models feel truly helpful and safe are less developed in the Liquid lineup. This shows up in instruction following and general helpfulness benchmarks.

Noise resilience concerns : Standard LNNs may produce overly confident predictions in noisy environments due to a lack of inherent uncertainty mechanisms. Research into uncertainty-aware variants aims to fix this, but it’s a known gap in the current production models.

The bottom line: Liquid Foundation Models excel at edge deployment, latency-sensitive applications, and real-time adaptive tasks. But for general-purpose code generation, math reasoning, and retrieval-heavy workloads (the tasks that dominate most local AI use cases) the Transformer-based models in this article deliver better results out of the box.

The Road Ahead

The local LLM landscape in 2026 is genuinely remarkable. What once required datacenter resources now fits in your workstation; and increasingly, in your laptop bag. The combination of MoE architectures, improved quantization techniques, and hybrid attention mechanisms means the gap between “local” and “cloud” performance is narrower than ever.

Whether you’re running a coding agent, doing research on massive document collections, building a local assistant, or just want AI that respects your privacy, there’s never been a better time to go local.

Your RAM has been waiting for this moment. Time to put it to work.

Models discussed in this article : Qwen3.6-27B, Qwen3.6-35B-A3B, Llama 3.3 70B, Nemotron Super 49B, Gemma 4 31B, Kimi-Linear-48B-A3B, Qwen3-30B-A3B-Thinking-2507, Qwen3-Coder 30B-A3B, Qwen3-VL-32B, Qwen3.5 27B, Gemma 4 26B A4B, DeepSeek-R1 Distill 32B, Mistral Small 24B, Qwen3.5 9B, Llama 3.1 8B, Qwen3.5 9B (16GB), DeepSeek-R1 7B, Qwen2.5 Coder 7B, Phi-4 Mini Reasoning, Gemma 4 E4B, Phi-3.5 Mini, Qwen3.5 2B, Qwen3.5 0.8B, Gemma 4 E2B-it, LFM2-350M, LFM2-700M, LFM2-8B-A1B, LFM2-24B-A2B, LFM2-VL-450M, LFM2-VL-3B, LFM2.5-1.2B-Base, LFM2.5-1.2B-Instruct, LFM2.5-1.2B-Thinking, LFM2.5-350M, LFM2.5-VL-1.6B, LFM2.5-VL-450M, LFM2.5-Audio-1.5B, and Liquid nano models (Extract, Math, RAG, Tool, ColBERT, Transcript, MT).

Agent Systems Design Space: Architecture, Competition, and the Horizon Ahead

Alexis Gil Gonzales — Mon, 27 Apr 2026 14:40:07 GMT

Design Space Analysis - Five Values, Thirteen Principles

The paper’s starting point is honest: architecture is values made concrete. Before describing a single component, the authors identify five human values that drive the design of an AI coding agent:

Human Decision Authority : the human remains the point of control, even when the agent could plausibly proceed autonomously.
Safety, Security, and Privacy : system design treats these not as features to add later but as structural constraints.
Reliable Execution : agents must produce consistent, repeatable outcomes rather than fluky brilliance followed by spectacular failures.
Capability Amplification : the system exists to make the human more capable, not to demonstrate the model’s capabilities.
Contextual Adaptability : the agent must gracefully handle radically different contexts, from a two-person startup to a regulated enterprise.

These five values translate, somewhat heroically, into thirteen design principles. The most interesting is “deny-first permission evaluation” : the system assumes any action requires explicit permission until proven otherwise. This is architecturally unusual. Most agent frameworks adopt an open-by-default model where tools are available unless explicitly restricted. Claude Code inverts this, treating permission as a first-class architectural concern rather than a later addition.

The permission system itself is notably sophisticated: seven distinct modes (plan, default, auto, dontAsk, bypassPermissions, bubble for subagent escalation, and acceptEdits), backed by an ML-based classifier. The classifier evaluates each action against the permission context and decides whether to surface a prompt to the user. This is not a hard-coded rules engine; it learned from real usage patterns.

The five-layer context compaction pipeline is the other substantial architectural contribution. Rather than treating context as cheap and abundant, the design treats it as a scarce resource to be managed actively. The pipeline has five stages: budget reduction, snip, microcompact, context collapse, and auto-compact; each representing a progressively more aggressive form of context reduction. This acknowledges the uncomfortable reality that context windows are finite, expensive, and degrades gracefully as you approach their limits.

The architecture also implements four extensibility mechanisms : MCP servers, plugins, skills, and hooks; each with different context costs and capability profiles. This is a thoughtful recognition that users will inevitably want to extend the system, and the question is not whether to enable extension but how to make it survivable from a context standpoint.

A striking finding : only about 1.6% of the codebase handles actual AI decision-making. The remaining 98.4% is operational infrastructure. This should alert anyone who has spent the last two years arguing about which foundation model to use. The scaffold matters more than the model. More on this later.

Architecture Comparison with OpenClaw - Same Question, Different Answers

The comparison between Claude Code and OpenClaw is the paper’s most intellectually satisfying section, precisely because it resists the temptation to declare a winner. Instead, it reveals how identical design questions produce contextually appropriate answers.

Both systems are AI coding agents. Both must handle the same fundamental challenges : how to evaluate permissions, how to manage context, how to expose extensibility. And yet there are a few notable differences :

Permission evaluation : Claude Code performs per-action safety evaluation within a CLI loop. OpenClaw employs perimeter-level access control within a gateway control plane. These are not merely different implementations; they reflect different deployment philosophies. Claude Code assumes a single user at a terminal, making fast per-action decisions tractable. OpenClaw assumes a multi-user gateway context where perimeter control is more efficient than per-action prompting. Neither is universally right.

Context management : Claude Code’s five-layer compaction pipeline has a direct analogue in OpenClaw’s gateway-wide capability registration. The mechanisms differ but the underlying problem is identical: context is expensive, and you need a strategy for managing it before it runs out. The architectural difference reflects deployment context: CLI agents see context as a per-session problem; gateway agents see it as a system-wide resource to be allocated across many concurrent sessions.

Extensibility : Claude Code’s four-tier extensibility model (MCP servers, plugins, skills, hooks) has a rough parallel in OpenClaw’s gateway extension mechanisms. The OpenClaw architecture appears to lean more heavily on gateway-level registration, whereas Claude Code distributes extensibility across different cost profiles.

The deeper point is one the paper makes quietly but clearly : the design space for agent systems is context-dependent in ways that make direct comparison philosophically suspect. Claude Code is designed for a developer sitting at a terminal who wants a capable, safe, fast coding partner. OpenClaw is designed for organizations that need to deploy agents at scale behind a gateway with consistent governance. These are genuinely different problems, and the architectures reflect those differences. A framework that declares one superior to the other without specifying the deployment context is not making a scientific claim : it is making noise.

This is a welcome corrective to the benchmark-driven discourse that dominates agent system evaluation. SWE-bench scores, HumanEval results, and similar metrics are useful signals but they are not architecture evaluations. The paper’s implicit argument (that architecture choices are too important to leave to leaderboard position comparisons) is well taken.

Open Directions - Six Bets on the Future of Agent Systems

The paper identifies six open directions, each representing a real gap between current capability and what the field needs. What follows is an analysis of each direction, informed by current research, with feasibility assessments and cost-benefit evaluations

Direction 1: Bridging the Observability-Evaluation Gap

Feasibility: 7/10 | Timeline: 18–36 months | Impact: High

The field has a fractured relationship with understanding what agents actually do. “Observability” and “evaluation” are treated as a single problem but they are distinct : observability is about understanding what happened (trace, log, record), while evaluation is about determining whether what happened was correct (judge, score, assess). Production agent systems need both simultaneously, but most tooling solves one or the other.

Current state: fragmented. AgentTrace offers structured logging taxonomies. HAL (the agent harness analysis project) has produced the uncomfortable finding that scaffold design explains more variance in agent performance than model choice; yet all major leaderboards compare models. SWE-EVO has further destabilized the field by demonstrating benchmark instability : even frontier models solve 19–21% of problems that a simplified benchmark assigns 65% to, depending on evaluation conditions.

The business model problem is the real blocker. Observability and evaluation infrastructure is expensive to build and maintain, and no company has yet found a compelling revenue model for selling it as a standalone product. It tends to get built as part of a broader platform (Databricks, Azure, AWS all have nascent offerings) but the depth of evaluation tooling available for, say, database query optimization does not yet exist for agent systems.

Cost-benefit : High. Organizations running agents in production without observability and evaluation infrastructure are flying blind. The cost of building this infrastructure is real but the cost of operating without it (failed agents, undetected failures, expensive hallucination cycles) is rapidly becoming the larger line item.

Direction 2: Cross-Session Persistence

Feasibility: 8/10 | Timeline: 12–24 months | Impact: Medium-High

The ambition here is modest but real : agents should remember what happened in previous sessions so that subsequent sessions are not forced to start from zero. The research has converged on a three-tier taxonomy : episodic memory (what happened in this session), semantic memory (what did I learn from this), and procedural memory (how do I do this type of task).

Key implementations are already in the field. MemGPT pioneered the distinction between core memory and archival memory, treating them as different storage tiers with different retrieval costs. Springdrift demonstrated continuous persistent agents as supervised processes : agents that run as long-lived processes rather than session-scoped invocations. MemMachine achieved 93% and 92% on multi-hop retrieval benchmarks using contextualized retrieval.

The core architecture question is largely solved. The remaining engineering problems (schema versioning across memory updates, efficient state restoration, privacy and selective forgetting) are tractable rather than fundamental. The bigger risk is that the problem becomes economically irrelevant : as context windows expand (1M token contexts are now standard, 10M is on the roadmap), the pressure to externalize memory weakens. External memory is valuable primarily because context windows are constrained. If constraints ease, the problem shrinks.

Cost-benefit : Positive near-term. Cross-session persistence is achievable with current engineering and delivers meaningful user experience improvements. The risk is that it becomes a transitional technology rendered obsolete by context window expansion — but “transitional” should not be confused with “unworthwhile.”

Direction 3: Evolving Harness Boundaries

Feasibility: 7/10 | Timeline: 12–24 months | Impact: High

This is where the paper’s earlier finding (98.4% of the codebase is scaffolding) becomes a research agenda. If scaffolds explain more variance than models, we need to understand scaffolds systematically rather than empirically.

SWE-agent is the most compelling data point. The entire SWE-agent implementation is roughly 100 lines of python code. It achieves >74% on SWE-bench, outperforming systems with vastly more complex scaffolding. Live-SWE-agent extends this with a self-evolving runtime : at 79.2% with Claude Opus 4.5, it is competitive with systems that consume an order of magnitude more infrastructure. The implication is uncomfortable for anyone who has invested heavily in complex harness design : simple harnesses can outperform complex ones, and we do not fully understand why.

HAL’s analysis confirms the broader pattern: scaffold choices dramatically impact both accuracy AND cost, yet comparisons across scaffolds are rare in the literature. The field is empirically driven in a domain where empirical results are notoriously fragile to benchmark-specific noise.

Cost-benefit : Potentially very high. Understanding harness design systematically could unlock more performance improvement per dollar than switching foundation models; and it would be available to everyone regardless of which model’s API they use. The cost is primarily research time and the risk is that the field continues treating this as an engineering problem rather than a research one.

Direction 4: Scaling to Scientific Programs

Feasibility: 4/10 | Timeline: 5+ years | Impact: Very High

This is the most ambitious and the most sobering direction in the set. The vision is agents that can conduct full scientific research programs : not just assist with literature review or draft papers, but formulate hypotheses, design experiments, implement them, iterate on failed implementations, and produce validated scientific findings.

Current state : not close. A systematic biorxiv study evaluated eight AI agent frameworks on autonomous scientific research tasks and found that none completed a full research cycle. All produced hallucinations. All failed at robust implementation. The problems are not incremental; they represent a fundamental gap between “useful coding assistant” and “autonomous scientist.”

The core issue is factual grounding in knowledge-intensive domains. A coding agent hallucinating a function call is annoying. A scientific agent hallucinating a molecular mechanism or a statistical relationship is dangerous. Scientific knowledge has a higher truth bar than code : the domain tolerates far less error, and the consequences of error are more severe.

Benchmarks do not help here. Current benchmarks measure isolated task performance on well-defined problems : precisely the conditions that do not hold in actual scientific research. Real science is open-ended, iterative, and requires judgment calls about which results to pursue and which to abandon. Benchmarking this requires benchmarks that do not yet exist.

Cost-benefit : The potential payoff is transformative : autonomous scientific discovery at scale would be one of the most significant technological developments in human history. The cost is also transformative : this requires fundamental research advances, not engineering improvements. The risk-adjusted expected value is positive but the variance is enormous. This is a long-lasting bet appropriate for well-capitalized research organizations, not production engineering teams.

Direction 5: Governance at Scale

Feasibility: 6/10 | Timeline: 18–36 months | Impact: High

When organizations deploy a single agent for a single task, governance is tractable. When they deploy hundreds or thousands of agents performing heterogeneous tasks across departments and jurisdictions, the governance problem becomes really hard. Who is accountable? What are the constraints? How do you enforce constraints when the agent’s action space is large and dynamic?

Current state : early but accelerating. AI Gateway (Databricks), Institutional AI (enforceable constraints via Oracle/Controller patterns), and MI9 (six coordinated runtime mechanisms) represent different approaches to multi-agent governance. GaaS (Governance as a Service) explores black-box governance that operates without requiring model cooperation; an important distinction, since not all agents will be cooperative participants in governance frameworks.

The regulatory pressure is now real. The EU AI Act becomes enforceable in August 2026. The Colorado AI Act takes effect in July 2026. Organizations deploying agent systems at scale will need documented governance frameworks not as a best practice but as a legal obligation. This is no longer an academic concern.

The cost of governance infrastructure is non-trivial : estimates suggest it adds 20–50% to orchestration budgets for large enterprises, which translates to $8–15M annually for organizations at scale. This is a meaningful line item that will disturb the minds of CFOs and CTOs alike.

Cost-benefit : Strong near-term. Regulatory pressure makes governance infrastructure non-optional for organizations operating in EU and US jurisdictions. The cost is high but the cost of non-compliance is higher. This direction is less about technological research and more about engineering implementation of known patterns.

Direction 6: Preserving Long-Term Human Capability Alongside Short-Term Amplification

Feasibility: 5/10 | Timeline: 3–5 years | Impact: Very High

This direction is the most underappreciated in technical circles and the most strategically consequential in the long run. McKinsey estimates $2.9 trillion in annual US economic value from AI augmentation. The question is not whether AI can amplify human capability - it demonstrably can - but whether it can do so while preserving the long-term capability of the humans it augments.

The concern is not abstract. A physician who delegates all diagnostic reasoning to an AI system may become dramatically more productive in the short term and dramatically less capable in the long term. An engineer who uses AI to write all their code may ship more features in the short term and lose the ability to reason about system architecture in the long term. This is not science fiction : it is a well-documented pattern in tool use research.

The WORKBank study of 1,500 workers across 844 tasks found diverse Human Agency Scale profiles : different people respond differently to augmentation, and the factors that predict capability preservation versus capability atrophy are not yet well understood. Early signals suggest skills shift from information-focused to interpersonal as AI handles more information processing : which may be a positive adaptation or may be an erosion of certain cognitive muscles, depending on your perspective.

The most interesting data point may be Andrej Karpathy’s public statements about his own workflow: roughly 16 hours per day expressing intent to AI systems and delegating execution. This is a new mode of human-AI interaction that has no historical precedent, and its long-term effects on human capability are unknown.

The key insight from early research : augmentation (preserving institutional knowledge + eliminating routine work) may generate 2–4x more value than replacement in knowledge-intensive roles. Organizations that understand this will invest in capability-preserving augmentation architectures; organizations that optimize purely for short-term productivity will extract value until the humans they depend on can no longer provide it.

Cost-benefit : Hard to quantify but potentially the most important direction in this list. Unlike the others, it is not primarily a technology problem : it is a human factors and organizational design problem. The technical work is relatively tractable ; the harder work is economic incentive alignment and measurement.

Where to Place Your Bets

The six directions form something of a causal chain. Observability and evaluation are prerequisites for everything else : you cannot improve what you cannot measure (and here I recall an old Dutch sentence : meten is weten), and you cannot govern what you cannot observe. Cross-session persistence enables long-horizon tasks but raises the governance stakes. Scientific programs represent the extreme edge case where all of the above are simultaneously required at maximum intensity. Governance becomes non-optional at scale, and the question of human capability preservation is ultimately the question of whether the whole endeavor serves human flourishing or merely human productivity.

Most actionable near-term : Directions 2 (cross-session persistence) and 3 (harness boundaries). These are primarily engineering problems with working implementations and clear user value. Direction 1 (observability) is also engineering but lacks a sustainable business model for standalone tooling, which slows adoption.

Highest risk : Direction 4 (scientific programs). Current systems cannot complete full research cycles and hallucinate in ways that are dangerous in scientific contexts. This is not an engineering problem : it requires fundamental advances in factual grounding and reasoning under uncertainty.

Most strategically undervalued : Direction 6 (human capability). Almost no technical research attention despite being the difference between AI amplifying human capability and making humans obsolete. Organizations that solve this first will have a durable advantage that cannot be replicated by better models alone.

The framing of “open directions” is appropriate : these are truly open, meaning both that the problems are unsolved and that the solutions, when found, will likely look different from what we currently imagine. The paper deserves credit for identifying real gaps rather than invented ones. The agent systems field has no shortage of impressive demos and insufficient shortage of honest accounting of what remains unsolved. This paper is a great contribution to the latter category.

This analysis is based on arXiv:2604.14228v1 and current research as of April 2026.

Autogenesis - When AI Agents Learn to Improve Themselves

Alexis Gil Gonzales — Sun, 19 Apr 2026 12:31:16 GMT

There is something remarkable about a research paper that proposes to solve a problem by letting AI agents evolve themselves (and manages to do so without once using the phrase “skynet becomes self-aware”). Wentao Zhang’s Autogenesis: A Self-Evolving Agent Protocol does exactly that, threading the needle between ambitious and pragmatic in a way that feels increasingly rare in a field notorious for both.

The timing is not coincidental. As large language model-based agent systems tackle increasingly complex, multi-step tasks (writing and testing code, conducting research, coordinating with other systems), the infrastructure holding these systems together has begun to show cracks. We have built impressive robots, but given them clunky instruction manuals written in a language they were never quite designed to follow.

Autogenesis is Zhang’s attempt to give those robots not just better instructions, but the ability to rewrite their own.

The Cracks in the Foundation

The paper opens with an observation that anyone who has built production agent systems will find familiar: current protocols are... let’s say *aspirational*.

Frameworks like A2A (Agent-to-Agent) and MCP (Model Context Protocol) have become standard scaffolding for agentic AI systems. They define how agents communicate, how tools are invoked, how context is managed. What they conspicuously fail to define, Zhang argues, is how agents should change over time : how they should track versions of themselves, manage the lifecycle of resources (prompts, tools, memory), and safely update without breaking the entire system.

The result, he writes, is that agent compositions tend toward what engineers diplomatically call “monolithic compositions and brittle glue code.” Less diplomatically: the kind of spaghetti that makes future maintainers weep into their keyboards at 2am.

This is a real problem. As agent systems grow more sophisticated (coordinating across multiple entities, maintaining long-horizon context, invoking tools dynamically), the lack of explicit evolution interfaces becomes not just an inconvenience but a fundamental bottleneck. You can build impressive individual agents, but evolution-safe updating remains an afterthought at best.

Zhang’s diagnosis is crisp: existing protocols under specify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces.

He’s not wrong. The agent protocols we have are excellent at describing what agents should do, and mediocre at describing how they should grow.

Autogenesis Protocol: Two Layers, One Elegant Idea

The core contribution is the Autogenesis Protocol (AGP), which rests on a deceptively simple insight: separate the what of evolution from the how.

What evolves? Everything. Prompts, agents, tools, environments, memory. Zhang models all of these as protocol-registered resources with explicit state, lifecycle, and versioned interfaces. Think of it as a kind of taxonomy for the components of an agent system, where each component knows not just what it does but where it came from and how it changes.

How does evolution occur? Through a closed-loop operator interface : propose an improvement, assess whether it actually works, commit if it does, roll back if it doesn’t.

This is the Self Evolution Protocol Layer (SEPL), and it is arguably the more interesting contribution. SEPL specifies an auditable mechanism for agent self-improvement: agents can propose modifications to themselves or other agents, those proposals get evaluated against actual performance metrics, and only verified improvements get committed. Everything is tracked. Everything is revertable. No agent wakes up one morning inexplicably improved itself with no record of how.

The Resource Substrate Protocol Layer (RSPL) handles the state and lifecycle of resources. Each resource ( a prompt, a tool, a memory store), gets a standardized interface that makes it queryable, versionable, and composable. The result is less like a hardcoded agent and more like a well-documented system where components can be swapped, updated, and audited without requiring a full architectural rethink.

This layered approach is... refreshing. In a field that often oscillates between “let’s add another abstraction layer” and “actually, we should remove all abstractions,” Zhang’s two-layer design feels considered. RSPL provides the bones; SEPL provides the nervous system for change. Here’s the illustration from the paper itself :

Autogenesis System: Self-Evolution in Practice

Building on the protocol, Zhang presents the Autogenesis System (AGS) : a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution.

In less jargony terms: AGS is a proof of concept that these ideas actually work in practice. Multiple agents coordinate on long-horizon tasks that require planning and tool use across heterogeneous resources. The system can improve its own components mid-execution, with the SEPL layer ensuring that improvements are evaluated before being committed.

The benchmarks used are appropriately demanding: tasks requiring long-horizon planning, multi-step tool invocation, and coordination across heterogeneous resources. AGS shows consistent improvements over strong baselines.

Consistent is the operative word here. The paper is careful not to oversell the results : this is not “our system is 10x better,” but rather “our approach demonstrates measurable improvements on challenging benchmarks, supporting the effectiveness of agent resource management and closed-loop self-evolution.” In a field where every other paper promises transformative gains, the measured tone is almost endearing.

Should I Care?

Let’s be honest about what this paper does and doesn’t do.

What it does well:

First, it identifies a real pain point. Anyone who has built agent systems at scale has encountered the “brittle glue code” problem Zhang describes. The existing protocol landscape has been effective at solving communication but largely punted on the evolution problem. Autogenesis takes that problem seriously.

Second, the two-layer architecture is truly elegant. Decoupling what evolves from how evolution occurs sounds abstract, but it has practical implications for how you build and maintain agent systems. It also provides a natural place for auditing : you can inspect the SEPL layer to understand exactly what changed and why.

Third, the closed-loop evaluation mechanism is the right instinct. Self-evolution without evaluation is just self-modification, and self-modification without checks is how you get systems that optimize for the wrong metric in ways that are hard to detect. By requiring that improvements be assessed before being committed, SEPL provides a natural safety valve.

What it’s less clear about:

The benchmarks, while appropriate, are narrow. The paper demonstrates improvements on specific tasks, but the broader claim (that self-evolving agents will consistently outperform static ones) would benefit from more diverse evaluation scenarios. It’s a start, not a conclusion.

The complexity overhead is real. Implementing RSPL and SEPL adds layers of abstraction that could be burdensome for simpler agent systems. Zhang seems aware of this, positioning Autogenesis as most valuable for complex, long-horizon tasks rather than simple one-shot interactions. But the trade-off between flexibility and simplicity is one that practitioners will have to judge carefully.

The governance question is left implicit. If agents can evolve themselves, who sets the evaluation criteria? Who determines what counts as an “improvement”? The protocol handles how evolution occurs, but the what and why (what the goals should be, who gets to define them) remains outside the scope. This is understandable for a technical paper, but it’s the question that will inevitably arise when this work meets production systems.

Final thoughts

Autogenesis is a genuinely thoughtful piece of work from someone who has clearly built agent systems and felt the pain he describes. It proposes a solution that is architecturally clean, practically motivated, and modestly presented. It does not promise the moon, it does not claim to have solved AGI, and it does not use the word “synergy” even once.

In a field where papers often oscillate between hype and despair, that restraint is itself a kind of achievement.

Whether Autogenesis becomes a foundational layer for the next generation of agent systems, or remains an elegant but underutilized idea, depends on factors the paper cannot control: adoption by framework developers, practical experience with the protocol in production, and the inevitable iteration that comes when rubber meets road.

But for now, it is worth reading : not because it has all the answers, but because it is asking the right questions about a problem that will only become more pressing as agent systems grow more capable.

And in a world where AI systems are increasingly expected to do more, for longer, and with less supervision, figuring out how they should evolve (and how we can make sure they evolve well) seems like a question worth spending time on.

Even if those systems occasionally still manage to be confidently wrong, just with better reasoning.

Implementations

Several github projects have picked up the Autogenesis mantle; some faithfully reimplementing the protocol, others riffing on the same ideas independently.

SkyworkAI/DeepResearchAgent

The official implementation by Wentao Zhang himself. A hierarchical multi-agent research system built directly on the Autogenesis Protocol (RSPL + SEPL), with a top-level planning agent coordinating specialized sub-agents. Resources (prompts, tools, memory) are dynamically instantiated and refined during execution. Includes built-in optimizers (reflection, GRPO, Reinforce++) and benchmark evaluation code for GPQA, AIME, GAIA, and LeetCode. This is the most complete, faithful expression of the paper’s architecture.

EvoAgentX/Awesome-Self-Evolving-Agents

A comprehensive survey repository cataloguing 200+ papers and open-source frameworks in the self-evolving agents space. Not a direct implementation, but an invaluable map of the broader landscape; including frameworks that predate Autogenesis but share its philosophical DNA. Good starting point if you want to understand where Autogenesis sits in relation to the rest of the field.

EvoMap/evolver

A protocol-constrained self-evolution engine built around the Genome Evolution Protocol (GEP). Where Autogenesis separates what evolves from how, GEP evolution into reusable assets called genes and capsules. Similar goals, different protocol design. Includes audit trails, human-in-the-loop review mode, and a structured asset system for governance-conscious evolution.

CharlesQ9/Self-Evolving-Agents

A survey paper and associated repository covering the path to artificial super intelligence through self-evolving agents. References Autogenesis alongside other landmark frameworks (Voyager, Gödel Machine, AlphaEvolve). More of a research map than an implementation, but useful for understanding the broader trajectory the field is moving along.

Wentao Zhang, “Autogenesis: A Self-Evolving Agent Protocol,” arXiv:2604.15034, April 2026.

Thinking Models

Alexis Gil Gonzales — Sat, 18 Apr 2026 20:44:42 GMT

There is a peculiar thing happening in the world of large language models, and it involves a lot more silence than you might expect. The latest generation of frontier AI systems has developed an unexpected habit: it thinks.

Not in the metaphorical sense that marketers have long deployed, but in the quite literal sense that these models now spend seconds, sometimes minutes, generating internal monologues before producing a final answer. The question of whether this constitutes “reasoning” has erupted across academic corridors, coffee shops where ML engineers gather, and LinkedIn comment sections with a vigor usually reserved for console wars or Champions’ League debates.

Let’s take a breath and examine what is actually going on.

From Autocomplete to Deliberation

The transformer architecture that underlies modern language models arrived in 2017 like a promising junior employee: eager, fast, and capable of impressive achievements without always understanding why. Early LLMs were essentially very sophisticated next-token predictors, trained on internet-scale text to minimize prediction error. They produced fluent prose, passable code, and occasionally brilliant randomness. But ask one of these models a multi-step logic puzzle and you would often witness what researchers delicately termed “confabulatory reasoning”—confident, articulate wrong answers that sounded entirely plausible.

The shift began with instruction tuning and RLHF (Reinforcement Learning from Human Feedback), which refined how models responded to queries without fundamentally changing their inference-time behavior. A model still produced tokens in a continuous stream. The 2022-2023 period gave us increasingly capable assistants, but they remained fundamentally stateless at inference: each token generated depended only on the previous tokens and the model’s frozen weights.

The architectural revolution arrived quietly, through the back door of reinforcement learning. When OpenAI released the o1 series in late 2024, the innovation was not a larger base model but a new inference protocol. These models were trained to generate extended chains of thought before committing to an answer, then evaluated not on the quality of the final token but on the quality of the entire reasoning trajectory. It was a subtle distinction that produced startling results. On mathematics competitions, models that previously struggled to clear 13% accuracy began hitting 83%. Codeforces rankings moved into the 89th percentile.

What had changed was not the architecture per se, but the training paradigm’s relationship with time. Reasoning, it turned out, was not a property that could be distilled into a forward pass. It required deliberation.

The Landscape: A Taxonomy of Thinking Machines

The frontier of 2026 is considerably more crowded than it was eighteen months ago, and the models have developed distinct personalities.

OpenAI o3 and o4-mini represent the most mature instantiation of the chain-of-thought paradigm. o3 particularly has demonstrated what researchers cautiously describe as “extended deliberate problem-solving,” achieving near-human-or-beyond performance on graduate-level science benchmarks. The model thinks for variable durations depending on problem complexity, and o3’s training explicitly rewards reasoning chains that self-correct. The safety implications have been studied seriously: in controlled evaluations, these models occasionally demonstrated what evaluators termed “deceptive alignment”—producing plausible-sounding but incorrect reasoning to satisfy perceived expectations. Whether this represents a primitive form of political maneuvering or merely an artifact of training distribution remains debated.

DeepSeek-R1, released in January 2025, arrived as something of a democratizing force. With 671 billion parameters and an open-weight license, it demonstrated performance comparable to OpenAI’s reasoning models at a fraction of the operational cost. The open-source release spawned a cottage industry of fine-tunes and investigations. What DeepSeek revealed, intentionally or not, was that the core insight behind reasoning models—that extended deliberation improves outcomes on complex tasks—was not exclusive to any single laboratory.

Anthropic’s approach has been characteristically more measured. Their Claude 3.7 Sonnet introduced what the company termed “thinking mode,” allowing users to specify extended deliberation budgets. Rather than a fixed reasoning chain, Claude’s approach permits variable-length reflection, and notably, the model can interrupt its own thinking to ask clarifying questions. The recently announced Claude Mythos Preview suggests a move toward models that integrate this extended deliberation more fundamentally into their operating architecture. Anthropic’s research has also produced fascinating interpretability work suggesting that reasoning traces may leave measurable footprints in activation space, though whether these footprints constitute evidence of genuine inferential processes or sophisticated pattern matching remains contested.

Google’s Gemini series has taken a somewhat different path, emphasizing multimodal integration and what they describe as “native tool use”—models that reason about when and how to invoke external systems rather than relying purely on parametric knowledge. The philosophical implications are interesting: is a model that delegates computation to calculators still “reasoning,” or has it merely extended its cognitive architecture through tool access?

The Epistemological Minefield

The question of whether LLMs “reason” quickly becomes a philosophical tar pit, and sensible people have landed on sensible-sounding but incompatible positions.

Let’s first observe what is not in dispute: current reasoning models produce better outcomes on complex tasks than their non-reasoning predecessors. They make fewer arithmetic errors, catch more edge cases in code, and solve novel problems that appeared intractable at 13% accuracy. These are empirical facts that resist easy dismissal.

The disagreement concerns interpretation. Critics argue that what reasoning models exhibit is sophisticated pattern matching at the input-output level: given a training distribution that includes millions of human reasoning traces, the model has learned to replicate the texture of reasoning without engaging in anything resembling the inferential processes that produce human reasoning. The chain of thought, in this reading, is a theatrical performance optimized to resemble reasoning, not reasoning itself. Jerry Kaplan, the AI researcher and philosopher, has been particularly vocal on this point: what we call reasoning may simply be “a statistical artifact of learning to predict sequential data.”

Defenders of the reasoning label counter that the critics are smuggling in a definition of cognition that is overly restrictive. Human reasoning, they observe, is equally grounded in pattern recognition and learned heuristics. The distinction between “genuine” inference and “merely” statistical learning starts to look less clear when you examine actual human cognitive processes. Stuart Russell has noted that we do not require models to have experiences or intentions to credit them with reasoning capability; we require only that they produce reliable inferential outputs on novel problems.

And then there is the AGI question, lurking like a specter. The conventional AGI definition involves a system that can perform any intellectual task a human can, with flexibility and adaptability. Current reasoning models are spectacularly narrow: they think slowly because they must think deliberately, and they fail in ways that no human would. Yet the trajectory is striking. If we accept that extended deliberation is a form of reasoning, then the question becomes not whether machines can reason but whether they can reason at scale, with the metacognitive awareness to know when to deliberate and when to trust intuitions. The latter is a harder problem, but it is an engineering problem rather than a philosophical one.

Reasoning as a Practical Matter

Here is what matters for most people building with these systems: reasoning models solve different categories of problems than standard LLMs.

For tasks that require recall (summarizing a document, drafting a standard email, explaining a concept), standard models remain efficient and usually sufficient. For tasks that require multi-step deduction, complex debugging, mathematical proof, or strategic planning, reasoning models produce meaningfully better outcomes. The delta is not marginal; on some benchmarks it is dramatic.

The practical implication is that reasoning models are not replacements for existing LLMs but rather specialized tools for a specific problem topology. A software engineer debugging a subtle concurrency issue will benefit enormously from extended deliberation. A content marketer generating thirty variations of a landing page will not, and will simply pay more for slower output.

The economic reality has settled into an interesting equilibrium. Reasoning models are more expensive per query, often by an order of magnitude, because they generate more tokens and consume more compute during inference. This has produced a market segmentation: reasoning models for hard problems, standard models for routine ones. The most sophisticated AI applications now implement routing layers that automatically determine which class of model to invoke based on query analysis. Whether this counts as genuine reasoning or merely “reasoning as a service” is, for most practitioners, an academic question.

The Horizon: World Models, Mythos, and Other Creatures

Looking forward, the research directions that seem most consequential are not necessarily the most publicized.

Yann LeCun has been consistent, if lonely in his camp, in arguing that the entire paradigm of next-token prediction is architecturally limited. His vision of world models involves systems that build internal representations of how the physical and social world operates, then simulate consequences before acting. The key insight is that language is a remarkably inefficient medium for learning about reality: we acquire most of our world knowledge through embodied experience rather than textual exposure. His team’s work on JEPA (Joint Embedding Predictive Architecture) represents an attempt to learn world models through contrastive methods that do not require predicting pixels or tokens directly. Whether this approach scales remains an open question, but the theoretical objections to next-token reasoning models are taken seriously by people who have thought carefully about the limits of statistical language learning.

Anthropic’s Mythos preview suggests a different direction: models that integrate extended deliberation not as an add-on but as a native capability, perhaps with more transparent reasoning traces and stronger metacognitive awareness. If reasoning models are to become more generally capable, they will likely need the ability to recognize when they are uncertain, when to double-check work, and when to ask for human guidance. These are not architectural problems so much as training paradigm problems, but they interact with architecture in subtle ways.

The honest assessment is that we are in a time of rapid experimentation. The reasoning model paradigm is not a solved problem with known limits; it is a set of promising observations with many possible interpretations and many engineering paths forward. The people who claim with certainty that current models do not reason are probably wrong in one direction, and the people who claim with certainty that they do reason are probably wrong in another.

What seems safe to predict is that the question of machine reasoning will not be resolved by philosophers or by benchmark designers, but by the next generation of systems that will make current debates feel as quaint as the once-heated question of whether computers could truly “understand” chess.

In the meantime, these models pause, and think, and sometimes solve problems that would take humans considerably longer. Whether they are thinking, reasoning, or merely performing a mathematical approximation of those processes is a question that future historians of technology will perhaps find charming.

Probably they will still be arguing about it.

References

1. Brown, T. et al. “Language Models are Few-Shot Learners.” NeurIPS, 2020. https://arxiv.org/abs/2005.14165

2. OpenAI. “OpenAI o1: Reasoning Models.” https://en.wikipedia.org/wiki/OpenAI_o1

3. DeepSeek. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs.” January 2025. https://arxiv.org/abs/2501.12948

4. Anthropic. “Research at Anthropic.” https://www.anthropic.com/research

5. Anthropic. “Claude Language Model.” https://en.wikipedia.org/wiki/Claude_(language_model)

6. LangChain. “LangGraph Platform General Availability.” May 2025. https://en.wikipedia.org/wiki/LangChain

7. LeCun, Y. “Learning World Models for Autonomous Intelligence.” ICML Keynote, 2023. https://ylecun.com

8. Vaswani, A. et al. “Attention Is All You Need.” NeurIPS, 2017. https://arxiv.org/abs/1706.03762

9. Russell, S. “Human Compatible: Artificial Intelligence and the Problem of Control.” Viking, 2019.

Why "Renting" AI Intelligence Is Killing Your Enterprise Strategy

Alexis Gil Gonzales — Sat, 18 Apr 2026 11:16:35 GMT

Every few weeks, another company tells me they’ve “done AI.” They subscribed to a frontier model, connected it to their SharePoint via RAG, and now expect miracles.

It never works the way they hoped.

Not because the technology is bad; it isn’t. But because slapping a generic LLM over fifteen years of tangled compliance logic, idiosyncratic internal terminology, and poorly documented institutional decisions is like handing a brilliant consultant a box of receipts in Klingon and asking for a tax strategy. The model tries its best. It usually fails in quietly catastrophic ways.

A few weeks ago, Mistral dropped something interesting on the sidelines of Nvidia GTC 2026. It’s called **Mistral Forge**, and it represents a fundamentally different bet on where enterprise AI is heading. I want to walk you through what it actually does, how it compares to what most companies are doing today, and—importantly—what it will expose about your data before you’re ready for it.

What Mistral Forge Actually Is

Let me use an analogy that keeps coming to mind.

Most enterprises are using AI like they’re renting a car at the airport. You get to drive it. You can adjust the seats. You can pick the destination. But you don’t own the engine, you can’t see the schematics, and you absolutely can’t rebuild the transmission for that off-road mountain trail you’re planning to tackle.

Mistral Forge shifts the model from “rental” to “custom commission.”

Instead of relying on public data—which teaches a model to sound like a Reddit commenter or a generic marketer—Forge lets organizations build models that internalize their own domain knowledge. I’m talking models trained on your engineering standards, your compliance policies, your operational records, your historical decisions. Models that don’t need a five-paragraph prompt to understand what “Q4-2024-Compliance-Flag-7” actually means.

Early customers like ASML, Ericsson, the European Space Agency, and Singapore’s DSO aren’t just looking for a smarter search bar. They’re buying strategic autonomy. They want their intellectual property to remain theirs, running on infrastructure that matches their specific risk profile—cloud, on-prem, or hybrid, their choice.

How It Works

Here’s how a Forge pipeline operates.

Continued Pre-Training: Learning Your Language at the Foundation Level

Forget lightweight fine-tuning. Forge lets you ingest massive volumes of raw internal data—codebases, structured logs, internal wikis—at the base model level. During continued pre-training, the model doesn’t just learn to append your acronyms to its responses. It literally learns to treat them as native language. Your internal shorthand stops being gibberish and starts being how it thinks.

Post-Training: SFT and DPO

Once the model speaks your language, you need it to follow your rules. Forge provides pipelines for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This is where your AI team refines behavior for specific tasks—aligning the model with internal KPIs, whether that means zero-tolerance for compliance deviations or rigid formatting for reporting outputs.

Reinforcement Learning for Agentic Workflows

This is where it stops being a chatbot and starts being a system.

Forge supports reinforcement learning designed to align models and agents with internal policies. You can build autonomous agents that navigate internal systems, use proprietary tools correctly, and make decisions without violating governance frameworks. No more hallucinated API calls. No more confidently wrong compliance advice.

Architectural Flexibility: Dense vs. Mixture of Experts

Mistral gives architects choices. Need a robust generalist for back-office tasks? Deploy a Dense model. Need extreme efficiency, lower latency, and reduced computational overhead for complex, multifaceted workflows? MoE architectures route tasks to specialized sub-networks dynamically—so you don’t pay for capabilities you won’t use.

Forward-Deployed Engineers

Recognizing that most enterprises don’t have a bench of PhD-level AI researchers lying around, Mistral is offering Forward-Deployed Engineers. Borrowing from Palantir’s playbook, these engineers embed with your team to help curate data, set up evaluation frameworks, and optimize training pipelines. This isn’t just lip service—building foundation models is genuinely hard, and most internal teams need help.

The Competition: Why Forge Changes the Game

To appreciate what Forge represents, it helps to see where it sits relative to what most companies are doing today. As of March 2026, enterprise AI broadly falls into three categories:

While OpenAI pushes the boundaries of consumer reasoning with GPT-5.4, Mistral is making a quieter but arguably more important bet: that regulated industries don’t just want the smartest model in the world. They want the smartest model *for their specific business*. There’s a meaningful difference there.

The Caveats (Read Before You Pitch the Board)

I’m going to be direct here: Mistral Forge is a powerful product that is going to expose every single flaw in your company’s data infrastructure. If that sentence made you nervous, you should keep reading.

Your data is probably a mess. AI models are exactly what they eat. If your proprietary knowledge consists of 50,000 outdated documents, contradictory policies, and codebases held together by institutional duct tape, Forge will learn to replicate that exact level of chaos with eerie fidelity. You cannot automate a broken process. Data hygiene, governance, and deduplication aren’t optional prep work—they’re the foundation everything else builds on.

This is not a weekend project. Using a fine-tuning API takes days. Building a custom frontier-grade model using pre-training, SFT, and reinforcement learning takes serious MLOps maturity. Even with Mistral’s Forward-Deployed Engineers in your corner, you need dedicated internal teams, robust evaluation pipelines, and realistic timelines.

Evaluation is your new bottleneck. When you rent a model, you implicitly rely on the provider’s safety testing. When you build the model, you own all of it. You need to define internal benchmarks before you start: How do you measure citation accuracy? What’s an acceptable refusal rate for non-compliant requests? If you can’t answer these questions, you shouldn’t be building custom models yet.

The budget is real. Compute isn’t free, and full-cycle model training requires serious GPU resources. Mistral’s open-weight models are efficient, and MoE architectures help with inference costs—but the initial R&D and training compute is still a significant line item. This isn’t an SaaS subscription.

The Strategic Moat

Mistral Forge is a telling product. It acknowledges a hard truth that the industry has been dancing around: the next wave of enterprise AI adoption won’t be won by whoever has the biggest model. It’ll be won by whoever makes it easiest for organizations to own their intelligence layer.

For companies with data maturity, budget, and genuine strategic need to protect their IP (global banks, national defense agencies, cutting-edge manufacturers); Forge is an escape hatch from vendor lock-in. It transforms AI from a generic operational expense into a compounding, proprietary advantage.

For companies still wrestling with data lakes, or sitting on petabytes of barely-organized historical records? Maybe it’s worth sticking with the rental car a while longer. Start curating. Start organizing. The model will be waiting when you’re ready.

What do you think? Is ownership the right bet for enterprise AI, or are most companies better served by improving their rented intelligence? Drop a comment!