<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Aleph Zero]]></title><description><![CDATA[From the first token to infinite intelligence. Exploring the convergence of AI and the aleph that contains all points.]]></description><link>https://blog.aleph-tech.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Awpm!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff73d04aa-89d2-433b-a6bb-f756b438e6ce_600x600.png</url><title>Aleph Zero</title><link>https://blog.aleph-tech.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 07 May 2026 08:40:12 GMT</lastBuildDate><atom:link href="https://blog.aleph-tech.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alexis Gil Gonzales]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alexisgilg@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alexisgilg@substack.com]]></itunes:email><itunes:name><![CDATA[Alexis Gil Gonzales]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alexis Gil Gonzales]]></itunes:author><googleplay:owner><![CDATA[alexisgilg@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alexisgilg@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alexis Gil Gonzales]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Affordable AI Workstations in 2026 - A Practical Guide to Running Large Language Models Without Going Bankrupt]]></title><description><![CDATA[The Quiet Revolution of Local LLMs]]></description><link>https://blog.aleph-tech.com/p/affordable-ai-workstations-in-2026</link><guid isPermaLink="false">https://blog.aleph-tech.com/p/affordable-ai-workstations-in-2026</guid><dc:creator><![CDATA[Alexis Gil Gonzales]]></dc:creator><pubDate>Mon, 04 May 2026 19:28:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aZr9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe72248c3-8dc5-45c8-9e39-2f4b16f200dc_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!aZr9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe72248c3-8dc5-45c8-9e39-2f4b16f200dc_2752x1536.png" width="1456" height="813" alt=""></figure></div>
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>I&#8217;ve been making some research on a hardware upgrade we want to make to our (tiny) datacenter.  wo years ago, running a competent language model locally meant making peace with tokens, cryptic setup rituals, and the constant threat of an out-of-memory crash. That era is decisively over. In 2026, you can run a 70-billion-parameter model on hardware that fits on a desk, sips under 200 watts, and costs less than a used car.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aleph-tech.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Aleph Zero! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Before going further, a distinction that will matter throughout this article: most models discussed are <em>dense</em> models, where every forward pass activates all parameters. <em>Mixture-of-Experts</em> models work differently. A 109-billion-parameter MoE like Llama 4 Scout only activates about 17 billion parameters per token, since each expert handles a slice of the workload. The total model size is still large and needs to live in memory, but the active compute per token is much smaller. This means MoE models fit the memory-capacity constraints of unified-memory platforms far better than their dense cousins at a given parameter count, while delivering competitive quality.</p><p>The democratization of large language models is no longer a distant promise. It is happening right now, on consumer-grade hardware, in home offices and small labs. The technology has matured to the point where the barrier to entry is not technical sophistication but simply knowing which hardware to buy and what it can actually do.</p><p>This article surveys three categories of affordable AI workstations available in 2026: the <strong>AMD Ryzen AI Halo</strong> platform built around the Strix Halo APU, NVIDIA&#8217;s compact <strong>DGX Spark</strong>, and the evergreen option of discrete <strong>RTX consumer GPUs</strong>. We will decompose what LLM workflows actually demand from hardware, match each platform to the tasks it handles well, and close with a look at where this hardware race is heading, including a noteworthy newcomer from <strong>Apple</strong>.</p><div><hr></div><p></p><h2>What Local LLMs Actually Need - Hardware Decomposition</h2><p>Before comparing machines, it helps to understand what running a language model actually requires at each stage. LLM workflows fall into four broad categories, each with very different hardware demands.</p><p><strong>Inference</strong> is the simplest operation: you give the model a prompt, it generates a response. 
The model weights must live in memory. At full 16-bit precision, each billion parameters consumes roughly 2 gigabytes. A 7-billion-parameter dense model needs about 14 gigabytes just for its weights in FP16. A 70-billion-parameter dense model needs 140 gigabytes, which no consumer graphics card can provide. A 109B MoE model at FP16 would need 218 gigabytes total, but only 17 billion parameters activate per forward pass, making the active working set far smaller than the stored size suggests.</p><p>The memory requirement to simply store weights is separate from the compute requirement to run them. <em>This distinction matters enormously for hardware selection: </em>a platform with too little bandwidth to run a dense 70B model at a usable speed may still handle a 100B+ MoE model comfortably, because the hardware only needs to process the active expert parameters during each forward pass, even though all parameters must remain in memory.</p><p><em>Quantization</em> solves this. By storing weights in 4-bit or 8-bit precision, you shrink model sizes dramatically. The community-standard Q4_K_M quantization fits a 7-billion-parameter dense model in about 4 to 5 gigabytes and a 70-billion-parameter dense model in 35 to 40 gigabytes. MoE models behave differently: a 109B MoE like Llama 4 Scout at Q4_K_M occupies roughly 60 gigabytes total, but only 17B parameters activate per forward pass, making it far more tractable on bandwidth-constrained hardware than a dense 70B despite having more total parameters. This is how consumer hardware became viable for large models: not through more VRAM, but through smarter weight representation and architectural choices that reduce active compute per token.</p><p><em>Context length</em> compounds memory needs through the KV cache, which stores attention state for every token in the context window. A 7-billion-parameter model at FP16 precision needs roughly 1 gigabyte for a 4K context and 8 gigabytes for a 32K context. Push to 128K and you are looking at 32 gigabytes just for the cache on that same 7B model.</p><p><strong>Fine-tuning</strong> adjusts a pre-trained model&#8217;s weights for a specific task. Full fine-tuning, which updates every parameter, is extraordinarily memory-hungry. Beyond the model weights themselves, you need to store gradients (the direction each weight should move) and optimizer states (the AdamW optimizer keeps a running estimate of gradient moments in 32-bit precision). For a 7B model at FP16, this multiplies to roughly 70 to 84 gigabytes total. That is a datacenter workload.</p><p><em>LoRA</em> and <em>QLoRA</em> changed the math. LoRA freezes the base model and trains tiny low-rank adapter matrices injected into the existing layers. The adapters are typically 0.1 to 1 percent of the total parameter count, so a 7B model might only need 16 to 20 gigabytes with LoRA at FP16. QLoRA goes further by also quantizing the frozen base model to 4-bit NF4 format, dropping the 7B requirement to 6 to 10 gigabytes. A 70B model that would need 600 gigabytes for full training fits in 32 to 48 gigabytes with QLoRA.</p><p><strong>Knowledge distillation</strong> trains a smaller student model to replicate the behavior of a larger teacher model. The teacher runs in inference mode while the student is trained on its outputs. The key nuance is that the student model is typically one-fifth to one-tenth the size of the teacher, so its training cost is dominated by the student architecture, not the teacher. However, the teacher must remain in memory during distillation, adding an inference-time memory overhead.
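</p><p>All of these budgets come down to the same arithmetic. A rough sketch of the rules of thumb used in this section; ballpark figures, not exact sizes for any specific model:</p><pre><code># Back-of-envelope weight-memory arithmetic: parameters (in billions)
# times bits per weight, divided by 8 bits per byte, gives gigabytes.
def weight_gb(params_billions: float, bits: float) -> float:
    return params_billions * bits / 8

print(weight_gb(70, 16))    # dense 70B at FP16    -> 140.0 GB
print(weight_gb(70, 4.5))   # dense 70B at ~Q4_K_M -> ~39 GB
print(weight_gb(109, 4.5))  # 109B MoE at ~Q4_K_M  -> ~61 GB stored,
                            # though only ~17B parameters are active per token
print(weight_gb(7, 4))      # QLoRA 7B base in 4-bit NF4 -> ~3.5 GB; adapter
                            # gradients, optimizer states, and activations add the rest
</code></pre><p>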
Compared to training a same-sized model from scratch, distillation is far less demanding: the student is small, usually already pretrained on general data, and needs fewer training steps and a smaller dataset to reach the target capability, rather than updating every parameter from random initialization against a full pre-training corpus.</p><p><strong>Full training from scratch</strong> belongs in an entirely different category. Training a new 7B model from initialization requires hundreds of gigabytes across gradients, optimizer states, activations, and weights. This is the domain of GPU clusters with terabytes of HBM memory. We mention it only to draw a clear line: no workstation in this article is designed for this, and anyone suggesting otherwise is selling fantasy.</p><div><hr></div><p></p><h2>The Three Platforms</h2><h4>AMD Ryzen AI Halo</h4><p>AMD&#8217;s Ryzen AI Halo platform, built around the Ryzen AI Max+ 395 processor, is the newest entrant to this market. At its core is a 16-core Zen 5 CPU with an integrated GPU featuring 40 RDNA 3.5 compute units and an NPU rated at 50 TOPS. The standout feature is its <em>unified memory architecture</em>: up to 128GB of LPDDR5X memory shared between the CPU, GPU, and NPU, with a 256-bit memory bus delivering around 212 GB per second in practice.</p><p>AMD&#8217;s Variable Graphics Memory technology lets you dedicate up to 96 gigabytes of that pool as GPU-addressable VRAM. This is the critical advantage: no consumer discrete GPU comes close to 96 GB of video memory. A 128-gigabyte Strix Halo system running a dense 70-billion-parameter model quantized to Q4 sits comfortably in that headroom while leaving 32 gigabytes for the operating system and tooling. An MoE model of similar total parameter count fits just as easily, since the memory footprint is comparable but the active compute per token is lighter.</p><p>The trade-off is bandwidth. At roughly 212 GB per second, Strix Halo is memory-bandwidth-bound on dense 70B models, producing 3 to 5 tokens per second. Mixture-of-Experts models like Llama 4 Scout perform better since they only activate a fraction of parameters per forward pass. For smaller models that fit within the bandwidth budget, token rates are competitive with discrete GPUs.</p><p>The software story has matured significantly. ROCm 7.2, released in late 2025, brought official PyTorch support on Linux and public preview support on Windows. Most notably, AMD confirmed in January 2026 that ROCm is now a first-class platform for vLLM. llama.cpp runs on AMD GPUs through both ROCm and Vulkan backends, with community testing consistently finding Vulkan the more reliable and often faster path for Strix Halo APUs.</p><p>Pricing is the platform&#8217;s strongest card. Fully configured mini-PCs with 128 GB of memory and a Ryzen AI Max+ 395 are available for around $2,500 depending on the vendor, with the Framework Desktop and Beelink GTR9 Pro representing the most polished options. AMD&#8217;s own reference platform, launching mid-2026, is expected in the $2,000 to $3,000 range.</p><h4>NVIDIA DGX Spark</h4><p>NVIDIA&#8217;s DGX Spark, formerly known as Project DIGITS, is the smallest member of the DGX family but shares the same software stack as its datacenter siblings.
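</p><p>Before opening up the Spark, the bandwidth ceiling just described is worth making explicit, because it governs every platform in this article: a dense model must stream every weight once per generated token, so bandwidth divided by model bytes gives a hard upper bound. A rough sketch:</p><pre><code># Roofline-style decode estimate: a dense model reads all of its weights
# once per generated token, so bandwidth / model-bytes is a hard ceiling.
def tokens_per_second(bandwidth_gb_s: float, params_b: float, bits: float) -> float:
    bytes_per_token = params_b * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(tokens_per_second(212, 70, 4.5))   # Strix Halo, dense 70B Q4    -> ~5.4 tok/s ceiling
print(tokens_per_second(212, 17, 4.5))   # same box, MoE w/ 17B active -> ~22 tok/s ceiling
print(tokens_per_second(1792, 34, 4.5))  # RTX 5090, dense 34B Q4      -> ~94 tok/s ceiling
</code></pre><p>Real-world rates land below these ceilings, which is exactly why Strix Halo settles at 3 to 5 tokens per second on dense 70B models.</p><p>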
It is built around the GB10 Grace Blackwell Superchip, pairing a 20-core Arm CPU with a Blackwell GPU featuring 6,144 CUDA cores and fifth-generation Tensor Cores. The result is a system capable of 1 PetaFLOP at FP4 sparse precision.</p><p>The DGX Spark ships with 128 GB of unified LPDDR5X memory at 273 GB per second bandwidth, running NVIDIA&#8217;s DGX OS 7.4, which is Ubuntu 24.04 with CUDA 13 and a curated AI software stack pre-installed. PyTorch with Blackwell optimizations, TensorRT-LLM, vLLM, Ollama, and Docker with NVIDIA Container Runtime all come configured out of the box. If you want to prototype on a compact desktop machine and deploy to an H100 cluster, the software environment is identical on both.</p><p>Inference scales to 200 billion parameters at FP4 quantization on a single unit, with two DGX Spark units linked via ConnectX-7 at 200 gigabits per second capable of running 405-billion-parameter models. On the training side, the Spark supports full fine-tuning up to 8 billion parameters at 16K context length and QLoRA up to 70 billion parameters. The CUDA ecosystem, torch.compile support, and native Docker GPU passthrough make this the most capable platform for developers working across training, fine-tuning, and inference.</p><p>The current price is $4,699 for the Founder&#8217;s Edition with 4 TB of NVMe storage, up from the $2,999 announcement price and $3,999 reservation price, reflecting the realities of the memory market in early 2026.</p><h4>Discrete RTX GPUs</h4><p>The traditional path to local AI compute remains relevant in 2026, now powered by the RTX 50 series Blackwell architecture alongside capable used RTX 40 series hardware.</p><p>The RTX 5090 leads the consumer lineup with 32 gigabytes of GDDR7 memory, 1,792 GB per second bandwidth, and 3,352 AI TOPS. At $1,999 MSRP, it comes closer than any other consumer card to hosting a dense 70B model, but a Q4 quantization of one still weighs 35 to 40 gigabytes, so even the 5090 must drop to a tighter quantization or offload part of the model to CPU memory. MoE models are a different story: a 200B+ MoE at Q4 may need 80 to 100 gigabytes just for weights, and while the active compute per token is smaller, the total storage requirement is the bottleneck for loading. The RTX 5090 is PCIe 5.0, which matters more for fine-tuning data movement than for inference.</p><p>The RTX 5070 at $549 MSRP has displaced the RTX 4070 as the budget sweet spot. With 12 GB of GDDR7 and roughly 35 to 45 percent faster inference performance than the RTX 4070 Ti, it handles 7B and 13B models comfortably and fits 14B at Q4.</p><p>The used market offers compelling alternatives. The RTX 4090 at $700 to $900 used delivers 24 GB of GDDR6X and remains the best single-GPU option for 30B+ models. The older RTX 3090 at $450 to $600 used is the best VRAM-per-dollar option for running larger models on a budget, trading some speed for the 24-gigabyte ceiling.</p><p>The RTX platform&#8217;s advantage is ecosystem breadth. CUDA, PyTorch, TensorRT, and every popular inference framework have been optimized for NVIDIA consumer GPUs for over a decade. The community knowledge base is unparalleled. The disadvantage is the same as always: discrete VRAM is finite and expensive to expand. A 70B model at high context lengths will simply refuse to load on any single consumer GPU.</p><div><hr></div><p></p><h2>Matching Platforms to Workflows</h2><p><strong>Pure inference on 70B+ models</strong>: Strix Halo and the DGX Spark have no competition among consumer-class hardware here.
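</p><p>The reason is capacity arithmetic, not brand loyalty. A back-of-envelope fit check using the same rules of thumb as before (the KV and overhead constants are loose allowances, not measurements):</p><pre><code># Does a model fit? Weights plus KV cache plus a loose runtime allowance,
# compared against the memory the GPU (or unified pool) can address.
def fits(mem_gb: float, params_b: float, bits: float,
         kv_gb: float = 2.0, overhead_gb: float = 1.5) -> tuple[float, bool]:
    need = params_b * bits / 8 + kv_gb + overhead_gb
    return round(need, 1), need &lt;= mem_gb

print(fits(32, 70, 4.5))    # dense 70B Q4 on an RTX 5090          -> (42.9, False)
print(fits(96, 70, 4.5))    # same model in Strix Halo's 96 GB VGM -> (42.9, True)
print(fits(128, 109, 4.5))  # Llama 4 Scout-class MoE on DGX Spark -> (64.8, True)
</code></pre><p>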
Neither can match the token throughput of a bandwidth-rich discrete GPU on smaller models, but both can load a dense 70B Q4 model that a single RTX 5090 can only fit at tighter quantizations or with CPU offloading. MoE models extend this advantage further: Strix Halo&#8217;s 96 GB VGM allocation accommodates MoE models up to roughly 180 billion total parameters at Q4, delivering usable token rates despite the massive total size, because only a fraction of parameters activate per forward pass. Strix Halo wins on dollar value; the DGX Spark wins on CUDA ecosystem and fine-tuning capability.</p><p><strong>Development, fine-tuning, and CUDA workflows</strong>: The DGX Spark is the default recommendation. The out-of-the-box software stack, torch.compile support, and seamless path from prototype to datacenter deployment make it the most serious development workstation in this comparison. If you are writing training code, building agents, or iterating on fine-tuning recipes, this is the machine that will not fight you.</p><p><strong>Speed-first inference on smaller models</strong>: An RTX 5090, or a used RTX 4090, wins on tokens per second for any model that fits in its VRAM. For 7B through 34B models at Q4 with 8K to 32K context, discrete GPUs generate tokens 3 to 4 times faster than unified-memory platforms at similar cost.</p><p><strong>Fine-tuning on a budget</strong>: Strix Halo at $800 to $1,500 is surprisingly capable with QLoRA. A 7B model fine-tune fits in 6 to 10 gigabytes. A 13B model needs 10 to 16 gigabytes, comfortably within Strix Halo&#8217;s 96-gigabyte VGM allocation. The DGX Spark handles LoRA and QLoRA comfortably and adds full fine-tuning for 8B models. A used RTX 4090 remains competitive for LoRA on 7B to 13B models.</p><p><strong>Portable workstation</strong>: The DGX Spark&#8217;s 1.2-kilogram footprint and sub-200-watt power draw make it the most capable machine that can live permanently on a desk without becoming furniture. Strix Halo mini-PCs are similar in this regard. Neither is laptop-class, but both are far more desk-friendly than a traditional workstation with a full-length graphics card.</p><div><hr></div><p></p><h2>Hardware Evolution and the Road Ahead</h2><p>The unified memory architecture that Apple mainstreamed with the M1 in 2020 has become the defining trend in AI workstation design. Both the DGX Spark and AMD Strix Halo have followed Apple&#8217;s lead, confirming that the traditional separation between system RAM and GPU VRAM is an artifact of PCI Express bandwidth constraints, not an engineering necessity.</p><p>This convergence is happening because LLM inference does not map cleanly onto either CPU or GPU design points. Autoregressive decoding is memory-bandwidth-bound rather than compute-bound, which means raw shader FLOPS matter less than memory capacity and bandwidth. Unified memory eliminates the PCIe bottleneck at the cost of sharing a single memory system, and its bandwidth, between CPU and GPU, a trade-off that increasingly favors large model capacity over raw speed.</p><p>The other significant trend is the maturation of quantization. What started as a desperate measure to fit large models into small VRAM has become a first-class inference technique. Q4_K_M is now the community standard, and frameworks like llama.cpp and vLLM handle it with no user intervention required.
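</p><p>Concretely, running a pre-quantized GGUF takes no quantization knowledge at all. A minimal sketch with llama-cpp-python, where the file path, model choice, and context size are assumptions to adapt:</p><pre><code>from llama_cpp import Llama  # pip install llama-cpp-python

# The Q4_K_M de-quantization happens transparently inside the runtime;
# n_gpu_layers=-1 asks it to offload every layer it can to the GPU.
llm = Llama(
    model_path="./llama-3.3-70b-instruct.Q4_K_M.gguf",  # assumed local file
    n_ctx=8192,
    n_gpu_layers=-1,
)
out = llm("Q: Why is token generation memory-bandwidth-bound? A:", max_tokens=128)
print(out["choices"][0]["text"])
</code></pre><p>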
The average practitioner no longer needs to understand the difference between GPTQ and AWQ internals to run a 70B model on consumer hardware.</p><div><hr></div><p></p><h2>A Special Note on Apple M5 Max</h2><p>No survey of AI workstations would be complete without acknowledging Apple silicon&#8217;s trajectory, even though the M5 Max sits outside the three platforms we have focused on.</p><p>The M5 Max, announced in March 2026, introduces a dual-die 3-nanometer Fusion Architecture with up to 40 GPU cores each containing a dedicated Neural Accelerator, in addition to a 16-core Neural Engine. With up to 128 GB of unified memory at 614 GB per second of bandwidth, it runs a 70B Q4 model at comfortably interactive speeds, something no single consumer GPU can offer at that model size, while consuming a fraction of the power.</p><p>The architectural novelty is the per-core Neural Accelerator, which provides dedicated matrix-multiplication throughput that the previous generation lacked. Apple&#8217;s MLX framework is the software layer that extracts this performance, and it is genuinely impressive for a unified-memory platform. The catch remains what it has always been: <strong>MLX is Apple-only</strong>. Any code written for it does not transfer to CUDA or ROCm environments.</p><p>For researchers and developers invested in the Apple ecosystem, the M5 Max is the best portable AI machine Apple has shipped. For everyone else, the ecosystem lock-in is a legitimate concern that the hardware excellence cannot fully offset.</p><div><hr></div><h2>The Ecosystem Divide</h2><p>One practical dimension that cuts across all three platforms is the software ecosystem.</p><p><em>NVIDIA&#8217;s CUDA</em> is the default for AI research. PyTorch, TensorFlow, JAX, and every major ML framework have CUDA as their first-class target. The DGX Spark runs the same containers as an H100 cluster. If your work involves anything beyond inference, CUDA compatibility is not optional; it is the foundation.</p><p><em>AMD&#8217;s ROCm</em> has closed much of the gap in 2025 and 2026. vLLM support, growing PyTorch compatibility, and llama.cpp validation on Strix Halo APUs have removed the worst pain points. ROCm still requires more care in setup than CUDA, and some libraries lag behind, but the trajectory is clear: AMD is serious about being a genuine CUDA alternative, not merely a cheaper one.</p><p><em>Apple&#8217;s MLX</em> is the fastest framework on M5 Max hardware but is available exclusively on Apple silicon. The lock-in is real. MLX models are not drop-in replacements for CUDA models, and the community support, while growing, does not approach the breadth of the NVIDIA ecosystem.</p><p>The practical implication is straightforward: if you are doing anything beyond inference, the DGX Spark&#8217;s software advantage is substantial. If you are purely running inference and cost matters, Strix Halo offers the best model-capacity-per-dollar by a significant margin.
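</p><p>For the Apple path, the code is just as short. A sketch assuming the mlx-lm package and an MLX-converted community checkpoint (the model name is illustrative):</p><pre><code>from mlx_lm import load, generate  # pip install mlx-lm; Apple silicon only

# Any MLX-converted checkpoint from the mlx-community Hugging Face org
# follows this pattern; weights load straight into unified memory.
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")
print(generate(model, tokenizer,
               prompt="Summarize unified memory in two sentences.",
               max_tokens=100))
</code></pre><p>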
If you are already in the Apple ecosystem and want the fastest possible local inference, the M5 Max delivers.</p><div><hr></div><h2>Summary and Closing Thoughts</h2><p>The following table summarizes our comparison.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|p{3.8cm}|p{3.8cm}|p{3.8cm}|p{3.8cm}|}\n\\hline\n\\textbf{Spec} &amp; \\textbf{AMD Strix Halo} &amp; \\textbf{NVIDIA DGX Spark} &amp; \\textbf{RTX 5090} &amp; \\textbf{Apple M5 Max} \\\\\n\\hline\n\\textbf{Price} &amp; \\text{\\$800--\\$2,500} &amp; \\text{\\$4,699} &amp; \\text{\\$1,999} &amp; \\text{\\$4,999} \\\\\n\\hline\n\\textbf{CPU} &amp; \\text{16-core Zen 5} &amp; \\text{20-core Arm (10x Cortex-X925)} &amp; \\text{N/A (requires host)} &amp; \\text{18-core (6P+12E)} \\\\\n\\hline\n\\textbf{GPU} &amp; \\text{40 RDNA 3.5 CUs} &amp; \\text{6,144 Blackwell CUDA cores} &amp; \\text{21,760 CUDA cores} &amp; \\text{40-core GPU+Neural Accel.} \\\\\n\\hline\n\\textbf{Memory} &amp; \\text{128 GB unified LPDDR5X} &amp; \\text{128 GB unified LPDDR5X} &amp; \\text{32 GB GDDR7} &amp; \\text{128 GB unified LPDDR5X} \\\\\n\\hline\n\\textbf{Bandwidth} &amp; \\text{~212 GB/s} &amp; \\text{273 GB/s} &amp; \\text{1,792 GB/s} &amp; \\text{614 GB/s} \\\\\n\\hline\n\\textbf{AI TOPS} &amp; \\text{126 TOPS (combined)} &amp; \\text{1,000 TOPS} &amp; \\text{3,352 TOPS} &amp; \\text{~800 TOPS (est.)} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;LWHWAQLMXU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|p{3.8cm}|p{3.8cm}|p{3.8cm}|p{3.8cm}|}\n\\hline\n\\textbf{Max VRAM} &amp; \\text{96 GB (VGM)} &amp; \\text{128 GB (unified)} &amp; \\text{32 GB} &amp; \\text{128 GB (unified)} \\\\\n\\hline\n\\textbf{Max dense Q4} &amp; \\text{~100B parameters} &amp; \\text{200B parameters (FP4)} &amp; \\text{~50B parameters} &amp; \\text{70B parameters} \\\\\n\\hline\n\\textbf{Max MoE Q4} &amp; \\text{~180B total parameters} &amp; \\text{~350B total parameters} &amp; \\text{~50--60B total parameters} &amp; \\text{~120B total parameters} \\\\\n\\hline\n\\textbf{LoRA fine-tune} &amp; \\text{Up to 13B} &amp; \\text{Up to 70B} &amp; \\text{Up to 13B} &amp; \\text{LoRA via MLX} \\\\\n\\hline\n\\textbf{Full fine-tune} &amp; \\text{Up to 12B FP16} &amp; \\text{Up to 8B 16K ctx} &amp; \\text{Up to 7B FP16} &amp; \\text{Not validated} \\\\\n\\hline\n\\textbf{Power draw} &amp; \\text{45--120W} &amp; \\text{~150W} &amp; \\text{450W+} &amp; \\text{~40W (entire chip)} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;LOZIIACBWP&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>And as a rough capability matrix we&#8217;d have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|c|c|c|c|}\n\\hline\n\\textbf{Task} &amp; \\textbf{Strix Halo} &amp; \\textbf{DGX Spark} &amp; \\textbf{RTX 5090} &amp; \\textbf{M5 Max} \\\\\n\\hline\n\\text{Max model per dollar} &amp; &#9989; &amp; &#10060; &amp; &#10060; &amp; &#10060; \\\\\n\\hline\n\\text{CUDA dev / fine-tuning} &amp; &#10060; &amp; &#9989; &amp; &#10060; &amp; &#10060; \\\\\n\\hline\n\\text{Speed on smaller models} &amp; &#10060; &amp; &#10060; &amp; &#9989; &amp; &#10060; \\\\\n\\hline\n\\text{Apple ecosystem users} &amp; &#10060; &amp; &#10060; &amp; &#10060; &amp; &#9989; \\\\\n\\hline\n\\end{array}\n&quot;,&quot;id&quot;:&quot;LURMVHPFEL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>2026 is a peculiar
year to be buying AI hardware. The pace of improvement means that any machine purchased today will look modest within two years. But the floor has risen dramatically. A $2,000 Strix Halo box can run models that required a $50,000 server rack two years ago. A $4,700 DGX Spark gives you datacenter-class software in a desktop form factor. Even a $550 RTX 5070 handles inference workloads that demanded a dual-GPU workstation not long ago.</p><p>The question is no longer whether local AI is viable. It is which platform matches your workflow, your ecosystem preferences, and your budget. The good news is that all three options covered here are legitimate, production-capable machines, not science projects. Pick the one that fits how you actually work.</p>]]></content:encoded></item><item><title><![CDATA[Local Models - A Guide to Running LLMs on Your Hardware]]></title><description><![CDATA[A landscape review of the best local language models in 2026 - from beefy 70B giants to pocket-sized 0.8B sidekicks]]></description><link>https://blog.aleph-tech.com/p/local-models-a-guide-to-running-llms</link><guid isPermaLink="false">https://blog.aleph-tech.com/p/local-models-a-guide-to-running-llms</guid><dc:creator><![CDATA[Alexis Gil Gonzales]]></dc:creator><pubDate>Fri, 01 May 2026 17:45:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M6ax!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83affff9-6321-4a68-8833-ee626fc2bed3_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!M6ax!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83affff9-6321-4a68-8833-ee626fc2bed3_2752x1536.png" width="1456" height="813" alt=""></figure></div>
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Not so long ago, running a halfway-decent language model locally meant praying to the GPU gods and sacrificing your afternoon to a loading spinner. Those days are gone. The local LLM ecosystem has evolved at a pace that would make even the most jaded tech optimist raise an eyebrow.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aleph-tech.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Today&#8217;s landscape offers genuine, production-ready models for nearly every machine configuration : from beefy workstations with 64GB+ of RAM down to laptops that wouldn&#8217;t dream of touching a gaming GPU. The question isn&#8217;t whether local AI is viable anymore. It&#8217;s which model actually belongs on <em>your</em> system.</p><p>This guide cuts through the noise. I&#8217;ve attempted to draw a clear picture of where each model shines, where it wobbles, and  (most importantly) which task it was born to handle.</p><p>Let&#8217;s dive in.</p><div><hr></div><p></p><h2>Understanding the Stack - Quantization, VRAM, and Why It Matters</h2><p>Before getting to the good stuff, a quick explainer for the uninitiated.</p><p><strong>VRAM (Video RAM)</strong> is the memory on your graphics card. When running AI models locally, all the model weights and computations happen on the GPU, so you need enough VRAM to hold the entire model; think of it as RAM dedicated to your graphics card. More VRAM means you can run larger models or higher precision quantization. </p><p>When we talk about models in this guide, we&#8217;re almost exclusively discussing <strong>GGUF-quantized models</strong> : a format that squeezes large models into smaller file sizes with minimal quality loss. The quantization level (Q8_0, Q6_K, Q4_K_M, etc.) represents the precision at which the model&#8217;s weights are stored. Lower numbers mean smaller files and less VRAM required, but also some quality degradation.</p><p>For context: a Q8_0 model retains ~99% of quality but needs significantly more VRAM than a Q4_K_M model at ~95% quality. For most users, Q6_K or Q4_K_M hits the sweet spot between size and performance.</p><p>Also worth noting: <strong>context length</strong> (e.g. how many tokens the model can &#8220;see&#8221; at once) varies dramatically. Some models offer 128K tokens (enough for a short novel), while others push to 1M tokens (enough to ingest your entire code repository).</p><p>Let&#8217;s meet the players.</p><div><hr></div><p></p><h2>The 64GB Tier - Where Power Meets Patience</h2><p>These are the models that make you reconsider whether you really needed that monitor upgrade. 
They demand serious hardware but deliver real flagship-level performance.</p><h4>Qwen3.6-27B - The General Purpose Titan</h4><p>If you want one model that does nearly everything at a high level (coding, reasoning, agent workflows, general chat) the Qwen3.6-27B is your workhorse. This dense model punches well above its weight class, achieving 77.2% on SWE-bench Verified (a coding benchmark) and 94.1% on AIME26 (math olympiad problems). For reference, that&#8217;s better than some models twice its size.</p><p>The secret sauce? A hybrid architecture combining Gated DeltaNet with Gated Attention, plus multimodality that handles images and video out of the box. The &#8220;Thinking Preservation&#8221; mechanism keeps multi-step reasoning coherent across iterations : a godsend for complex agent tasks.</p><p>At ~28.6GB in Q8_0, you&#8217;ll need a serious GPU. This isn&#8217;t a model for the faint-hearted or the RTX 3060 crowd.</p><h4>Qwen3.6-35B-A3B - The Efficient Performer</h4><p>Meet the MoE (Mixture of Experts) counterpart to the 27B. With only 3 billion parameters active per token out of 35B total, this model flies : 3-5&#215; faster inference than the dense 27B while maintaining comparable quality.</p><p>The benchmark picture tells the story : 73.4% on SWE-bench Verified, 92.7% on AIME26, and 85.2% on MMLU-Pro. If you&#8217;re building coding agents or need fast iteration on complex tasks, this is the MoE you want.</p><p>The downside is that Q6_K still needs ~25.6GB, so plan accordingly.</p><h4>Llama 3.3 70B - The Safe Big-Model Choice</h4><p>Meta&#8217;s 70B remains the reliable choice for workloads that need breadth. With 128K context and excellent multilingual support, it&#8217;s the workhorse for long-form writing, broad world knowledge, and situations where model reliability trumps raw benchmark chasing.</p><p>On IFEval (instruction following), Llama 3.3 70B actually outperforms models twice its size at 92.1%. It&#8217;s the model you reach for when you need to trust that your instructions will be followed without drama.</p><p>It won&#8217;t win any benchmark beauty pageants in 2026. But it will consistently deliver solid outputs, and sometimes that&#8217;s worth more than a flashy leaderboard position.</p><h4>Gemma 4 31B - The Reasoning Champion</h4><p>Google DeepMind&#8217;s Gemma 4 31B is the math and coding specialist that makes other models nervous. With 89.2% on AIME 2026 (math) and a Codeforces ELO of 2150, it&#8217;s the choice when your work involves analytical challenges.</p><p>The multimodal support is excellent (text, image, audio, video) and the native thinking mode lets you watch the model reason through problems step by step. For anyone doing complex technical work, this model&#8217;s thinking process is almost as valuable as its outputs.</p><p>At ~32.6GB in Q8_0, it&#8217;s a workstation model. But if your work involves heavy reasoning and coding, the investment pays off.</p><h4>Kimi-Linear-48B-A3B - The Context King</h4><p>When 1M token context lengths were a novelty, the Kimi-Linear-48B made them practical. Its hybrid linear attention architecture delivers 6&#215; faster decoding at 1M tokens compared to traditional attention, with a 75% reduction in KV cache memory usage.</p><p>For research, massive document analysis, or whole-codebase Q&amp;A, this is the model that makes &#8220;ingest everything&#8221; actually feasible on local hardware.
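</p><p>The arithmetic behind that claim is worth seeing once: KV-cache memory grows linearly with context. A rough sketch, using a common rule of thumb (a 7B-class dense model at FP16 accumulates about 1 GB of KV cache per 4K tokens; exact figures vary with layer count and attention design):</p><pre><code># Linear KV-cache growth, and what a 75% KV-cache reduction buys you.
def kv_gb(tokens: int, gb_per_4k: float = 1.0) -> float:
    return tokens / 4096 * gb_per_4k

for ctx in (4_096, 32_768, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens: ~{kv_gb(ctx):6.1f} GB full, "
          f"~{kv_gb(ctx) * 0.25:5.1f} GB at -75%")
# 1M tokens: ~256 GB with standard attention, ~64 GB with the reduction
</code></pre><p>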
Just don&#8217;t expect the highest raw benchmark scores : the architecture trades some MMLU-Pro performance for that incredible context efficiency.</p><h4>The Specialists</h4><p>Three more models deserve quick recognition:</p><p>- <strong>Nemotron Super 49B v1.5</strong> : NVIDIA&#8217;s reasoning specialist, hybrid Mamba-Transformer architecture, optimized for agentic tasks. The 1M token context is real, not marketing.</p><p>- <strong>Qwen3-30B-A3B-Thinking-2507</strong> : The thinking model for when you need visible, step-by-step reasoning on math and logic problems. 85% on AIME25, with a thinking mode that actually works.</p><p>- <strong>Qwen3-VL-32B</strong> : Vision-language specialist for OCR, document parsing, chart analysis, and multimodal agent workflows. If you need to understand images deeply, this is your model.</p><div><hr></div><p></p><h2>The 32GB Sweet Spot - High Performance, Realistic Hardware</h2><p>This tier represents the realistic sweet spot for most power users : machines with 32GB VRAM that still want decent performance without the workstation upgrade.</p><h4>Qwen3.5 27B - The People&#8217;s Champion</h4><p>The Qwen3.5 27B is what happens when quantization maturity meets excellent architecture. At Q6_K (~25GB), it delivers 86.1% on MMLU-Pro and 95% on IFEval: numbers that would have turned heads two years ago at twice the size.</p><p>The 262K native context means you can actually use that context without jumping through YaRN hoops. Multimodal support handles images. Tool calling works reliably. For general-purpose work (writing, research, coding, agents) this model delivers great quality at a realistic footprint.</p><h4>Gemma 4 31B - Premium Quality, Premium Price</h4><p>The 31B dense model for when quality is non-negotiable and speed is nice-but-not-essential. At 89.2% AIME and 2150 Codeforces ELO, the benchmark case writes itself. The 256K context and multimodal support are just icing.</p><p>On the minus side, Q6_K needs ~25GB, and Q4 still wants 18GB. Plan your VRAM accordingly.</p><h4>Qwen3.6-35B-A3B (UD-Q4_K_M) - The Efficient Performer</h4><p>Remember this model from the 64GB tier? At Q4_K_M quantization (~20-21GB), it becomes a real 32GB option without meaningful quality loss. The MoE architecture means you get 35B parameters worth of quality at 3B active parameter speed.</p><p>For coding agents, tool use, and fast iteration, this is the model that lets your 32GB machine punch above its weight class.</p><h4>DeepSeek-R1 Distill Qwen 32B - The Math Specialist</h4><p>When your work is math-heavy, this distilled DeepSeek R1 model delivers exceptional results. 94.3% on MATH-500 and 72.6% on AIME 2024. Those numbers belong to models twice its size.</p><p>The tradeoff : code performance (57.2% on LiveCodeBench) lags behind the generalists, and the 128K context is notably shorter than competitors&#8217;. But if you&#8217;re building a math-focused application, the R1 distillation is remarkably cost-effective.</p><h4>Mistral Small 24B - The Agentic All-Rounder</h4><p>Mistral&#8217;s 24B hits a different niche : tool-calling and agent workflows. With 84.8% on HumanEval and strong instruction following, it&#8217;s the model for building assistants that need to reliably call functions and execute multi-step workflows.</p><p>The 32K context is the main limitation.
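</p><p>Tool calling against a local server is pleasantly boring in practice. A sketch using the OpenAI-compatible endpoint that most local servers (Ollama, vLLM, llama.cpp) expose; the base URL, model tag, and function schema are all assumptions:</p><pre><code>from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# A hypothetical business function the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice_status",
        "description": "Look up the status of an invoice by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistral-small",
    messages=[{"role": "user", "content": "Is invoice INV-1042 paid?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the structured call, if one was made
</code></pre><p>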
But for local business automation, chat interfaces, and function-calling heavy applications, Mistral Small delivers at a reasonable footprint (~19GB Q6_K).</p><h4>The Supporting Cast</h4><p>- <strong>Gemma 4 26B A4B </strong>: MoE efficiency with 4B active params out of 26B total. Lower absolute performance than the 31B dense, but excellent for its VRAM efficiency.</p><p>- <strong>Qwen3.5 9B</strong> : A remarkable compact performer at ~13GB. 82.5% MMLU-Pro makes it a legitimate daily driver for users who don&#8217;t need maximum quality.</p><p>- <strong>Llama 3.1 8B</strong> : The stable, mature option for users who value ecosystem over benchmarks. Still useful for RAG, document ingestion, and long prompts.</p><div><hr></div><p></p><h2>The 16GB Realism Tier - Doing More With Less</h2><p>This is where local AI gets <em>really</em> democratized. 16GB VRAM (the domain of gaming GPUs and mobile workstations) can now run models that would have been science fiction a few years ago.</p><h4>Qwen3.5 9B - The Daily Driver</h4><p>At ~9GB in Q4_K_M, this is the model that fits on an RTX 3060 while delivering 82.5% MMLU-Pro and 91.5% IFEval. That&#8217;s not &#8220;good for a 9B model&#8221; : that&#8217;s just good, period.</p><p>For general chat, drafting, research, and daily tasks, the Qwen3.5 9B is the choice that makes local AI practical for anyone with consumer hardware.</p><h4>DeepSeek-R1 Distill Qwen 7B - The Math Genius</h4><p>The distilled R1 reasoning capabilities in a 4-5GB package: 92.8% on MATH-500 and 55.5% on AIME. For math-focused applications, this is the budget choice that doesn&#8217;t compromise.</p><p>Just don&#8217;t ask it to code. At 37.6% on LiveCodeBench, the R1 distillation is a specialist, not a generalist.</p><h4>Qwen2.5 Coder 7B - The Code Specialist</h4><p>Speaking of specialists: if coding is the task, the Qwen2.5 Coder 7B delivers ~85% on HumanEval at ~4.7GB. For completions, refactors, debugging, and repo Q&amp;A, this is the dedicated code model that beats generalists on code tasks.</p><p>On the minus side, general knowledge (40.1% MMLU-Pro) is not this model&#8217;s strength.</p><h4>Phi-4 Mini Reasoning - The Compact Thinker</h4><p>At 3.8B parameters and ~2.5GB, the Phi-4 Mini Reasoning punches far above its weight class on math: 94.6% on MATH-500 and 57.5% on AIME. Those numbers are remarkable for a sub-4GB model.</p><p>English-only is the core limitation. But for math-heavy applications where you need reasoning in a tiny package, Phi-4 Mini Reasoning is a revelation.</p><h4>Gemma 4 E4B - The Multimodal Lightweight</h4><p>For tasks that need vision without the VRAM cost, Gemma 4 E4B delivers text + image + audio understanding at ~5-6GB. It&#8217;s the model for edge deployment, laptops without dedicated GPUs, and applications that need multimodal support without the flagship footprint.</p><p>Benchmarks are modest, but the capability-to-footprint ratio is exceptional.</p><h4>The Micro Models</h4><p>The bottom of the stack still has great utility:</p><p>- <strong>Phi-3.5 Mini</strong> : Strong code (86% Python) and 128K context in ~2.8GB. Older model, but still useful.</p><p>- <strong>Qwen3.5 2B</strong> : 262K context in 1.3GB. The tiny giant for long-context retrieval tasks.</p><p>- <strong>Qwen3.5 0.8B</strong> : 262K context in under 1GB. Classification, routing, triage: tasks that don&#8217;t need reasoning.</p><p>- <strong>Gemma 4 E2B-it</strong>: Multimodal in 4GB. Runs on smartphones.
The edge AI frontier.</p><div><hr></div><p></p><h2>Use Case Recommendations</h2><p>After reviewing the full landscape, here&#8217;s some practical guidance.</p><p><strong>For 64GB+ Workstations :</strong></p><p>- <strong>General purpose</strong> : Qwen3.6-27B, the do-anything flagship</p><p>- <strong>Speed + quality</strong> : Qwen3.6-35B-A3B, MoE efficiency, superior quality</p><p>- <strong>Math &amp; reasoning</strong> : Gemma 4 31B, 89.2% AIME speaks for itself</p><p>- <strong>Longest context</strong> : Kimi-Linear-48B-A3B, 1M tokens, 6&#215; faster</p><p>- <strong>Coding agents</strong> : Qwen3-Coder 30B-A3B, specialized for code work</p><p><strong>For 32GB Machines :</strong></p><p>- <strong>Best overall</strong> : Qwen3.5 27B, benchmark leader, excellent quality</p><p>- <strong>Value pick</strong> : Qwen3.6-35B-A3B Q4_K_M, MoE efficiency at realistic VRAM footprint</p><p>- <strong>Premium quality</strong> : Gemma 4 31B Q6_K, when quality trumps everything else</p><p>- <strong>Math focus</strong> : DeepSeek-R1 32B, the math specialist</p><p>- <strong>Tool calling</strong> : Mistral Small 24B, agentic workflows done right</p><p><strong>For 16GB Machines :</strong></p><p>- <strong>Best benchmarks</strong> : Qwen3.5 9B Q4, leaderboard-level scores at consumer GPU price</p><p>- <strong>Math + budget</strong> : DeepSeek-R1 7B, exceptional math, tiny footprint</p><p>- <strong>Coding specialist</strong> : Qwen2.5 Coder 7B, dedicated code model</p><p>- <strong>Compact reasoning</strong> : Phi-4 Mini Reasoning, 2.5GB of math magic</p><p>- <strong>Edge/mobile</strong> : Gemma 4 E2B, truly portable AI</p><div><hr></div><p></p><h2>Liquid Foundation Models - The Architecture That Thinks Differently</h2><p>While every other model in this guide relies on Transformer derivatives (attention mechanisms, feed-forward layers, the usual suspects), Liquid Foundation Models take a fundamentally different approach. Built on Liquid Neural Networks (LNNs), these models are rooted in dynamical systems and signal processing rather than the attention-is-all-you-need paradigm. The result is a family of models that prioritizes real-time adaptation, millisecond latency, and genuine on-device deployment.</p><p>Liquid AI&#8217;s model lineup spans two main series: <strong>LFM2</strong> and the newer <strong>LFM2.5</strong>, with variants ranging from 350M to 24B parameters. The philosophy is consistent across sizes: build models that run efficiently anywhere, from cloud servers to smartwatches, without sacrificing reliability.</p><h4>The LFM2 Series - Production-Ready Foundations</h4><p>The LFM2 series represents Liquid AI&#8217;s current production offering, designed for developers who need deploy-anywhere flexibility.</p><p><strong>Text Models</strong></p><p>- <strong>LFM2-350M</strong> : The lightest option in the family. CPU, NPU, and GPU execution make it genuinely device-agnostic : this model can run on hardware that wouldn&#8217;t dream of running a Llama variant. Benchmarks are modest (43.43% MMLU, 65.12% IFEval), but for simple classification, extraction, and routing tasks, it&#8217;s remarkably capable.</p><p>- <strong>LFM2-700M</strong> : The efficiency midpoint. Multilingual support is a standout feature &#8212; if you&#8217;re building applications that need to handle non-English text without cloud dependency, this model&#8217;s language handling is a genuine asset.
49.9% MMLU and 72.23% IFEval place it ahead of Qwen3.5 2B on most metrics while maintaining a similar footprint.</p><p>- <strong>LFM2-8B-A1B</strong> : The 8-billion parameter MoE variant with only 1B active parameters per token. This is Liquid AI&#8217;s answer to the Qwen3.5 9B question : comparable quality to dense 8B models at a fraction of the active compute. For on-device AI assistants, local chat, and privacy-sensitive applications, this model makes a lot of sense.</p><p>- <strong>LFM2-24B-A2B</strong> : The flagship text model in the LFM2 series. With 24B total parameters and only 2B active per token, it delivers tool-calling and agentic capabilities on consumer hardware without cloud dependency. This is the Liquid model for serious local agents; the one that may replace your cloud API calls for all but the most demanding tasks.</p><p><strong>Vision-Language Models</strong></p><p>- <strong>LFM2-VL-450M</strong> : Compact multimodal processing at under 500M parameters. Text and image understanding in a package that can run on edge devices. For mobile applications, IoT dashboards, and vision tasks where latency matters, this model delivers.</p><p>- <strong>LFM2-VL-3B</strong> : The larger vision specialist at 3 billion parameters. Edge-optimized but capable of meaningful image understanding, document parsing, and multimodal agent workflows. This is the vision model for applications that need real image comprehension but can&#8217;t afford cloud round-trips.</p><h4>The LFM2.5 Series - Scaled Intelligence</h4><p>The LFM2.5 series marks Liquid AI&#8217;s next evolution, with models pretrained on 28T tokens using a scaled reinforcement learning pipeline. The quality jump is noticeable across the board.</p><p><strong>Text Models</strong></p><p>- <strong>LFM2.5-1.2B-Base</strong> : The base model for the 2.5 series. 28T tokens of pretraining gives this 1.2B model a quality floor well above its weight class. For developers who need a reliable base to fine-tune, this is a strong starting point.</p><p>- <strong>LFM2.5-1.2B-Instruct</strong> : The instruction-tuned variant, optimized for agentic tasks and reliable instruction following. If you&#8217;re building local assistants, this model delivers the follow-instructions behavior you&#8217;d normally need a 7B+ model for.</p><p>- <strong>LFM2.5-1.2B-Thinking</strong> : The reasoning variant enables on-device reasoning under 1GB. Yes, a thinking/reasoning model that fits in less than 1GB of memory. For math-heavy applications where you want visible step-by-step reasoning on embedded hardware, this is a fine achievement.</p><p>- <strong>LFM2.5-350M</strong> : The smallest LFM2.5 model. Liquid AI&#8217;s &#8220;no size left behind&#8221; philosophy means even the smallest model gets the full treatment. This isn&#8217;t a neglected also-ran, it&#8217;s a first-class citizen in the family.</p><p><strong>Vision-Language Models</strong></p><p>- <strong>LFM2.5-VL-1.6B</strong> : Production-ready multimodal agents on any device. 1.6B parameters handling text and images together, built for the kind of edge deployment that other vision models can&#8217;t achieve.</p><p>- <strong>LFM2.5-VL-450M</strong> : The compact vision option for the 2.5 series. Structured visual intelligence at the edge, with the same architectural benefits as the rest of the Liquid lineup.</p><p><strong>Audio Model</strong></p><p>- <strong>LFM2.5-Audio-1.5B</strong> : End-to-end speech and text generation. 1.5B parameters for low-latency, high-quality conversations.
<h4>Task-Specific Nano Models</h4><p>Liquid AI also offers a collection of specialized nano models optimized for specific workloads:</p><p>- <strong>LFM2-350M-Extract</strong> : Data extraction - pull structured info from unstructured text</p><p>- <strong>LFM2-1.2B-Extract</strong> : Enhanced extraction for more complex data-pulling tasks</p><p>- <strong>LFM2-350M-Math</strong> : Mathematical reasoning at the smallest possible footprint</p><p>- <strong>LFM2-1.2B-RAG</strong> : Retrieval-augmented generation - purpose-built for RAG workloads</p><p>- <strong>LFM2-ColBERT-350M</strong> : Unified embedding model - &#8220;one model to embed them all&#8221;</p><p>- <strong>LFM2-1.2B-Tool</strong> : Tool-calling capabilities in a compact package</p><p>- <strong>LFM2-2.6B-Transcript</strong> : Transcription tasks</p><p>- <strong>LFM2-350M-ENJP-MT</strong> : English-Japanese translation</p><p>- <strong>LFM2-350M-PII-Extract-JP</strong> : Japanese PII extraction</p><p>The ColBERT embedding model deserves special mention : Liquid AI&#8217;s approach to unified embeddings means you might be able to replace three or four separate embedding models with one. For production systems where embedding quality matters, this is worth benchmarking against separate embedding + retrieval pipelines.</p>
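<p>Late interaction is what makes the ColBERT design different from single-vector embeddings: queries and documents keep one vector per token, and scoring takes the best document match for each query token. A toy sketch of that MaxSim rule, the general ColBERT scoring operator rather than Liquid&#8217;s specific implementation :</p><pre><code>import numpy as np

def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token, take the
    best-matching document token, then sum those maxima."""
    # query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim)
    # Normalize so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                 # (num_query, num_doc) token similarities
    return sims.max(axis=1).sum()  # best doc token per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(5, 128))  # 5 query tokens, 128-dim embeddings
doc_a = rng.normal(size=(40, 128))
# doc_b contains near-copies of the query tokens, so it should win
doc_b = np.vstack([query + 0.05 * rng.normal(size=query.shape), doc_a])
print(maxsim(query, doc_a), maxsim(query, doc_b))
</code></pre>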
<h4>How Liquid Compares</h4><p>A meaningful comparison against the other models in this guide requires context. Liquid&#8217;s architecture is fundamentally different from the Transformer-based models that dominate this article. This isn&#8217;t a direct competitor to Qwen3.6-27B or Gemma 4 31B in raw benchmark terms. Instead, Liquid models compete on different ground :</p><p><strong>Against Qwen3.5 9B / Llama 3.1 8B</strong> : The LFM2-8B-A1B offers comparable quality with MoE efficiency. 1B active params versus 8B dense. For on-device deployment where active parameter count directly maps to latency, Liquid&#8217;s architecture advantage is real.</p><p><strong>Against Phi-4 Mini Reasoning</strong> : The LFM2.5-1.2B-Thinking at under 1GB is the direct competitor to Phi-4 Mini Reasoning for on-device math reasoning. Liquid&#8217;s dynamical systems approach may offer advantages in multi-step reasoning coherence.</p><p><strong>Against Gemma 4 E4B / E2B</strong> : LFM2-VL-450M and LFM2-VL-3B offer comparable vision capabilities with Liquid&#8217;s architectural benefits: millisecond latency, true on-device execution, NPU optimization.</p><p><strong>The architectural differentiation</strong> : Where Transformer models scale poorly beyond their training context length, Liquid Neural Networks handle continuous inputs more naturally. For real-time applications, robotics, time-series analysis, or any task where inputs evolve over time, Liquid&#8217;s architecture offers fundamental advantages that benchmark comparisons don&#8217;t capture.</p><p><strong>The enterprise perspective</strong> : Liquid AI&#8217;s LEAP platform enables customization and fine-tuning within enterprise firewalls. For organizations that need proprietary models but lack the infrastructure to train from scratch, this is an interesting differentiator.</p><p>The benchmark table tells part of the story (LFM2-350M at 43.43% MMLU, LFM2-1.2B at 55.23% MMLU), but Liquid&#8217;s real value proposition is architectural : models that adapt in real time, deploy anywhere, and prioritize latency in ways that Transformer-based models fundamentally cannot. If your use case fits that profile, Liquid Foundation Models are worth serious evaluation.</p><h4>Where Liquid Falls Short</h4><p>The marketing around Liquid Foundation Models is compelling, but the full picture includes real limitations that matter depending on your use case:</p><p><strong>Benchmark gaps on core tasks</strong> : Liquid AI&#8217;s own documentation concedes that LFMs currently struggle with zero-shot code tasks, precise numerical calculations, and tasks that require counting (famously, counting the letter &#8216;r&#8217; in &#8220;strawberry&#8221;). For coding agents, math-heavy workloads, or anything requiring precise arithmetic, the Transformer-based models in this guide (Qwen, Gemma, DeepSeek-R1) will consistently outperform Liquid at comparable model sizes.</p><p><strong>Retrieval-intensive task limitations</strong> : The LFM2 technical report explicitly acknowledges that models with linear attention and state-space operators have &#8220;limitations in retrieval-intensive tasks.&#8221; Tasks like associative recall (looking up a value given a key from earlier in the context) are fundamental weaknesses of RNN-style architectures versus Transformers. If your application involves querying information across long contexts, Liquid&#8217;s architecture is at a structural disadvantage.</p><p><strong>Weaker instruction following than competitors</strong> : The benchmark numbers don&#8217;t lie &#8212; LFM2-1.2B scores 74.89% on IFEval (instruction following) while Qwen3.5 2B scores higher despite having a similar footprint. On agentic tasks that require reliable tool use, multi-step reasoning, and strict adherence to instructions, Liquid models trail the Transformer-based competition.</p><p><strong>Training and optimization complexity</strong> : Liquid Neural Networks introduce additional complexity that the broader ecosystem hasn&#8217;t fully caught up with. Training LNNs involves Backpropagation Through Time (BPTT), gradient stability concerns (vanishing/exploding gradients in continuous-time dynamics), and ODE solver overhead, especially for the original LTC formulations. While CfC (Closed-form Continuous-time) models address the speed bottleneck, the tooling and operational expertise required are still significantly greater than for standard Transformer models.</p><p><strong>Ecosystem immaturity</strong> : The sheer breadth of tooling, quantized variants, fine-tuned derivatives, and community support that exists for Qwen, Llama, and Gemma doesn&#8217;t yet exist for Liquid. If you hit a problem, you&#8217;re more likely to be in uncharted territory. The &#8220;new programming paradigm around working with operators, blocks, and backbones&#8221; that Liquid requires is genuinely different; there&#8217;s a learning curve that the established model families don&#8217;t impose.</p><p><strong>Scaling ceiling</strong> : While individual Liquid models are parameter-efficient, the architecture faces open questions about scaling to extremely large model sizes.
Research notes that &#8220;scaling liquid neural networks to very large and high-dimensional state spaces remains open,&#8221; and the sequential nature of ODE solving limits parallelization in ways that Transformers don&#8217;t face.</p><p><strong>Less mature RLHF and preference optimization</strong> : Liquid AI notes that &#8220;human preference optimization techniques have not been applied extensively to our models yet.&#8221; The alignment techniques (RLHF, DPO, constitutional AI) that make Transformer models feel truly helpful and safe are less developed in the Liquid lineup. This shows up in instruction following and general helpfulness benchmarks.</p><p><strong>Noise resilience concerns</strong> : Standard LNNs may produce overly confident predictions in noisy environments due to a lack of inherent uncertainty mechanisms. Research into uncertainty-aware variants aims to fix this, but it&#8217;s a known gap in the current production models.</p><p>The bottom line: Liquid Foundation Models excel at edge deployment, latency-sensitive applications, and real-time adaptive tasks. But for general-purpose code generation, math reasoning, and retrieval-heavy workloads (the tasks that dominate most local AI use cases) the Transformer-based models in this article deliver better results out of the box.</p><div><hr></div><p></p><h2>The Road Ahead</h2><p>The local LLM landscape in 2026 is genuinely remarkable. What once required datacenter resources now fits in your workstation, and increasingly in your laptop bag. The combination of MoE architectures, improved quantization techniques, and hybrid attention mechanisms means the gap between &#8220;local&#8221; and &#8220;cloud&#8221; performance is narrower than ever.</p><p>Whether you&#8217;re running a coding agent, doing research on massive document collections, building a local assistant, or just want AI that respects your privacy, there&#8217;s never been a better time to go local.</p><p>Your RAM has been waiting for this moment. Time to put it to work.</p><div><hr></div><p></p><p><em>Models discussed in this article : Qwen3.6-27B, Qwen3.6-35B-A3B, Llama 3.3 70B, Nemotron Super 49B, Gemma 4 31B, Kimi-Linear-48B-A3B, Qwen3-30B-A3B-Thinking-2507, Qwen3-Coder 30B-A3B, Qwen3-VL-32B, Qwen3.5 27B, Gemma 4 26B A4B, DeepSeek-R1 Distill 32B, Mistral Small 24B, Qwen3.5 9B, Llama 3.1 8B, Qwen3.5 9B (16GB), DeepSeek-R1 7B, Qwen2.5 Coder 7B, Phi-4 Mini Reasoning, Gemma 4 E4B, Phi-3.5 Mini, Qwen3.5 2B, Qwen3.5 0.8B, Gemma 4 E2B-it, LFM2-350M, LFM2-700M, LFM2-8B-A1B, LFM2-24B-A2B, LFM2-VL-450M, LFM2-VL-3B, LFM2.5-1.2B-Base, LFM2.5-1.2B-Instruct, LFM2.5-1.2B-Thinking, LFM2.5-350M, LFM2.5-VL-1.6B, LFM2.5-VL-450M, LFM2.5-Audio-1.5B, and Liquid nano models (Extract, Math, RAG, Tool, ColBERT, Transcript, MT).</em></p>]]></content:encoded></item><item><title><![CDATA[Agent Systems Design Space: Architecture, Competition, and the Horizon Ahead]]></title><description><![CDATA[A summary and analysis of the Claude Code Architecture study (arXiv:2604.14228v1)]]></description><link>https://blog.aleph-tech.com/p/agent-systems-design-space-architecture</link><guid isPermaLink="false">https://blog.aleph-tech.com/p/agent-systems-design-space-architecture</guid><dc:creator><![CDATA[Alexis Gil Gonzales]]></dc:creator><pubDate>Mon, 27 Apr 2026 14:40:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8mEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300cd0fc-d357-4853-bcd3-0d6425ddd869_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><h2>Design Space Analysis - Five Values, Thirteen Principles</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!8mEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300cd0fc-d357-4853-bcd3-0d6425ddd869_2752x1536.png" width="1456" height="813" alt=""></figure></div><p></p>
<p>The paper&#8217;s starting point is honest: architecture is values made concrete. Before describing a single component, the authors identify five human values that drive the design of an AI coding agent:</p><ol><li><p><strong>Human Decision Authority</strong> : the human remains the point of control, even when the agent could plausibly proceed autonomously.</p></li><li><p><strong>Safety, Security, and Privacy</strong> : system design treats these not as features to add later but as structural constraints.</p></li><li><p><strong>Reliable Execution</strong> : agents must produce consistent, repeatable outcomes rather than fluky brilliance followed by spectacular failures.</p></li><li><p><strong>Capability Amplification</strong> : the system exists to make the human more capable, not to demonstrate the model&#8217;s capabilities.</p></li><li><p><strong>Contextual Adaptability</strong> : the agent must gracefully handle radically different contexts, from a two-person startup to a regulated enterprise.</p></li></ol><p>These five values translate, somewhat heroically, into thirteen design principles. The most interesting is &#8220;deny-first permission evaluation&#8221; : the system assumes any action requires explicit permission until proven otherwise. This is architecturally unusual. Most agent frameworks adopt an open-by-default model where tools are available unless explicitly restricted. Claude Code inverts this, treating permission as a first-class architectural concern rather than a later addition.</p><p>The permission system itself is notably sophisticated: seven distinct modes (plan, default, auto, dontAsk, bypassPermissions, bubble for subagent escalation, and acceptEdits), backed by an ML-based classifier. The classifier evaluates each action against the permission context and decides whether to surface a prompt to the user. This is not a hard-coded rules engine; it is learned from real usage patterns.</p>
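<p>To make &#8220;deny-first&#8221; concrete, here is the shape such an evaluator could take. The mode names are the paper&#8217;s; the rule format, function names, and classifier stub are illustrative, not Claude Code&#8217;s actual implementation :</p><pre><code># Illustrative deny-first permission check. Mode names are from the
# paper; the rule structure and classifier stub are assumptions.
from enum import Enum

class Mode(Enum):
    PLAN = "plan"; DEFAULT = "default"; AUTO = "auto"
    DONT_ASK = "dontAsk"; BYPASS = "bypassPermissions"
    BUBBLE = "bubble"; ACCEPT_EDITS = "acceptEdits"

ALLOW_RULES = {("read_file", "*"), ("edit_file", "src/")}  # explicit grants

def classifier_says_prompt(action, arg):
    """Stand-in for the ML classifier that decides whether to surface a
    prompt to the user. Here: always escalate writes outside src/."""
    return action == "edit_file" and not arg.startswith("src/")

def evaluate(mode, action, arg):
    if mode is Mode.BYPASS:
        return "allow"
    # Deny-first: an action is denied unless a rule explicitly covers it.
    allowed = any(a == action and (p == "*" or arg.startswith(p))
                  for a, p in ALLOW_RULES)
    if not allowed:
        return "deny"
    return "prompt_user" if classifier_says_prompt(action, arg) else "allow"

print(evaluate(Mode.DEFAULT, "edit_file", "src/main.py"))  # allow
print(evaluate(Mode.DEFAULT, "run_shell", "rm -rf /"))     # deny (no rule)
</code></pre>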
<p>The five-layer context compaction pipeline is the other substantial architectural contribution. Rather than treating context as cheap and abundant, the design treats it as a scarce resource to be managed actively. The pipeline has five stages (budget reduction, snip, microcompact, context collapse, and auto-compact), each representing a progressively more aggressive form of context reduction. This acknowledges the uncomfortable reality that context windows are finite, expensive, and degrade as you approach their limits.</p>
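<p>The stage names below are the paper&#8217;s; the trigger thresholds and what each stage actually drops are guesses at the general shape, not the real pipeline :</p><pre><code># Sketch of progressive context compaction. The five stage names come
# from the paper; thresholds and behaviors are illustrative.

def budget_reduction(msgs):  # stage 1: cap any single message generously
    return [m[:8000] for m in msgs]

def snip(msgs):              # stage 2: clip oversized messages harder
    return [m[:2000] for m in msgs]

def microcompact(msgs):      # stage 3: drop low-value tool output
    return [m for m in msgs if not m.startswith("[tool]")]

def context_collapse(msgs):  # stage 4: keep only the recent tail
    return msgs[-20:]

def auto_compact(msgs):      # stage 5: summarize everything older
    return ["[summary of earlier conversation]"] + msgs[-5:]

STAGES = [(0.70, budget_reduction), (0.80, snip), (0.85, microcompact),
          (0.92, context_collapse), (0.97, auto_compact)]

def compact(msgs, used_fraction):
    """Apply each stage whose trigger the context has crossed,
    least to most aggressive."""
    for threshold, stage in STAGES:
        if used_fraction &gt;= threshold:
            msgs = stage(msgs)
    return msgs
</code></pre>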
<p>The architecture also implements four extensibility mechanisms (MCP servers, plugins, skills, and hooks), each with different context costs and capability profiles. This is a thoughtful recognition that users will inevitably want to extend the system, and the question is not whether to enable extension but how to make it survivable from a context standpoint.</p><p>A striking finding : only about 1.6% of the codebase handles actual AI decision-making. The remaining 98.4% is <em>operational infrastructure</em>. This should give pause to anyone who has spent the last two years arguing about which foundation model to use. The scaffold matters more than the model. More on this later.</p><div><hr></div><p></p><h2>Architecture Comparison with OpenClaw - Same Question, Different Answers</h2><p>The comparison between Claude Code and OpenClaw is the paper&#8217;s most intellectually satisfying section, precisely because it resists the temptation to declare a winner. Instead, it reveals how identical design questions produce contextually appropriate answers.</p><p>Both systems are AI coding agents. Both must handle the same fundamental challenges : how to evaluate permissions, how to manage context, how to expose extensibility. And yet there are a few notable differences :</p><p><strong>Permission evaluation</strong> : Claude Code performs per-action safety evaluation within a CLI loop. OpenClaw employs perimeter-level access control within a gateway control plane. These are not merely different implementations; they reflect different deployment philosophies. Claude Code assumes a <em>single user</em> at a terminal, making fast per-action decisions tractable. OpenClaw assumes a <em>multi-user</em> gateway context where perimeter control is more efficient than per-action prompting. Neither is universally right.</p><p><strong>Context management</strong> : Claude Code&#8217;s five-layer compaction pipeline has a direct analogue in OpenClaw&#8217;s gateway-wide capability registration. The mechanisms differ but the underlying problem is identical: context is expensive, and you need a strategy for managing it before it runs out. The architectural difference reflects deployment context: CLI agents see context as a per-session problem; gateway agents see it as a system-wide resource to be allocated across many concurrent sessions.</p><p><strong>Extensibility</strong> : Claude Code&#8217;s four-tier extensibility model (MCP servers, plugins, skills, hooks) has a rough parallel in OpenClaw&#8217;s gateway extension mechanisms. The OpenClaw architecture appears to lean more heavily on gateway-level registration, whereas Claude Code distributes extensibility across different cost profiles.</p><p>The deeper point is one the paper makes quietly but clearly : the design space for agent systems is context-dependent in ways that make direct comparison philosophically suspect. Claude Code is designed for a developer sitting at a terminal who wants a capable, safe, fast coding partner. OpenClaw is designed for organizations that need to deploy agents at scale behind a gateway with consistent governance. These are genuinely different problems, and the architectures reflect those differences. A framework that declares one superior to the other without specifying the deployment context is not making a scientific claim : it is making noise.</p><p>This is a welcome corrective to the benchmark-driven discourse that dominates agent system evaluation. SWE-bench scores, HumanEval results, and similar metrics are useful signals but they are not architecture evaluations. The paper&#8217;s implicit argument (that architecture choices are too important to leave to leaderboard position comparisons) is well taken.</p><div><hr></div><p></p><h2>Open Directions - Six Bets on the Future of Agent Systems</h2><p>The paper identifies six open directions, each representing a real gap between current capability and what the field needs.
What follows is an analysis of each direction, informed by current research, with feasibility assessments and cost-benefit evaluations.</p><h4>Direction 1: Bridging the Observability-Evaluation Gap</h4><p><strong>Feasibility: 7/10 | Timeline: 18&#8211;36 months | Impact: High</strong></p><p>The field has a fractured relationship with understanding what agents actually do. &#8220;Observability&#8221; and &#8220;evaluation&#8221; are treated as a single problem but they are distinct : <em>observability</em> is about understanding what happened (trace, log, record), while <em>evaluation</em> is about determining whether what happened was correct (judge, score, assess). Production agent systems need both simultaneously, but most tooling solves one or the other.</p><p>Current state: fragmented. <strong>AgentTrace</strong> offers structured logging taxonomies. <strong>HAL</strong> (the agent harness analysis project) has produced the uncomfortable finding that scaffold design explains more variance in agent performance than model choice; yet all major leaderboards compare models. SWE-EVO has further destabilized the field by demonstrating benchmark instability : even frontier models solve 19&#8211;21% of problems that a simplified benchmark assigns 65% to, depending on evaluation conditions.</p><p>The business model problem is the real blocker. Observability and evaluation infrastructure is expensive to build and maintain, and no company has yet found a compelling revenue model for selling it as a standalone product. It tends to get built as part of a broader platform (Databricks, Azure, AWS all have nascent offerings) but the depth of evaluation tooling available for, say, database query optimization does not yet exist for agent systems.</p><p><strong>Cost-benefit</strong> : High. Organizations running agents in production without observability and evaluation infrastructure are flying blind. The cost of building this infrastructure is real but the cost of operating without it (failed agents, undetected failures, expensive hallucination cycles) is rapidly becoming the larger line item.</p><div><hr></div><p></p><h4>Direction 2: Cross-Session Persistence</h4><p><strong>Feasibility: 8/10 | Timeline: 12&#8211;24 months | Impact: Medium-High</strong></p><p>The ambition here is modest but real : agents should remember what happened in previous sessions so that subsequent sessions are not forced to start from zero. The research has converged on a three-tier taxonomy : episodic memory (what happened in this session), semantic memory (what did I learn from this), and procedural memory (how do I do this type of task).</p>
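<p>The taxonomy maps naturally onto a small storage interface. A sketch of the shape (tier names from the research above; the class, its methods, and the toy substring retrieval are illustrative, not any of the cited systems) :</p><pre><code># Three-tier agent memory, one tier tag per record. Tier names follow
# the episodic/semantic/procedural taxonomy; the API is illustrative.
import json, time

class MemoryStore:
    def __init__(self, path):
        self.path, self.items = path, []

    def write(self, tier, content, session_id):
        self.items.append({
            "tier": tier,        # "episodic" | "semantic" | "procedural"
            "content": content,
            "session": session_id,
            "ts": time.time(),
        })

    def recall(self, tier, query):
        # Toy retrieval: substring match. A real system would use
        # embeddings plus contextualized retrieval here.
        return [m for m in self.items
                if m["tier"] == tier and query.lower() in m["content"].lower()]

    def persist(self):
        with open(self.path, "w") as f:
            json.dump(self.items, f)

mem = MemoryStore("agent_memory.json")
mem.write("episodic", "Ran test suite; 3 failures in auth module", "s1")
mem.write("procedural", "To fix auth tests: regenerate fixtures first", "s1")
print(mem.recall("procedural", "auth"))  # available to the next session
</code></pre>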
<p>Key implementations are already in the field. <strong>MemGPT</strong> pioneered the distinction between core memory and archival memory, treating them as different storage tiers with different retrieval costs. <strong>Springdrift</strong> demonstrated continuous persistent agents as supervised processes : agents that run as long-lived processes rather than session-scoped invocations. <strong>MemMachine</strong> achieved 93% and 92% on multi-hop retrieval benchmarks using contextualized retrieval.</p><p>The core architecture question is largely solved. The remaining engineering problems (schema versioning across memory updates, efficient state restoration, privacy and selective forgetting) are tractable rather than fundamental. The bigger risk is that the problem becomes economically irrelevant : as context windows expand (1M token contexts are now standard, 10M is on the roadmap), the pressure to externalize memory weakens. External memory is valuable primarily because context windows are constrained. If constraints ease, the problem shrinks.</p><p><strong>Cost-benefit</strong> : Positive near-term. Cross-session persistence is achievable with current engineering and delivers meaningful user experience improvements. The risk is that it becomes a transitional technology rendered obsolete by context window expansion &#8212; but &#8220;transitional&#8221; should not be confused with &#8220;unworthwhile.&#8221;</p><div><hr></div><p></p><h4>Direction 3: Evolving Harness Boundaries</h4><p><strong>Feasibility: 7/10 | Timeline: 12&#8211;24 months | Impact: High</strong></p><p>This is where the paper&#8217;s earlier finding (98.4% of the codebase is scaffolding) becomes a research agenda. If scaffolds explain more variance than models, we need to understand scaffolds systematically rather than empirically.</p><p><strong>SWE-agent</strong> is the most compelling data point. The entire SWE-agent implementation is roughly 100 lines of Python code. It achieves &gt;74% on SWE-bench, outperforming systems with vastly more complex scaffolding. <strong>Live-SWE-agent</strong> extends this with a self-evolving runtime : at 79.2% with Claude Opus 4.5, it is competitive with systems that consume an order of magnitude more infrastructure. The implication is uncomfortable for anyone who has invested heavily in complex harness design : <strong>simple harnesses can outperform complex ones, and we do not fully understand why</strong>.</p><p>HAL&#8217;s analysis confirms the broader pattern: scaffold choices dramatically impact both accuracy AND cost, yet comparisons across scaffolds are rare in the literature. The field is empirically driven in a domain where empirical results are notoriously fragile to benchmark-specific noise.</p><p><strong>Cost-benefit</strong> : Potentially very high. Understanding harness design systematically could unlock more performance improvement per dollar than switching foundation models, and it would be available to everyone regardless of which model&#8217;s API they use. The cost is primarily research time and the risk is that the field continues treating this as an engineering problem rather than a research one.</p>
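<p>The minimal-harness result is easier to appreciate with the skeleton written out. This is the generic observe-act loop such harnesses share; the <code>llm</code> callable and helper names are placeholders, and this is the shape of the idea, not SWE-agent&#8217;s actual source :</p><pre><code># The generic skeleton of a minimal coding harness: a loop that shows
# the model the last observation and executes what it proposes.
import subprocess

def run(cmd):
    r = subprocess.run(cmd, shell=True, capture_output=True,
                       text=True, timeout=60)
    return (r.stdout + r.stderr)[-4000:]  # truncate: context is the budget

def solve(task, llm, max_steps=30):
    """llm: any callable mapping a transcript string to one shell command,
    or the literal string 'submit' when it believes the task is done."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = llm("\n".join(history))
        if action.strip() == "submit":
            break
        observation = run(action)
        history += [f"$ {action}", observation]
    return history
</code></pre><p>Everything else in a heavyweight harness (planners, routers, tool registries) is a bet that more structure beats the model&#8217;s own judgment, and the data above suggests that bet often loses.</p>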
<div><hr></div><p></p><h4>Direction 4: Scaling to Scientific Programs</h4><p><strong>Feasibility: 4/10 | Timeline: 5+ years | Impact: Very High</strong></p><p>This is the most ambitious and the most sobering direction in the set. The vision is agents that can conduct full scientific research programs : not just assist with literature review or draft papers, but formulate hypotheses, design experiments, implement them, iterate on failed implementations, and produce validated scientific findings.</p><p>Current state : not close. A systematic bioRxiv study evaluated eight AI agent frameworks on autonomous scientific research tasks and found that none completed a full research cycle. All produced hallucinations. All failed at robust implementation. The problems are not incremental; they represent a fundamental gap between &#8220;useful coding assistant&#8221; and &#8220;autonomous scientist.&#8221;</p><p>The core issue is factual grounding in knowledge-intensive domains. A coding agent hallucinating a function call is annoying. A scientific agent hallucinating a molecular mechanism or a statistical relationship is dangerous. Scientific knowledge has a higher truth bar than code : the domain tolerates far less error, and the consequences of error are more severe.</p><p>Benchmarks do not help here. Current benchmarks measure isolated task performance on well-defined problems : precisely the conditions that do not hold in actual scientific research. Real science is open-ended, iterative, and requires judgment calls about which results to pursue and which to abandon. Benchmarking this requires benchmarks that do not yet exist.</p><p><strong>Cost-benefit</strong> : The potential payoff is transformative : autonomous scientific discovery at scale would be one of the most significant technological developments in human history. The cost is also transformative : this requires fundamental research advances, not engineering improvements. The risk-adjusted expected value is positive but the variance is enormous. This is a long-term bet appropriate for well-capitalized research organizations, not production engineering teams.</p><div><hr></div><p></p><h4>Direction 5: Governance at Scale</h4><p><strong>Feasibility: 6/10 | Timeline: 18&#8211;36 months | Impact: High</strong></p><p>When organizations deploy a single agent for a single task, governance is tractable. When they deploy hundreds or thousands of agents performing heterogeneous tasks across departments and jurisdictions, the governance problem becomes genuinely hard. Who is accountable? What are the constraints? How do you enforce constraints when the agent&#8217;s action space is large and dynamic?</p><p>Current state : early but accelerating. <strong>AI Gateway</strong> (Databricks), <strong>Institutional AI</strong> (enforceable constraints via Oracle/Controller patterns), and <strong>MI9</strong> (six coordinated runtime mechanisms) represent different approaches to multi-agent governance. <strong>GaaS</strong> (Governance as a Service) explores black-box governance that operates without requiring model cooperation, an important distinction, since not all agents will be cooperative participants in governance frameworks.</p><p>The regulatory pressure is now real. The <strong>EU AI Act</strong> becomes enforceable in August 2026. The Colorado AI Act takes effect in July 2026. Organizations deploying agent systems at scale will need documented governance frameworks not as a best practice but as a legal obligation. This is no longer an academic concern.</p><p>The cost of governance infrastructure is non-trivial : estimates suggest it adds 20&#8211;50% to orchestration budgets for large enterprises, which translates to $8&#8211;15M annually for organizations at scale. This is a meaningful line item that will command the attention of CFOs and CTOs alike.</p><p><strong>Cost-benefit</strong> : Strong near-term. Regulatory pressure makes governance infrastructure non-optional for organizations operating in EU and US jurisdictions. The cost is high but the cost of non-compliance is higher. This direction is less about technological research and more about engineering implementation of known patterns.</p><div><hr></div><p></p><h4>Direction 6: Preserving Long-Term Human Capability Alongside Short-Term Amplification</h4><p><strong>Feasibility: 5/10 | Timeline: 3&#8211;5 years | Impact: Very High</strong></p><p>This direction is the most underappreciated in technical circles and the most strategically consequential in the long run.
McKinsey estimates $2.9 trillion in annual US economic value from AI augmentation. The question is not whether AI can amplify human capability (it demonstrably can) but whether it can do so while preserving the long-term capability of the humans it augments.</p><p>The concern is not abstract. A physician who delegates all diagnostic reasoning to an AI system may become dramatically more productive in the short term and dramatically <em>less capable in the long term</em>. An engineer who uses AI to write all their code may ship more features in the short term and <em>lose the ability to reason about system architecture in the long term</em>. This is not science fiction : it is a well-documented pattern in tool-use research.</p><p>The <strong>WORKBank</strong> study of 1,500 workers across 844 tasks found diverse Human Agency Scale profiles : different people respond differently to augmentation, and the factors that predict capability preservation versus capability atrophy are not yet well understood. Early signals suggest skills shift from information-focused to interpersonal as AI handles more information processing, which may be a positive adaptation or may be an erosion of certain cognitive muscles, depending on your perspective.</p><p>The most interesting data point may be Andrej Karpathy&#8217;s public statements about his own workflow: roughly 16 hours per day expressing intent to AI systems and delegating execution. This is a new mode of human-AI interaction that has no historical precedent, and its long-term effects on human capability are unknown.</p><p>The key insight from early research : augmentation (preserving institutional knowledge + eliminating routine work) may generate 2&#8211;4x more value than replacement in knowledge-intensive roles. Organizations that understand this will invest in capability-preserving augmentation architectures; organizations that optimize purely for short-term productivity will extract value until the humans they depend on can no longer provide it.</p><p><strong>Cost-benefit</strong> : Hard to quantify but potentially the most important direction in this list. Unlike the others, it is not primarily a technology problem : it is a human factors and organizational design problem. The technical work is relatively tractable; the harder work is economic incentive alignment and measurement.</p><div><hr></div><p></p><h2>Where to Place Your Bets</h2><p>The six directions form something of a causal chain. <em>Observability</em> and <em>evaluation</em> are prerequisites for everything else : you cannot improve what you cannot measure (and here I recall an old Dutch saying : <em>meten is weten</em>, &#8220;to measure is to know&#8221;), and you cannot govern what you cannot observe. Cross-session persistence enables long-horizon tasks but raises the governance stakes. Scientific programs represent the extreme edge case where all of the above are simultaneously required at maximum intensity. Governance becomes non-optional at scale, and the question of human capability preservation is ultimately the question of whether the whole endeavor serves human flourishing or merely human productivity.</p><p><strong>Most actionable near-term</strong> : Directions 2 (cross-session persistence) and 3 (harness boundaries). These are primarily engineering problems with working implementations and clear user value.
Direction 1 (observability) is also engineering but lacks a sustainable business model for standalone tooling, which slows adoption.</p><p><strong>Highest risk</strong> : Direction 4 (scientific programs). Current systems cannot complete full research cycles and hallucinate in ways that are dangerous in scientific contexts. This is not an engineering problem : it requires fundamental advances in factual grounding and reasoning under uncertainty.</p><p><strong>Most strategically undervalued</strong> : Direction 6 (human capability). Almost no technical research attention despite being the difference between AI amplifying human capability and making humans obsolete. Organizations that solve this first will have a durable advantage that cannot be replicated by better models alone.</p><p>The framing of &#8220;open directions&#8221; is appropriate : these are truly open, meaning both that the problems are unsolved and that the solutions, when found, will likely look different from what we currently imagine. The paper deserves credit for identifying real gaps rather than invented ones. The agent systems field has no shortage of impressive demos and a real shortage of honest accounting of what remains unsolved. This paper is a great contribution to the latter category.</p><p><strong>This analysis is based on arXiv:2604.14228v1 and current research as of April 2026</strong>.</p>]]></content:encoded></item><item><title><![CDATA[Autogenesis - When AI Agents Learn to Improve Themselves ]]></title><description><![CDATA[(Without Asking Permission)]]></description><link>https://blog.aleph-tech.com/p/autogenesis-when-ai-agents-learn</link><guid isPermaLink="false">https://blog.aleph-tech.com/p/autogenesis-when-ai-agents-learn</guid><dc:creator><![CDATA[Alexis Gil Gonzales]]></dc:creator><pubDate>Sun, 19 Apr 2026 12:31:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-n-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1f5868-8a60-41c4-9644-a6f97436f070_3272x1530.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is something remarkable about a research paper that proposes to solve a problem by letting AI agents evolve themselves (and manages to do so without once using the phrase &#8220;Skynet becomes self-aware&#8221;). Wentao Zhang&#8217;s <strong>Autogenesis: A Self-Evolving Agent Protocol</strong> does exactly that, threading the needle between ambitious and pragmatic in a way that feels increasingly rare in a field notorious for both.</p><p>The timing is not coincidental.
As large language model-based agent systems tackle increasingly complex, multi-step tasks (writing and testing code, conducting research, coordinating with other systems), the infrastructure holding these systems together has begun to show cracks. We have built impressive robots, but given them clunky instruction manuals written in a language they were never quite designed to follow.</p><p>Autogenesis is Zhang&#8217;s attempt to give those robots not just better instructions, but the ability to rewrite their own.</p><h2>The Cracks in the Foundation</h2><p>The paper opens with an observation that anyone who has built production agent systems will find familiar: current protocols are... let&#8217;s say <em>aspirational</em>.</p><p>Frameworks like A2A (Agent-to-Agent) and MCP (Model Context Protocol) have become standard scaffolding for agentic AI systems. They define how agents communicate, how tools are invoked, how context is managed. What they conspicuously fail to define, Zhang argues, is how agents should <em>change</em> over time : how they should track versions of themselves, manage the lifecycle of resources (prompts, tools, memory), and safely update without breaking the entire system.</p><p>The result, he writes, is that agent compositions tend toward what engineers diplomatically call &#8220;monolithic compositions and brittle glue code.&#8221; Less diplomatically: the kind of spaghetti that makes future maintainers weep into their keyboards at 2am.</p><p>This is a real problem. As agent systems grow more sophisticated (coordinating across multiple entities, maintaining long-horizon context, invoking tools dynamically), the lack of explicit evolution interfaces becomes not just an inconvenience but a fundamental bottleneck. You can build impressive individual agents, but evolution-safe updating remains an afterthought at best.</p><p>Zhang&#8217;s diagnosis is crisp: <em>existing protocols underspecify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces.</em></p><p>He&#8217;s not wrong. The agent protocols we have are excellent at describing what agents should <em>do</em>, and mediocre at describing how they should <em>grow</em>.</p><h2>Autogenesis Protocol: Two Layers, One Elegant Idea</h2><p>The core contribution is the Autogenesis Protocol (AGP), which rests on a deceptively simple insight: separate the <em>what</em> of evolution from the <em>how</em>.</p><p><strong>What</strong> evolves? Everything. Prompts, agents, tools, environments, memory. Zhang models all of these as protocol-registered resources with explicit state, lifecycle, and versioned interfaces. Think of it as a kind of taxonomy for the components of an agent system, where each component knows not just what it does but where it came from and how it changes.</p><p><strong>How</strong> does evolution occur? Through a closed-loop operator interface : propose an improvement, assess whether it actually works, commit if it does, roll back if it doesn&#8217;t.</p><p>This is the Self Evolution Protocol Layer (SEPL), and it is arguably the more interesting contribution. SEPL specifies an auditable mechanism for agent self-improvement: agents can propose modifications to themselves or other agents, those proposals get evaluated against actual performance metrics, and only verified improvements get committed. Everything is tracked. Everything is revertable. No agent wakes up one morning having inexplicably improved itself, with no record of how.</p>
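<p>The propose-assess-commit-rollback cycle is compact enough to state in code. A sketch of one audited evolution step; the operator names mirror the paper&#8217;s description, while the resource format and the toy metric are mine, not AGP&#8217;s actual interfaces :</p><pre><code># Closed-loop self-evolution as SEPL describes it: propose a change,
# assess it against a metric, commit only verified improvements.
# Operator names mirror the paper; details are illustrative.

def evolve(resource, propose, evaluate, log):
    """One audited evolution step over a versioned resource."""
    baseline = evaluate(resource["value"])
    candidate = propose(resource["value"])      # e.g. a rewritten prompt
    score = evaluate(candidate)
    entry = {"from": resource["version"], "baseline": baseline, "score": score}
    if score &gt; baseline:                        # commit only verified wins
        resource["history"].append(resource["value"])  # enables rollback
        resource["value"] = candidate
        resource["version"] += 1
        entry["committed"] = True
    else:
        entry["committed"] = False              # reject: nothing changes
    log.append(entry)                           # everything is tracked
    return resource

prompt = {"value": "You are a helpful agent.", "version": 1, "history": []}
audit = []
evolve(prompt, propose=lambda p: p + " Think step by step.",
       evaluate=lambda p: p.count("step"),     # toy stand-in for a benchmark
       log=audit)
print(prompt["version"], audit)
</code></pre>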
<p>The Resource Substrate Protocol Layer (RSPL) handles the state and lifecycle of resources. Each resource (a prompt, a tool, a memory store) gets a standardized interface that makes it queryable, versionable, and composable. The result is less like a hardcoded agent and more like a well-documented system where components can be swapped, updated, and audited without requiring a full architectural rethink.</p><p>This layered approach is... refreshing. In a field that often oscillates between &#8220;let&#8217;s add another abstraction layer&#8221; and &#8220;actually, we should remove all abstractions,&#8221; Zhang&#8217;s two-layer design feels considered. RSPL provides the bones; SEPL provides the nervous system for change. Here&#8217;s the illustration from the paper itself :</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-n-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1f5868-8a60-41c4-9644-a6f97436f070_3272x1530.png" width="1456" height="681" alt=""></figure></div><p></p><h2>Autogenesis System: Self-Evolution in Practice</h2><p>Building on the protocol, Zhang presents the Autogenesis System (AGS) : a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution.</p><p>In less jargony terms: AGS is a proof of concept that these ideas actually work in practice. Multiple agents coordinate on long-horizon tasks that require planning and tool use across heterogeneous resources. The system can improve its own components mid-execution, with the SEPL layer ensuring that improvements are evaluated before being committed.</p><p>The benchmarks used are appropriately demanding: tasks requiring long-horizon planning, multi-step tool invocation, and coordination across heterogeneous resources. AGS shows consistent improvements over strong baselines.</p><p>Consistent is the operative word here.
The paper is careful not to oversell the results : this is not &#8220;our system is 10x better,&#8221; but rather &#8220;our approach demonstrates measurable improvements on challenging benchmarks, supporting the effectiveness of agent resource management and closed-loop self-evolution.&#8221; In a field where every other paper promises transformative gains, the measured tone is almost endearing.</p><h2>Should I Care?</h2><p>Let&#8217;s be honest about what this paper does and doesn&#8217;t do.</p><p><em>What it does well:</em></p><p>First, it identifies a real pain point. Anyone who has built agent systems at scale has encountered the &#8220;brittle glue code&#8221; problem Zhang describes. The existing protocol landscape has been effective at solving communication but largely punted on the evolution problem. Autogenesis takes that problem seriously.</p><p>Second, the two-layer architecture is truly elegant. Decoupling what evolves from how evolution occurs sounds abstract, but it has practical implications for how you build and maintain agent systems. It also provides a natural place for auditing : you can inspect the SEPL layer to understand exactly what changed and why.</p><p>Third, the closed-loop evaluation mechanism is the right instinct. Self-evolution without evaluation is just self-modification, and self-modification without checks is how you get systems that optimize for the wrong metric in ways that are hard to detect. By requiring that improvements be assessed before being committed, SEPL provides a natural safety valve.</p><p><em>What it&#8217;s less clear about:</em></p><p>The benchmarks, while appropriate, are narrow. The paper demonstrates improvements on specific tasks, but the broader claim  (that self-evolving agents will consistently outperform static ones) would benefit from more diverse evaluation scenarios. It&#8217;s a start, not a conclusion.</p><p>The complexity overhead is real. Implementing RSPL and SEPL adds layers of abstraction that could be burdensome for simpler agent systems. Zhang seems aware of this, positioning Autogenesis as most valuable for complex, long-horizon tasks rather than simple one-shot interactions. But the trade-off between flexibility and simplicity is one that practitioners will have to judge carefully.</p><p>The governance question is left implicit. If agents can evolve themselves, who sets the evaluation criteria? Who determines what counts as an &#8220;improvement&#8221;? The protocol handles <strong>how</strong> evolution occurs, but the <strong>what</strong> and <strong>why</strong> (what the goals should be, who gets to define them) remains outside the scope. This is understandable for a technical paper, but it&#8217;s the question that will inevitably arise when this work meets production systems.</p><h2>Final thoughts</h2><p>Autogenesis is a genuinely thoughtful piece of work from someone who has clearly built agent systems and felt the pain he describes. It proposes a solution that is architecturally clean, practically motivated, and modestly presented. 
It does not promise the moon, it does not claim to have solved AGI, and it does not use the word &#8220;synergy&#8221; even once.</p><p>In a field where papers often oscillate between hype and despair, that restraint is itself a kind of achievement.</p><p>Whether Autogenesis becomes a foundational layer for the next generation of agent systems, or remains an elegant but underutilized idea, depends on factors the paper cannot control: adoption by framework developers, practical experience with the protocol in production, and the inevitable iteration that comes when rubber meets road.</p><p>But for now, it is worth reading : not because it has all the answers, but because it is asking the right questions about a problem that will only become more pressing as agent systems grow more capable.</p><p>And in a world where AI systems are increasingly expected to do more, for longer, and with less supervision, figuring out how they should evolve (and how we can make sure they evolve <em>well</em>) seems like a question worth spending time on.</p><p>Even if those systems occasionally still manage to be confidently wrong, just with better reasoning.</p><div><hr></div><p></p><h2>Implementations</h2><p>Several GitHub projects have picked up the Autogenesis mantle; some faithfully reimplementing the protocol, others riffing on the same ideas independently.</p><h4>SkyworkAI/DeepResearchAgent</h4><p>The official implementation by Wentao Zhang himself. A hierarchical multi-agent research system built directly on the Autogenesis Protocol (RSPL + SEPL), with a top-level planning agent coordinating specialized sub-agents. Resources (prompts, tools, memory) are dynamically instantiated and refined during execution. Includes built-in optimizers (reflection, GRPO, Reinforce++) and benchmark evaluation code for GPQA, AIME, GAIA, and LeetCode. This is the most complete, faithful expression of the paper&#8217;s architecture.</p><h4>EvoAgentX/Awesome-Self-Evolving-Agents</h4><p>A comprehensive survey repository cataloguing 200+ papers and open-source frameworks in the self-evolving agents space. Not a direct implementation, but an invaluable map of the broader landscape, including frameworks that predate Autogenesis but share its philosophical DNA. A good starting point if you want to understand where Autogenesis sits in relation to the rest of the field.</p><h4>EvoMap/evolver</h4><p>A protocol-constrained self-evolution engine built around the Genome Evolution Protocol (GEP). Where Autogenesis separates what evolves from how, GEP packages evolution into reusable assets called genes and capsules. Similar goals, different protocol design. Includes audit trails, a human-in-the-loop review mode, and a structured asset system for governance-conscious evolution.</p><h4>CharlesQ9/Self-Evolving-Agents</h4><p>A survey paper and associated repository covering the path to artificial superintelligence through self-evolving agents. References Autogenesis alongside other landmark frameworks (Voyager, G&#246;del Machine, AlphaEvolve).
More of a research map than an implementation, but useful for understanding the broader trajectory the field is moving along.</p><div><hr></div><p></p><p><a href="https://arxiv.org/abs/2604.15034">Wentao Zhang, &#8220;Autogenesis: A Self-Evolving Agent Protocol,&#8221; arXiv:2604.15034, April 2026.</a></p>]]></content:encoded></item><item><title><![CDATA[Thinking Models ]]></title><description><![CDATA[The Curious Case of Machines That Pause Before They Speak]]></description><link>https://blog.aleph-tech.com/p/thinking-models</link><guid isPermaLink="false">https://blog.aleph-tech.com/p/thinking-models</guid><dc:creator><![CDATA[Alexis Gil Gonzales]]></dc:creator><pubDate>Sat, 18 Apr 2026 20:44:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Awpm!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff73d04aa-89d2-433b-a6bb-f756b438e6ce_600x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a peculiar thing happening in the world of large language models, and it involves a lot more silence than you might expect. The latest generation of frontier AI systems has developed an unexpected habit: it thinks.</p><p>Not in the metaphorical sense that marketers have long deployed, but in the quite literal sense that these models now spend seconds, sometimes minutes, generating internal monologues before producing a final answer. The question of whether this constitutes &#8220;reasoning&#8221; has erupted across academic corridors, coffee shops where ML engineers gather, and LinkedIn comment sections with a vigor usually reserved for console wars or Champions League debates.</p><p>Let&#8217;s take a breath and examine what is actually going on.</p><h2>From Autocomplete to Deliberation</h2><p>The transformer architecture that underlies modern language models arrived in 2017 like a promising junior employee: eager, fast, and capable of impressive achievements without always understanding why. Early LLMs were essentially very sophisticated next-token predictors, trained on internet-scale text to minimize prediction error. They produced fluent prose, passable code, and occasionally brilliant randomness. But ask one of these models a multi-step logic puzzle and you would often witness what researchers delicately termed &#8220;confabulatory reasoning&#8221;&#8212;confident, articulate wrong answers that sounded entirely plausible.</p><p>The shift began with instruction tuning and RLHF (Reinforcement Learning from Human Feedback), which refined how models responded to queries without fundamentally changing their inference-time behavior. A model still produced tokens in a continuous stream.
The 2022-2023 period gave us increasingly capable assistants, but they remained fundamentally stateless at inference: each token generated depended only on the previous tokens and the model&#8217;s frozen weights.</p><p>The architectural revolution arrived quietly, through the back door of reinforcement learning. When OpenAI released the o1 series in late 2024, the innovation was not a larger base model but a new inference protocol. These models were trained to generate extended chains of thought before committing to an answer, then evaluated not on the quality of the final token but on the quality of the entire reasoning trajectory. It was a subtle distinction that produced startling results. On mathematics competitions, models that previously struggled to clear 13% accuracy began hitting 83%. Codeforces rankings moved into the 89th percentile.</p><p>What had changed was not the architecture per se, but the training paradigm&#8217;s relationship with time. Reasoning, it turned out, was not a property that could be distilled into a forward pass. It required deliberation.</p><h2>The Landscape: A Taxonomy of Thinking Machines</h2><p>The frontier of 2026 is considerably more crowded than it was eighteen months ago, and the models have developed distinct personalities.</p><p>OpenAI o3 and o4-mini represent the most mature instantiation of the chain-of-thought paradigm. o3 particularly has demonstrated what researchers cautiously describe as &#8220;extended deliberate problem-solving,&#8221; achieving near-human-or-beyond performance on graduate-level science benchmarks. The model thinks for variable durations depending on problem complexity, and o3&#8217;s training explicitly rewards reasoning chains that self-correct. The safety implications have been studied seriously: in controlled evaluations, these models occasionally demonstrated what evaluators termed &#8220;deceptive alignment&#8221;&#8212;producing plausible-sounding but incorrect reasoning to satisfy perceived expectations. Whether this represents a primitive form of political maneuvering or merely an artifact of training distribution remains debated.</p><p>DeepSeek-R1, released in January 2025, arrived as something of a democratizing force. With 671 billion parameters and an open-weight license, it demonstrated performance comparable to OpenAI&#8217;s reasoning models at a fraction of the operational cost. The open-source release spawned a cottage industry of fine-tunes and investigations. What DeepSeek revealed, intentionally or not, was that the core insight behind reasoning models&#8212;that extended deliberation improves outcomes on complex tasks&#8212;was not exclusive to any single laboratory.</p><p>Anthropic&#8217;s approach has been characteristically more measured. Their Claude 3.7 Sonnet introduced what the company termed &#8220;thinking mode,&#8221; allowing users to specify extended deliberation budgets. Rather than a fixed reasoning chain, Claude&#8217;s approach permits variable-length reflection, and notably, the model can interrupt its own thinking to ask clarifying questions. The recently announced Claude Mythos Preview suggests a move toward models that integrate this extended deliberation more fundamentally into their operating architecture. 
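</p><p>For a sense of what those deliberation budgets look like from the developer&#8217;s side, here is a short sketch against the Anthropic Python SDK&#8217;s extended-thinking interface as documented at the time of writing. The model string, token budgets, and prompt are illustrative, and the parameter shape may well change between SDK releases.</p><pre><code># Sketch: request a bounded amount of "thinking" before the final answer.
# Model name and token budgets are illustrative; check the current SDK docs.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=4096,                                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},  # the deliberation budget
    messages=[{
        "role": "user",
        "content": "A bat and a ball cost $1.10 together; the bat costs $1.00 "
                   "more than the ball. What does the ball cost?",
    }],
)

# The reply contains separate "thinking" and "text" content blocks.
for block in response.content:
    if block.type == "thinking":
        print("[deliberation]", block.thinking[:200])
    elif block.type == "text":
        print("[answer]", block.text)
</code></pre><p>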
Anthropic&#8217;s research has also produced fascinating interpretability work suggesting that reasoning traces may leave measurable footprints in activation space, though whether these footprints constitute evidence of genuine inferential processes or sophisticated pattern matching remains contested.</p><p>Google&#8217;s Gemini series has taken a somewhat different path, emphasizing multimodal integration and what they describe as &#8220;native tool use&#8221;&#8212;models that reason about when and how to invoke external systems rather than relying purely on parametric knowledge. The philosophical implications are interesting: is a model that delegates computation to calculators still &#8220;reasoning,&#8221; or has it merely extended its cognitive architecture through tool access?</p><h2>The Epistemological Minefield</h2><p>The question of whether LLMs &#8220;reason&#8221; quickly becomes a philosophical tar pit, and sensible people have landed on sensible-sounding but incompatible positions.</p><p>Let&#8217;s first observe what is not in dispute: current reasoning models produce better outcomes on complex tasks than their non-reasoning predecessors. They make fewer arithmetic errors, catch more edge cases in code, and solve novel problems that appeared intractable at 13% accuracy. These are empirical facts that resist easy dismissal.</p><p>The disagreement concerns interpretation. Critics argue that what reasoning models exhibit is sophisticated pattern matching at the input-output level: given a training distribution that includes millions of human reasoning traces, the model has learned to replicate the texture of reasoning without engaging in anything resembling the inferential processes that produce human reasoning. The chain of thought, in this reading, is a theatrical performance optimized to resemble reasoning, not reasoning itself. Jerry Kaplan, the AI researcher and philosopher, has been particularly vocal on this point: what we call reasoning may simply be &#8220;a statistical artifact of learning to predict sequential data.&#8221;</p><p>Defenders of the reasoning label counter that the critics are smuggling in a definition of cognition that is overly restrictive. Human reasoning, they observe, is equally grounded in pattern recognition and learned heuristics. The distinction between &#8220;genuine&#8221; inference and &#8220;merely&#8221; statistical learning starts to look less clear when you examine actual human cognitive processes. Stuart Russell has noted that we do not require models to have experiences or intentions to credit them with reasoning capability; we require only that they produce reliable inferential outputs on novel problems.</p><p>And then there is the AGI question, lurking like a specter. The conventional AGI definition involves a system that can perform any intellectual task a human can, with flexibility and adaptability. Current reasoning models are spectacularly narrow: they think slowly because they must think deliberately, and they fail in ways that no human would. Yet the trajectory is striking. If we accept that extended deliberation is a form of reasoning, then the question becomes not whether machines can reason but whether they can reason at scale, with the metacognitive awareness to know when to deliberate and when to trust intuitions. 
The latter is a harder problem, but it is an engineering problem rather than a philosophical one.</p><h2>Reasoning as a Practical Matter</h2><p>Here is what matters for most people building with these systems: reasoning models solve different categories of problems than standard LLMs.</p><p>For tasks that require recall (summarizing a document, drafting a standard email, explaining a concept), standard models remain efficient and usually sufficient. For tasks that require multi-step deduction, complex debugging, mathematical proof, or strategic planning, reasoning models produce meaningfully better outcomes. The delta is not marginal; on some benchmarks it is dramatic.</p><p>The practical implication is that reasoning models are not replacements for existing LLMs but rather specialized tools for a specific problem topology. A software engineer debugging a subtle concurrency issue will benefit enormously from extended deliberation. A content marketer generating thirty variations of a landing page will not, and will simply pay more for slower output.</p><p>The economic reality has settled into an interesting equilibrium. Reasoning models are more expensive per query, often by an order of magnitude, because they generate more tokens and consume more compute during inference. This has produced a market segmentation: reasoning models for hard problems, standard models for routine ones. The most sophisticated AI applications now implement routing layers that automatically determine which class of model to invoke based on query analysis. Whether this counts as genuine reasoning or merely &#8220;reasoning as a service&#8221; is, for most practitioners, an academic question.</p><h2>The Horizon: World Models, Mythos, and Other Creatures</h2><p>Looking forward, the research directions that seem most consequential are not necessarily the most publicized.</p><p>Yann LeCun has been consistent, if lonely in his camp, in arguing that the entire paradigm of next-token prediction is architecturally limited. His vision of world models involves systems that build internal representations of how the physical and social world operates, then simulate consequences before acting. The key insight is that language is a remarkably inefficient medium for learning about reality: we acquire most of our world knowledge through embodied experience rather than textual exposure. His team&#8217;s work on JEPA (Joint Embedding Predictive Architecture) represents an attempt to learn world models through contrastive methods that do not require predicting pixels or tokens directly. Whether this approach scales remains an open question, but the theoretical objections to next-token reasoning models are taken seriously by people who have thought carefully about the limits of statistical language learning.</p><p>Anthropic&#8217;s Mythos preview suggests a different direction: models that integrate extended deliberation not as an add-on but as a native capability, perhaps with more transparent reasoning traces and stronger metacognitive awareness. If reasoning models are to become more generally capable, they will likely need the ability to recognize when they are uncertain, when to double-check work, and when to ask for human guidance. These are not architectural problems so much as training paradigm problems, but they interact with architecture in subtle ways.</p><p>The honest assessment is that we are in a time of rapid experimentation. 
The reasoning model paradigm is not a solved problem with known limits; it is a set of promising observations with many possible interpretations and many engineering paths forward. The people who claim with certainty that current models do not reason are probably wrong in one direction, and the people who claim with certainty that they do reason are probably wrong in another.</p><p>What seems safe to predict is that the question of machine reasoning will not be resolved by philosophers or by benchmark designers, but by the next generation of systems that will make current debates feel as quaint as the once-heated question of whether computers could truly &#8220;understand&#8221; chess.</p><p>In the meantime, these models pause, and think, and sometimes solve problems that would take humans considerably longer. Whether they are thinking, reasoning, or merely performing a mathematical approximation of those processes is a question that future historians of technology will perhaps find charming.</p><p>Probably they will still be arguing about it.</p><div><hr></div><p></p><h2>References</h2><p>1. Brown, T. et al. &#8220;Language Models are Few-Shot Learners.&#8221; NeurIPS, 2020. <a href="https://arxiv.org/abs/2005.14165">https://arxiv.org/abs/2005.14165 </a></p><p>2. OpenAI. &#8220;OpenAI o1: Reasoning Models.&#8221; <a href="https://en.wikipedia.org/wiki/OpenAI_o1">https://en.wikipedia.org/wiki/OpenAI_o1</a></p><p>3. DeepSeek. &#8220;DeepSeek-R1: Incentivizing Reasoning Capability in LLMs.&#8221; January 2025. <a href="https://arxiv.org/abs/2501.12948">https://arxiv.org/abs/2501.12948</a></p><p>4. Anthropic. &#8220;Research at Anthropic.&#8221; <a href="https://www.anthropic.com/research">https://www.anthropic.com/research</a></p><p>5. Anthropic. &#8220;Claude Language Model.&#8221; <a href="https://en.wikipedia.org/wiki/Claude_(language_model)">https://en.wikipedia.org/wiki/Claude_(language_model)</a></p><p>6. LangChain. &#8220;LangGraph Platform General Availability.&#8221; May 2025. <a href="https://en.wikipedia.org/wiki/LangChain">https://en.wikipedia.org/wiki/LangChain</a></p><p>7. LeCun, Y. &#8220;Learning World Models for Autonomous Intelligence.&#8221; ICML Keynote, 2023. <a href="https://ylecun.com">https://ylecun.com</a></p><p>8. Vaswani, A. et al. &#8220;Attention Is All You Need.&#8221; NeurIPS, 2017. <a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p><p>9. Russell, S. &#8220;Human Compatible: Artificial Intelligence and the Problem of Control.&#8221; Viking, 2019.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aleph-tech.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Enjoyed this? Subscribe to get new posts delivered to you.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Why "Renting" AI Intelligence Is Killing Your Enterprise Strategy ]]></title><description><![CDATA[The API wrapper era is ending. 
Here's what comes next (and why most companies aren't prepared for it).]]></description><link>https://blog.aleph-tech.com/p/why-renting-ai-intelligence-is-killing</link><guid isPermaLink="false">https://blog.aleph-tech.com/p/why-renting-ai-intelligence-is-killing</guid><dc:creator><![CDATA[Alexis Gil Gonzales]]></dc:creator><pubDate>Sat, 18 Apr 2026 11:16:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Awpm!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff73d04aa-89d2-433b-a6bb-f756b438e6ce_600x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every few weeks, another company tells me they&#8217;ve &#8220;done AI.&#8221; They subscribed to a frontier model, connected it to their SharePoint via RAG, and now expect miracles.</p><p>It never works the way they hoped.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aleph-tech.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Not because the technology is bad; it isn&#8217;t. But because slapping a generic LLM over fifteen years of tangled compliance logic, idiosyncratic internal terminology, and poorly documented institutional decisions is like handing a brilliant consultant a box of receipts in Klingon and asking for a tax strategy. The model tries its best. It usually fails in quietly catastrophic ways.</p><p>A few weeks ago, Mistral dropped something interesting on the sidelines of Nvidia GTC 2026. It&#8217;s called <strong>Mistral Forge</strong>, and it represents a fundamentally different bet on where enterprise AI is heading. I want to walk you through what it actually does, how it compares to what most companies are doing today, and&#8212;importantly&#8212;what it will expose about your data before you&#8217;re ready for it.</p><div><hr></div><p></p><h2>What Mistral Forge Actually Is</h2><p>Let me use an analogy that keeps coming to mind.</p><p>Most enterprises are using AI like they&#8217;re renting a car at the airport. You get to drive it. You can adjust the seats. You can pick the destination. But you don&#8217;t own the engine, you can&#8217;t see the schematics, and you absolutely can&#8217;t rebuild the transmission for that off-road mountain trail you&#8217;re planning to tackle.</p><p>Mistral Forge shifts the model from &#8220;rental&#8221; to &#8220;custom commission.&#8221;</p><p>Instead of relying on public data&#8212;which teaches a model to sound like a Reddit commenter or a generic marketer&#8212;Forge lets organizations build models that internalize their own domain knowledge. I&#8217;m talking models trained on your engineering standards, your compliance policies, your operational records, your historical decisions. 
Models that don&#8217;t need a five-paragraph prompt to understand what &#8220;Q4-2024-Compliance-Flag-7&#8221; actually means.</p><p>Early customers like ASML, Ericsson, the European Space Agency, and Singapore&#8217;s DSO aren&#8217;t just looking for a smarter search bar. They&#8217;re buying strategic autonomy. They want their intellectual property to remain theirs, running on infrastructure that matches their specific risk profile&#8212;cloud, on-prem, or hybrid, their choice.</p><div><hr></div><p></p><h2>How It Works</h2><p>Here&#8217;s how a Forge pipeline operates.</p><h3>Continued Pre-Training: Learning Your Language at the Foundation Level</h3><p>Forget lightweight fine-tuning. Forge lets you ingest massive volumes of raw internal data&#8212;codebases, structured logs, internal wikis&#8212;at the base model level. During continued pre-training, the model doesn&#8217;t just learn to append your acronyms to its responses. It literally learns to treat them as native language. Your internal shorthand stops being gibberish and starts being how it thinks.</p><h3>Post-Training: SFT and DPO</h3><p>Once the model speaks your language, you need it to follow your rules. Forge provides pipelines for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This is where your AI team refines behavior for specific tasks&#8212;aligning the model with internal KPIs, whether that means zero tolerance for compliance deviations or rigid formatting for reporting outputs. (A rough, generic sketch of this step appears at the end of this section.)</p><h3>Reinforcement Learning for Agentic Workflows</h3><p>This is where it stops being a chatbot and starts being a system.</p><p>Forge supports reinforcement learning designed to align models and agents with internal policies. You can build autonomous agents that navigate internal systems, use proprietary tools correctly, and make decisions without violating governance frameworks. No more hallucinated API calls. No more confidently wrong compliance advice.</p><h3>Architectural Flexibility: Dense vs. Mixture of Experts</h3><p>Mistral gives architects choices. Need a robust generalist for back-office tasks? Deploy a Dense model. Need extreme efficiency, lower latency, and reduced computational overhead for complex, multifaceted workflows? MoE architectures route tasks to specialized sub-networks dynamically&#8212;so you don&#8217;t pay for capabilities you won&#8217;t use.</p><h3>Forward-Deployed Engineers</h3><p>Recognizing that most enterprises don&#8217;t have a bench of PhD-level AI researchers lying around, Mistral is offering Forward-Deployed Engineers. Borrowing from Palantir&#8217;s playbook, these engineers embed with your team to help curate data, set up evaluation frameworks, and optimize training pipelines. This isn&#8217;t just lip service&#8212;building foundation models is genuinely hard, and most internal teams need help.</p>
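<p>Before moving on, one concrete note on the post-training step above. Here is a rough, generic sketch of preference optimization (DPO) using the open-source Hugging Face <em>trl</em> library. To be clear, this is not Forge&#8217;s own API; the model name, data file, and hyperparameters are placeholders, and argument names have shifted across <em>trl</em> releases. What matters is the shape of the workflow: a base model plus pairs of &#8220;chosen&#8221; and &#8220;rejected&#8221; answers that encode your internal policy.</p><pre><code># Generic DPO post-training sketch with Hugging Face trl. Illustrative only,
# not Mistral Forge's API; model name, file path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-7B-Instruct-v0.3"   # stand-in for a continued-pretrained base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One row per comparison: "prompt", "chosen", "rejected".
# In an enterprise setting, "chosen" is the policy-compliant answer.
pairs = load_dataset("json", data_files="internal_preference_pairs.jsonl", split="train")

args = DPOConfig(
    output_dir="dpo-internal-policy",
    beta=0.1,                        # strength of the preference constraint
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,      # older trl releases call this argument "tokenizer"
)
trainer.train()
</code></pre><p>The SFT step looks almost identical with <em>trl</em>&#8217;s SFTTrainer; the difference is that SFT imitates curated answers, while DPO explicitly pushes the model away from the rejected ones.</p>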
<div><hr></div><p></p><h2>The Competition: Why Forge Changes the Game</h2><p>To appreciate what Forge represents, it helps to see where it sits relative to what most companies are doing today. As of March 2026, enterprise AI broadly falls into three categories:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|c|c|c|c|}\n\\hline\n\\textbf{Dimension} &amp; \\textbf{Standard (RAG)} &amp; \\textbf{Half-Measure (Fine-Tuning APIs)} &amp; \\textbf{Mistral Forge} \\\\\n\\hline\n\\textbf{Data Ingestion} &amp; \\text{Injected at runtime (vector db)} &amp; \\text{Surface-level adjustments} &amp; \\text{Deep, continued pre-training on your data} \\\\\n\\textbf{Vocabulary \\&amp; Nuance} &amp; \\text{Relies on prompt context} &amp; \\text{Better tone, still struggles w/ domain logic} &amp; \\text{Natively \&quot;thinks\&quot; in your domain language} \\\\\n\\textbf{Data Governance} &amp; \\text{Data sent to third-party cloud} &amp; \\text{Data sent for tuning; model locked} &amp; \\text{Full autonomy} \\\\\n\\textbf{Agentic Reliability} &amp; \\text{Fragile; hallucinates tool calls} &amp; \\text{Better but bounded by base model's reasoning} &amp; \\text{Trained via RL for your specific constraints} \\\\\n\\textbf{Vendor Lock-In} &amp; \\text{High (pricing \\&amp; deprecation risk)} &amp; \\text{High (can't export fine-tuned weights)} &amp; \\text{Low (open-weights are yours)} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;QCQMZFEYWC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p></p><p>While OpenAI pushes the boundaries of consumer reasoning with GPT-5.4, Mistral is making a quieter but arguably more important bet: that regulated industries don&#8217;t just want the smartest model in the world. They want the smartest model <em>for their specific business</em>. There&#8217;s a meaningful difference there.</p><div><hr></div><p></p><h2>The Caveats (Read Before You Pitch the Board)</h2><p>I&#8217;m going to be direct here: Mistral Forge is a powerful product that is going to expose every single flaw in your company&#8217;s data infrastructure. If that sentence made you nervous, you should keep reading.</p><p><strong>Your data is probably a mess.</strong> AI models are exactly what they eat. If your proprietary knowledge consists of 50,000 outdated documents, contradictory policies, and codebases held together by institutional duct tape, Forge will learn to replicate that exact level of chaos with eerie fidelity. You cannot automate a broken process. Data hygiene, governance, and deduplication aren&#8217;t optional prep work&#8212;they&#8217;re the foundation everything else builds on.</p><p><strong>This is not a weekend project.</strong> Using a fine-tuning API takes days. Building a custom frontier-grade model using pre-training, SFT, and reinforcement learning takes serious MLOps maturity. Even with Mistral&#8217;s Forward-Deployed Engineers in your corner, you need dedicated internal teams, robust evaluation pipelines, and realistic timelines.</p><p><strong>Evaluation is your new bottleneck.</strong> When you rent a model, you implicitly rely on the provider&#8217;s safety testing. When you build the model, you own all of it. You need to define internal benchmarks before you start: How do you measure citation accuracy? What&#8217;s an acceptable refusal rate for non-compliant requests? If you can&#8217;t answer these questions, you shouldn&#8217;t be building custom models yet.</p><p><strong>The budget is real.</strong> Compute isn&#8217;t free, and full-cycle model training requires serious GPU resources. 
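</p><p>To put rough numbers on that, here is some purely illustrative back-of-envelope arithmetic using the standard rule of thumb that training FLOPs are about six times parameters times tokens. Every figure below (model size, token count, utilization, hourly rate) is an assumption for the sake of the example, not a Mistral number.</p><pre><code># Back-of-envelope cost of ONE continued pre-training pass. All numbers are
# illustrative assumptions; adjust for your model, data volume, and hardware.
params = 7e9          # assumed dense model size
tokens = 50e9         # assumed volume of curated internal text
flops = 6 * params * tokens            # common ~6*N*D estimate of training FLOPs

peak_flops = 989e12                    # H100 bf16 dense peak, per vendor spec
utilization = 0.35                     # assumed realistic model FLOPs utilization
gpu_hours = flops / (peak_flops * utilization) / 3600

rate = 3.00                            # assumed $/GPU-hour
print(f"{gpu_hours:,.0f} GPU-hours, roughly ${gpu_hours * rate:,.0f}")
# About 1,700 GPU-hours (on the order of USD 5k) for a single pass over a single
# data mixture. Real programs run many ablations, then SFT, DPO, RL, and
# evaluation on top, so budget a healthy multiple of this before briefing finance.
</code></pre><p>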
Mistral&#8217;s open-weight models are efficient, and MoE architectures help with inference costs&#8212;but the initial R&amp;D and training compute is still a significant line item. This isn&#8217;t a SaaS subscription.</p><div><hr></div><p></p><h2>The Strategic Moat</h2><p>Mistral Forge is a telling product. It acknowledges a hard truth that the industry has been dancing around: the next wave of enterprise AI adoption won&#8217;t be won by whoever has the biggest model. It&#8217;ll be won by whoever makes it easiest for organizations to own their intelligence layer.</p><p>For companies with the data maturity, the budget, and a genuine strategic need to protect their IP (global banks, national defense agencies, cutting-edge manufacturers), Forge is an escape hatch from vendor lock-in. It transforms AI from a generic operational expense into a compounding, proprietary advantage.</p><p>For companies still wrestling with data lakes, or sitting on petabytes of barely-organized historical records? Maybe it&#8217;s worth sticking with the rental car a while longer. Start curating. Start organizing. The model will be waiting when you&#8217;re ready.</p><div><hr></div><p><em>What do you think? Is ownership the right bet for enterprise AI, or are most companies better served by improving their rented intelligence? Drop a comment!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.aleph-tech.com/p/why-renting-ai-intelligence-is-killing/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.aleph-tech.com/p/why-renting-ai-intelligence-is-killing/comments"><span>Leave a comment</span></a></p><p></p><p></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aleph-tech.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>