26.02.2026

Welcome, Heterogeneous Intelligence

The hardest problems in AI are not single-step questions with clear answers. They are long-horizon, multi-turn and open-ended - unfolding across environments that change with every interaction. They are also naturally heterogeneous; they need diverse reasoning and action with agents that explore, verify, synthesise and act on the world, coordinating in real time to collectively solve problems no single agent could handle alone. To have real economic impact, these problems must be solved quickly and cheaply - often simultaneously. This is a systems problem requiring the coordination of interacting models, not just an individual model in isolation.

Today we're sharing early discoveries from the infrastructure we built to solve these kinds of problems - infrastructure that co-evolves heterogeneous chips and intelligence, charting a new direction for AI compute. Here we demonstrate results across today's heterogeneous silicon including AWS Trainium, Cerebras, SambaNova, and others.

Callosum is a vertically integrated Intelligent System: dynamic workflows and agents optimised down to kernels and silicon. Our workflows are aware of the hardware they're running on. Our models are aware of the task graph they're serving. Our kernels are aware of the output constraints the workflow requires. Every layer sees the others, each one co-optimised in context of the whole - across heterogeneous models and heterogeneous chips simultaneously. Our conviction is that the next substantial leaps in AI capability, cost and speed will come from this heterogeneity - mixed models on mixed hardware, orchestrated end-to-end, co-evolved to exploit their differences. We call this paradigm Heterogeneous Intelligence.

Across four categories of problems - Deep context, Open web, Cache-intensive and Tool calling - we break state-of-the-art, delivering fundamental improvements in cost and speed, through a single stack. Our infrastructure makes it possible to flexibly tailor capability, speed and cost to optimise for the particular objectives of the user. In many cases, this unlocks orders-of-magnitude improvements, turning benchmark capabilities into systems that are economically viable solutions for real-world challenges. Keep in mind, this is the worst our infrastructure will ever be.

Deep Context Problems - 12x cheaper, 5.5x faster with heterogeneous recursion

Heterogeneous recursion uses mixed models and mixed silicon to solve long-chain reasoning faster and cheaper than any single configuration.

Deep context problems are everywhere in production AI: sifting through large, dynamically changing bodies of information, crunching databases requiring sustained reasoning over each entry, rapidly selecting task-relevant information at the timescale the task demands. The context itself becomes an environment to be actively explored, as it needs to be chunked, navigated, and reasoned over.

A single workflow here contains fundamentally different kinds of computation: rapid generation, deep retrieval, long-range coherence verification, branching evaluation. Each has a different memory profile, a different latency requirement, a different optimal hardware target.

We decompose these workflows across a hierarchy of recursive language models (RLMs). In the naive case, these are models that call themselves iteratively to solve problems beyond the reach of a single forward pass: generating partial results, evaluating them, then recursing to refine, expand, or branch. Heterogeneous recursion takes this further: rather than one model recursively calling the same model on the same hardware, we decompose the recursion across a deliberately selected, diverse set of models and silicon. A seed model sets the plan and coherence constraints while sub-models handle the expansion, retrieval, and verification, each dispatched to the hardware that fits its compute profile.

Diagram showing heterogeneous recursion: a user query enters a root model on Chip A, which decomposes it into sub-tasks distributed across different models on different chips (B and C), each with their own query, context, and response flows, recursing to arbitrary depth before returning a final response. — Heterogeneous recursion across chips and models. A root agent on Chip A decomposes a user query into sub-task sets, each routed to specialised model-chip pairings (Model B on Chip B, Model C on Chip C) that can further recurse to depth n. Sub-responses propagate back to produce the final output.
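The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Callosum's implementation: the `Worker` class, the worker names, the `max_lines` limits, the halving strategy and the summing aggregation are all hypothetical, and each "model call" is a plain Python function standing in for a real model on real silicon.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Worker:
    name: str                               # hypothetical model-chip pairing
    max_lines: int                          # largest context handled in one call
    solve: Callable[[str, List[str]], int]  # stand-in for a real model call


def count_matches(query: str, lines: List[str]) -> int:
    """Toy 'model': counts the lines that mention the query."""
    return sum(query in line for line in lines)


def recurse(query: str, lines: List[str], workers: List[Worker],
            depth: int = 0,
            trace: Optional[List[Tuple[int, str, int]]] = None) -> int:
    # Pick the worker assigned to this recursion depth (last one is reused).
    worker = workers[min(depth, len(workers) - 1)]
    if trace is not None:
        trace.append((depth, worker.name, len(lines)))
    # Base case: the context fits this worker, so answer directly.
    if len(lines) <= worker.max_lines or len(lines) <= 1:
        return worker.solve(query, lines)
    # Otherwise split the context and hand each half to the next level down.
    mid = len(lines) // 2
    return (recurse(query, lines[:mid], workers, depth + 1, trace)
            + recurse(query, lines[mid:], workers, depth + 1, trace))


workers = [
    Worker("Model A on Chip A (root)", 0, count_matches),  # always decomposes
    Worker("Model B on Chip B", 4, count_matches),
    Worker("Model C on Chip C", 4, count_matches),
]
log = [f"event {i}: {'error' if i % 3 == 0 else 'ok'}" for i in range(16)]
trace: List[Tuple[int, str, int]] = []
total = recurse("error", log, workers, trace=trace)
```

Here the root never answers directly; it only decomposes, and the recorded `trace` shows which pairing handled which slice of the context at which depth. In a real system the per-depth assignment is the thing being searched over.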

We found that how you partition the recursion (which model handles which level, on which chip, and at which point in the reasoning chain) matters more than the raw capability of any individual component. Get the decomposition right, and you can unlock frontier-level performance at a fraction of the cost and time.

By enabling heterogeneous recursion across a flexible range of configurations (many different models, at different depths, on distinct silicon) we unlock a configuration space that single-model systems can't access. We found that many configurations achieve comparable accuracy, but at very different price and speed points. Cerebras Llama-70B delivers the same accuracy at 5.5x the speed and 4.8x lower cost than GPT-5 as a recursive language model. SambaNova Llama-70B matches it at 8.8x lower cost. Cerebras GPT-OSS-120B delivers the same accuracy at 5x the speed and 7x lower cost than GPT-5. SambaNova GPT-OSS-120B matches that accuracy at 12x lower cost. No single configuration dominates across every objective. Our infrastructure discovers these differences automatically across the OOLONG dataset, routing each phase of the workflow to the silicon where it performs best and giving the user the ability to optimise for the tradeoff that matters to them.

Scatter plot comparing recursive language model configurations across wall clock time (x-axis, seconds) and cost per task in USD (y-axis), with bubble size indicating performance level (lower, medium, higher). All models are seeded with GPT-5. OpenAI GPT-5 sits in the upper-right at approximately 2000 seconds and $3.20, while Cerebras and SambaNova configurations running Llama-70B, Llama-8B, and GPT-OSS-120B cluster in the lower-left between 200–1000 seconds and $0.10–$1.30, achieving comparable or higher performance at a fraction of the cost and time. OpenAI GPT-5-Mini and GPT-5-Nano appear slower and more expensive than the heterogeneous alternatives despite lower performance. - No single heterogeneous recursion configuration dominates across every objective. Each point represents a different model-silicon pairing executing the same recursive language model workflow on OOLONG. Cerebras Llama-70B delivers comparable accuracy to recursive GPT-5 at 4.8× lower cost and 5.5× faster; SambaNova Llama-70B at 8.8× lower cost. Our infrastructure discovers these tradeoffs automatically, letting users optimise for what matters to them.
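When no configuration dominates on every objective, the useful output is the set of non-dominated options. The sketch below shows one way to compute that set for cost and wall-clock time, assuming accuracy has already been judged comparable. The configuration names and numbers are placeholders invented for illustration, not measurements from the plot.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (name, cost in USD per task, seconds per task)


def pareto_front(configs: List[Config]) -> List[str]:
    """Return the names of configurations that no other configuration beats
    on both cost and time (lower is better on both axes)."""
    front = []
    for name, cost, secs in configs:
        dominated = any(
            c2 <= cost and s2 <= secs and (c2 < cost or s2 < secs)
            for _, c2, s2 in configs
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values, chosen only to exercise the function.
configs = [
    ("baseline", 3.0, 2000.0),
    ("fast", 0.7, 350.0),
    ("cheap", 0.3, 900.0),
    ("middling", 1.0, 1000.0),
]
front = pareto_front(configs)
```

With these placeholder values, `fast` and `cheap` both survive: one wins on time, the other on cost, so the choice between them depends on what the user is optimising for.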

So what does this look like in practice? One of our partners, Coworker AI, whose autonomous agents handle millions of complex enterprise workflows, deals with this every day: generating status reports from raw activity logs, where contexts are long and noisy, signals are sparse, and the workflow demands retrieval, attribution, categorisation and strict formatting. These are precisely the kind of deep context problems heterogeneous recursion solves. We benchmarked their results on GitHub activity logs from the public vLLM repository. Callosum infrastructure clearly beats single-call Claude Opus 4.5 baselines across the board. At 30k tokens - well within range for a single call - Callosum is 2.1x cheaper with a +13 point quality improvement due to more accurate attribution of commits to contributors, fewer hallucinated entries and tighter adherence to the required output format. At 200k tokens, where single-call approaches degrade as critical signals drown in noise, we are 1.8x cheaper and +13.5 points higher quality.

Scatter plot comparing in-production single-models (dark blue) against Callosum's heterogeneous recursion (green) on GitHub activity log summarisation at three context lengths: 30k, 88k, and 200k tokens. At every context length, the green Callosum points sit higher (better performance) and further left (lower cost) than the corresponding blue baselines. The performance gap widens as context grows. At 200k tokens, the single-model baseline drops to ~75 performance at ~$1.10, while Callosum achieves ~88 performance at ~$0.60. — Benchmarked on a real production task with our enterprise partners generating status reports from GitHub activity logs. Heterogeneous recursion beats single-call Claude Opus 4.5 at every context length, delivering higher quality at lower cost. The relative advantage compounds as contexts grow: at 200k tokens, where single-call approaches degrade as critical signals drown in noise, Callosum is 1.8× cheaper with a +13.5 point quality improvement.

Deep context isn't a frontier model problem. It's an orchestration problem - and heterogeneous recursion is how we're solving it.

Open Web Problems - Breaking SOTA while being over 3x cheaper and faster with heterogeneous vision-language-action systems of agents

Heterogeneous vision-language-action with active perception and multi-modal memory breaks state-of-the-art and generalises to cheaply automating payments.

In the previous section, the environment our infrastructure addressed can be thought of as a document - long, noisy, but ultimately relatively static text. The internet is a much more complex open-ended environment. Here, a single task can require visual perception, text comprehension, long-horizon planning, precise spatial targeting, and real-time adaptation to a live interface, all within a single workflow. Pages load unpredictably, layouts shift, elements appear and disappear. Crucially, no amount of reasoning alone can solve these problems. They require active perception, action, and continuous interaction with an environment that changes with every step.

Callosum's infrastructure treats the web the way it actually is - a heterogeneous environment requiring heterogeneous intelligence. The system maintains trajectory memory, allowing it to look back across its own history to verify decisions and replan when things go wrong. It holds visual working memory, tracking the precise spatial layout of the page so it can target the right button, the right link, the right field. It switches fluidly between visual and textual perception depending on what each decision demands: reading a price as text, recognising a product from its image, interpreting a layout spatially. We gave our system an Umwelt - in biology, the perceptual world an organism constructs from what it needs to know. It doesn't perceive the web passively. It actively reshapes its own perception to match the task: switching to raw HTML when it needs structured data, zooming into a region to verify a UI element, scrolling to reveal what's hidden.

Through our system's discovery process, we quickly found that not every step needs the same model. Planning requires a large vision-language model with global context but verifying the coordinate of a small UI element (e.g., a rating button, a product link) does not. Before executing each action, the system zooms into the target region and routes this verification to a much smaller 8B model. This is heterogeneous intelligence at the level of individual actions: decompose the workflow, match each step to the smallest model that can solve it, and the gains compound across every decision in the trajectory.
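The rule "match each step to the smallest model that can solve it" can be written down as a simple lookup. The sketch below is only an illustration of that rule: the model labels, the step kinds and the capability sets are hypothetical, and in the system described above the assignment is discovered rather than hand-written.

```python
from typing import List, Set, Tuple

# Ordered from smallest to largest. Names and capability sets are hypothetical.
MODELS: List[Tuple[str, Set[str]]] = [
    ("small-8b-vlm", {"verify_target"}),
    ("large-vlm", {"plan", "verify_target"}),
]


def route(step_kind: str) -> str:
    """Return the first (smallest) model whose capabilities cover the step."""
    for name, capabilities in MODELS:
        if step_kind in capabilities:
            return name
    raise ValueError(f"no model can handle step kind: {step_kind!r}")


# A toy trajectory: plan once, then verify the target before each action.
trajectory = ["plan", "verify_target", "verify_target", "plan", "verify_target"]
assignments = [route(kind) for kind in trajectory]
```

Because verification steps recur throughout a trajectory, sending them to the small model is where the savings accumulate.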

With our heterogeneous approach, we set a new state-of-the-art on VisualWebArena shopping tasks, surpassing the ICLR 2026 SOTAs with 1.18× the score of Self-Grounded Verification (SGV) and 1.25× the score of Web Agents that Learn Tools (WALT). We used only open-source vision-language-action models and no frontier API calls or proprietary models.

Bar chart showing shopping task pass rates on VisualWebArena. Early approaches using frontier proprietary models score 16% (GPT-4V) and 20% (GPT-4o). More recent systems reach 39% (AWorld, GPT-5), 53% (WALT, GPT-5), and 56% (SGV, Gemini 2.5 Flash), the latter representing ICLR 2026 SOTA. Callosum achieves 66% using only open-source models (Kimi-K2.5 + Qwen3-VL-8B-Instruct), a relative improvement of about 18% over SGV and 25% over WALT. A dashed line marks human benchmark performance at 88%. - New state-of-the-art on VisualWebArena shopping tasks, 1.18× the score of SGV and 1.25× the score of WALT (both ICLR 2026), using only open-source vision-language-action models. Every prior system on this benchmark relies on frontier proprietary models. Callosum achieves 66% with Kimi-K2.5 and Qwen3-VL-8B-Instruct, with zero frontier API calls.

But the benchmark is the starting point, not our destination. What matters is that the system is already generalising to problems it has never seen. Consider asking our system:

"Find a robot in Amazon UK that looks similar to the one on the right of image 1, playing the ball as in image 2, from the same team of the player in image 2. Compare the price of the same product on OnBuy. If Amazon is indeed cheaper, purchase it with gift option and ship to our offices."

We actually tried it out. Here is the result:

A walkthrough of Callosum's system solving a very challenging shopping problem. — Automating a real purchase from ambiguous, multi-modal specifications - the kind of task that defeats text-only systems. We asked our system to identify a specific robot across two images, find it on Amazon UK, compare prices on OnBuy, and purchase it, all on live websites. Watch how it handles the entire chain end-to-end.

This is an inherently heterogeneous task with multiple modalities, visual and text input processed by different network branches, long-horizon planning, self-verification and replanning across a live website. Text-only models cannot solve this and standard agentic frameworks struggle. Our system solves it because the same heterogeneous capabilities that broke the benchmark generalise.

Simply moving from homogeneous to heterogeneous configurations shifts the cost-latency Pareto frontier - and this holds across the open/closed model divide. By pairing a lightweight open-source vision model (Qwen3-VL-8B-Instruct) with either a frontier closed-source model (GPT-5.2) or a frontier open-source model (Kimi-K2.5), we achieve substantial reductions in both latency and cost without sacrificing task accuracy. The GPT-5.2 heterogeneous configuration is up to 3.7× cheaper (~$0.22 vs ~$0.83 per task) and 3× faster, as the smaller model not only reduces the latency and cost of zoom-in steps but also enables more accurate interactions that reduce the total number of steps required. The Kimi-K2.5 pairing shows the same pattern. The gains are model-agnostic: heterogeneous orchestration works whether the frontier model is open-source or closed-source, because the advantage is architectural, not a property of any single model.

Scatter plot comparing average latency per task (x-axis, seconds) against average cost per task (y-axis, USD) for four configurations on VisualWebArena shopping tasks. Two individual large models appear in dark blue: GPT-5.2 (closed-source) at approximately 780 seconds and $0.83, and Kimi-K2.5 (open-source) at approximately 920 seconds and $0.05. Two heterogeneous Callosum configurations appear in purple: Qwen3-VL-8B-Instruct + GPT-5.2 at approximately 260 seconds and $0.22 (3.0× faster, 3.7× cheaper than GPT-5.2 alone), and Qwen3-VL-8B-Instruct + Kimi-K2.5 at approximately 680 seconds and $0.05 (1.3× faster than K2.5 alone, 18× cheaper than GPT-5.2). Dashed arrows connect each homogeneous model to its heterogeneous counterpart, showing that orchestration shifts the Pareto frontier toward lower cost and lower latency regardless of whether the frontier model is open or closed source. — Heterogeneous model orchestration shifts the cost-latency Pareto frontier and opens orchestration across the open/closed model divide. The gains come from two sources: cheaper per-step inference and fewer total steps through more accurate interactions.

We also found that GPT-5.2 struggled particularly with reliable coordinate localisation; even with zoom-step assistance, it frequently selected wrong locations and failed far more often. Our infrastructure identified these failure points automatically and found that many were moments where the system must actively engage its environment rather than reason about it. These are precisely the steps where heterogeneity pays off most. On zoom steps alone, using a small 8B model achieved an 11x speedup and 43x cost reduction, from ~$0.0056 per zoom step with GPT-5.2 to ~$0.00013, while actually improving reliability.
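As a quick check of the arithmetic, the per-step prices quoted above can be turned into a ratio and then scaled to a whole trajectory. The two prices come from the text; the helper function and the count of 20 zoom steps are assumptions made purely for illustration.

```python
# Per-zoom-step prices in USD, as quoted in the paragraph above (approximate).
GPT52_COST_PER_ZOOM = 0.0056
SMALL_8B_COST_PER_ZOOM = 0.00013

cost_ratio = GPT52_COST_PER_ZOOM / SMALL_8B_COST_PER_ZOOM  # roughly 43


def zoom_cost(n_zoom_steps: int, cost_per_step: float) -> float:
    """Total spend on zoom steps for one trajectory."""
    return n_zoom_steps * cost_per_step


# Hypothetical trajectory length; the text does not state a typical value.
N_ZOOM_STEPS = 20
saving = (zoom_cost(N_ZOOM_STEPS, GPT52_COST_PER_ZOOM)
          - zoom_cost(N_ZOOM_STEPS, SMALL_8B_COST_PER_ZOOM))
print(f"cost ratio ~{cost_ratio:.0f}x, saving over {N_ZOOM_STEPS} zoom steps ~${saving:.4f}")
```

The ratio is fixed per step, but the absolute saving grows with the number of verification steps in the trajectory, which is why the gains compound.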

Two side-by-side bar charts comparing GPT-5.2 against Qwen3-VL-8B-Instruct on zoom action sub-tasks. Left panel (Latency): GPT-5.2 takes approximately 90 seconds per zoom step; Qwen3-VL-8B-Instruct takes approximately 13 seconds, 6.7× faster. Right panel (Cost): GPT-5.2 costs approximately $0.045 per zoom step; Qwen3-VL-8B-Instruct costs approximately $0.001, 43× cheaper. — Not every action needs a big model. Routing visual verification steps to a small 8B model cuts cost by 43× and latency by 11× per step versus GPT-5.2 while actually improving overall accuracy. The gains compound across every decision in the trajectory.

The problems are heterogeneous and so the optimal system that solves them must be too. Mixed open-source vision-language-action models at different scales, each routed to the sub-task it's best suited for, orchestrated end-to-end through a single infrastructure.

Cache-Intensive Problems - 2.4x speedup with optimal eviction for heterogeneous workflows

Topology-aware cache management exploits heterogeneous complex workflow structure to eliminate redundancy.

The previous sections showed that we can discover the right decomposition of workflows across models and silicon through joint optimisation, unlocking performance that homogeneous systems cannot match. But as these systems scale - more models, more steps, more branching - a new cost compounds: redundant computation. In production, much of the work is predictable and repeated: the same pipelines, the same system prompts, the same reasoning chains executed over and over against changing inputs. Every inference call that recomputes tokens the system has already seen is wasted work.

When a language model processes a prompt, it builds an internal representation of every token - a key-value (KV) cache that stores what the model has "read" so far. This cache is expensive to compute and expensive to store. Caching lets the system reuse that work rather than rebuilding it from scratch on every call. But not every node in a workflow has the same cache profile - some carry large, frequently reused context, while others are lightweight and transient. When the cache fills up, something has to go. Today's eviction policies - Least Recently Used (LRU), Least Frequently Used (LFU) - are optimised for multi-user chat serving, not agents executing structured workflows. They treat every node identically, mistaking "used a while ago" for "no longer needed". In an agentic workflow with predictable structure, this wastes compute. The execution graph already encodes what will be needed and when, but commonly used runtimes such as vLLM or SGLang do not exploit it.

We replace heuristic eviction with topology-aware cache management: rather than evicting what was used longest ago, we exploit the execution graph to predict what will be needed, evicting the node furthest from future use. This is an instance of Bélády’s provably optimal eviction strategy, producing the fewest possible cache misses - a principle KVFlow demonstrated applies to agentic LLM systems. We turn this insight into a practical cache management policy. The difference is immediate: in a synthetic loop of six LLM calls with cache capacity for five, LRU evicts the node accessed longest ago - which in a loop is the next one needed. On the second iteration, LRU produces six cache misses. Our approach produces one. Eviction is just the beginning. As our runtime can see the workflow's topology, the same structural awareness enables pre-fetching context before it's needed, hierarchical caching across memory tiers and intelligent scheduling across different models on different silicon.

Animated diagram comparing two cache eviction strategies for a six-agent loop with cache capacity of five over two iterations. On the left, LRU (Least Recently Used) evicts the node accessed longest ago — which in a loop is always the next node needed, producing repeated cache misses on the second iteration. On the right, Callosum's Furthest Future Use strategy evicts the node furthest from future use based on the workflow topology, resulting in far fewer misses. Nodes are colour-coded: pink for cached, green for hit, red for miss, and faded for evicted. — Standard caching blindly evicts what was used longest ago. In a loop, that's exactly the wrong thing to throw away. Left: Least Recently Used (LRU), the eviction policy in today's serving runtimes, such as vLLM and SGLang, always evicts the next node needed - six cache misses on the second iteration. Right: Callosum reads the execution graph and evicts what's furthest from future use - just one miss. Fewer misses means faster workflows.
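The six-node loop example can be reproduced with a small simulation. The sketch below models each LLM call as a single cache entry and assumes the full access trace is known in advance, which is what the execution graph provides in the setting described. It is a toy model of the two eviction policies, not the runtime itself, and the function names are ours.

```python
from collections import OrderedDict
from typing import List


def lru_misses(trace: List[int], capacity: int) -> List[bool]:
    """Simulate LRU; return one miss flag per access."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    misses = []
    for node in trace:
        if node in cache:
            cache.move_to_end(node)        # mark as most recently used
            misses.append(False)
            continue
        misses.append(True)
        if len(cache) >= capacity:
            cache.popitem(last=False)      # evict the least recently used
        cache[node] = True
    return misses


def furthest_future_use_misses(trace: List[int], capacity: int) -> List[bool]:
    """Simulate Belady-style eviction using knowledge of the future trace."""
    cache = set()
    misses = []
    for i, node in enumerate(trace):
        if node in cache:
            misses.append(False)
            continue
        misses.append(True)
        if len(cache) >= capacity:
            def next_use(n: int) -> float:
                # Position of the next access to n, or infinity if never reused.
                for j in range(i + 1, len(trace)):
                    if trace[j] == n:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(node)
    return misses


trace = list(range(6)) * 2                 # six calls, two iterations
lru_second = sum(lru_misses(trace, capacity=5)[6:])
ffu_second = sum(furthest_future_use_misses(trace, capacity=5)[6:])
print(lru_second, ffu_second)              # misses on the second iteration
```

On this trace the simulation gives six second-iteration misses for LRU and one for furthest-future-use eviction, matching the counts quoted above.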

By isolating the relationship between output-to-input tokens for this workflow, we observe speedups of up to 2.4x in prefill-dominated regimes. This clearly illustrates what topology-awareness makes possible.

Line chart comparing end-to-end latency speedup (y-axis) against token ratio of output to input tokens (x-axis) for two eviction strategies. LRU (dark blue) remains flat at 1.0× across all ratios - no speedup. Callosum's Furthest Future Use (purple) rises sharply as the token ratio decreases, reaching up to 2.4× speedup in the most prefill-dominated regimes. The shaded purple area between the two curves highlights the growing advantage as workflows become more context-heavy. - The more context-heavy the workflow, the bigger the win. Topology-aware caching delivers up to 2.4× speedup in prefill-dominated regimes - which is where most agentic work lives. As the ratio of input to output tokens grows, LRU gains nothing while the advantage for Furthest Future Use (FFU) compounds.

The gains here scale with the number of repetitions whenever there is a cache deficit - that is, when the workflow's cache demand exceeds the available hosted cache. This encompasses all but the most lightweight workflows, thus providing benefits where it really matters. The time savings (ΔT) scale linearly with the number of repetitions in the workflow (M) and hosted cache size (C), up to the point where C satisfies the full cache demand:

$$\Delta T \propto M \times C$$

This is crucial for modern agentic workflows. The more complex and repetitive the workflow, the greater the advantage. This already extends to real-world problems. Generating a podcast-style interview between two characters, each with distinct personalities, requires detailed system prompts. Large inputs mean costly recomputation. With topology-aware eviction alone, we achieved an out-of-the-box 20% speedup - and this is before any of the deeper optimisations that topology-awareness unlocks.

A contour plot showing the relationship between cache size (C) on the x-axis and repetitions (M) on the y-axis, with contour lines colored from dark blue (low time savings) to purple/magenta (high time savings). Various AI workload types are positioned across the space: Chat Assistants at the origin (low C, low M), Context Extraction/Summarisation at high C but low M, Closed-Loop Evolutionary Search at low C but high M, Agentic Coding in the middle, Deep Research Agent at high C and moderate M, Multi-Agent Scientific Discovery at high C and high M, and Hierarchical Robotic Task Execution & Collaboration in the upper-right corner with maximum time savings. — Speedup scales with both workflow repetitions (M) and hosted-cache size (C) in any cache deficit regime, where cache demand exceeds hosted-cache size. The contour lines show that time savings grow as workloads move into the upper-right quadrant - loop-heavy, large-context regimes that characterise modern agentic systems. From simple chat assistants to multi-agent scientific discovery and robotic task execution, the more complex the system, the greater the advantage.

Topology-awareness is crucial for heterogeneous infrastructure. When workflows span different models on different silicon, each with different context lengths, memory capacities, and prefill costs, the execution graph is the only structure that unifies them. We are building a topology-aware runtime that reasons about cache residency across chips, predicts which nodes will be revisited across heterogeneous execution paths, and schedules prefills where they are cheapest. Cache eviction is where we demonstrate it first.

Tool Calling Problems - ensemble inference at near-zero overhead on custom silicon

Every class of problem we've addressed so far shares a common dependency: tool calling. It is how heterogeneous agents act on the world. When a model queries a database, triggers an API, or hands off to another agent, it does so primarily through structured outputs - valid JSON conforming to a schema that defines what actions are available. As workflows grow in complexity, tool calling becomes critical. Get it wrong and the performance breaks. Get it right but too slowly, and the economics break instead.

Today, large models produce reliable tool calls but are slow and expensive. Small models are fast and cheap but unreliable. Neither scales in production. This is due to an architectural bottleneck: constraining model output to valid structure - known as grammar enforcement - is typically done in Python, on a CPU, round-tripping over the PCIe bus between processor and accelerator. Most open-source serving stacks work this way. The overhead scales linearly with batch size - O(B) - growing over 100x from batch 1 to batch 64, consuming up to 23% of the decode step. As serving scales, so does the grammar bottleneck.

Working closely with AWS Inferentia2 silicon, we moved the entire operation on-die. JSON schemas compile into finite state machines, and a custom NKI kernel performs constrained decoding entirely in NeuronCore SBUF. The mask lives in on-chip SRAM, right alongside the logits - no round-trip, no CPU involvement. The result is O(1) scaling: grammar enforcement adds 1.4μs at batch 1 and 1.7μs at batch 64 (1,767x faster than CPU masking at batch 64), under 0.01% of the decode step. Every token is guaranteed structurally valid by the finite state machine, with no retries, validation, or wasted tokens.

Log-scale line chart comparing grammar constraint overhead per decode step (y-axis, microseconds) against batch size (x-axis) for three implementations. CPU dense loop and CPU vectorized both scale linearly with batch size - O(B) - rising from ~30µs at batch size 1 to ~3000µs at batch size 64. On-die NKI (green) remains flat at ~1.5µs regardless of batch size - O(1) - achieving 19× faster at batch size 1 and 1767× faster at batch size 64. — Every tool call in a workflow needs valid structured output. We enforce it directly on chip. CPU-based grammar enforcement scales O(B) with batch size, making simultaneous tool calling increasingly expensive. On-die masking on Inferentia2 runs in O(1) - constant time regardless of batch size - delivering 19× less overhead at batch size 1 and 1767× at batch size 64. Zero retries, zero overhead scaling, every token guaranteed to be valid.
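The masking logic itself is simple, which is part of why it can live next to the logits. The sketch below shows the idea in plain Python on a CPU: a finite state machine decides which tokens are legal in the current state, and every other logit is set to negative infinity before sampling. It is not the NKI kernel. The eight-token vocabulary, the hand-written state machine for the grammar `{"tool": "search" | "lookup"}` and the stand-in model are all hypothetical.

```python
import json
import math
from typing import Callable, Dict, List

VOCAB = ["{", "}", '"tool"', ":", '"search"', '"lookup"', ",", "hello"]

# state -> {allowed token: next state}; a schema compiler would generate this.
FSM: Dict[int, Dict[str, int]] = {
    0: {"{": 1},
    1: {'"tool"': 2},
    2: {":": 3},
    3: {'"search"': 4, '"lookup"': 4},
    4: {"}": 5},
}
ACCEPT = 5


def mask_logits(logits: List[float], state: int) -> List[float]:
    """Keep logits of tokens the FSM allows; set all others to -inf."""
    allowed = FSM[state]
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_greedy_decode(logit_fn: Callable[[int, List[str]], List[float]],
                              max_steps: int = 16) -> List[str]:
    state, out = 0, []
    for step in range(max_steps):
        if state == ACCEPT:
            break
        masked = mask_logits(logit_fn(step, out), state)
        token = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        out.append(token)
        state = FSM[state][token]          # advance the grammar state
    return out


def biased_model(step: int, prefix: List[str]) -> List[float]:
    """Stand-in model that always prefers the invalid token "hello"."""
    return [0.1, 0.2, 0.3, 0.0, 0.5, 0.4, 0.0, 9.0]


tokens = constrained_greedy_decode(biased_model)
parsed = json.loads("".join(tokens))       # parses because the FSM forced validity
```

Even though the stand-in model strongly prefers an invalid token at every step, the output is always parseable, because invalid tokens are never eligible. In a batched CPU implementation this masking is repeated per sequence, which is the O(B) cost the on-die kernel is described as avoiding.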

Despite this guarantee of structural validity, small models also struggle with semantic correctness - the actual content in the tool call being right - especially when compared to much larger models. Given this challenge, and the near-zero marginal cost of grammar enforcement as batch size increases, ensemble inferencing becomes very attractive. Under ensemble inferencing, multiple candidates generate simultaneously, each exploring different semantic completions of the same tool call. Where a single sample may be wrong, agreement across independent samples boosts quality.

This changes what small models can do. A 1B model generating eight grammar-constrained candidates with a naive selection schema achieves 42.27% cross-task accuracy on structured data extraction - pulling multiple numerical and descriptive fields from context - a +11 point improvement over a greedy pass from the same model and a +2 point improvement compared to an 8B model at an increased inference speed. The latency saved by on-die masking is what makes this possible: eight candidates run at comparable latency to a single baseline call, since the infrastructure overhead avoids scaling entirely.

Two-panel chart comparing accuracy and latency as the number of candidates increases from 1 to 8. Left panel: Context extraction accuracy rises from about 29% to 42% using a voting schema with temperature 0.7, surpassing the 8B model baseline at temp=0 (41%). A ceiling curve with perfect selection reaches approximately 55%. Right panel: Wall-clock latency remains flat at around 270ms regardless of candidate count, compared to 420ms for the 8B model at temp=0, which makes the ensemble 1.5 times faster. Error bars shown for both metrics. - Scaling small-model ensembles on a context extraction problem. Left: a voting schema over N candidates from a 1B model (temp=0.7) matches and exceeds 8B single-pass accuracy by N=8, with a perfect-selection ceiling suggesting further gains from better aggregation. Right: latency stays nearly constant across candidate counts thanks to parallel execution, running 1.5× faster than the 8B baseline.
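One simple form of selection over grammar-constrained candidates is a majority vote on the whole tool call. The sketch below assumes exactly that; the post does not specify the voting rule used, so treat this as one plausible reading. The candidate strings are invented for illustration. Because every candidate is guaranteed to be valid JSON, they can be parsed and canonicalised before counting, so that key order does not split the vote.

```python
import json
from collections import Counter
from typing import Any, List, Tuple


def majority_vote(candidates: List[str]) -> Tuple[Any, float]:
    """Return the most common candidate (as parsed JSON) and its vote share."""
    canonical = [json.dumps(json.loads(c), sort_keys=True) for c in candidates]
    winner, votes = Counter(canonical).most_common(1)[0]
    return json.loads(winner), votes / len(candidates)


# Eight hypothetical candidates for the same extraction call.
candidates = [
    '{"field": "revenue", "value": 42}',
    '{"value": 42, "field": "revenue"}',
    '{"field": "revenue", "value": 41}',
    '{"field": "revenue", "value": 42}',
    '{"field": "revenue", "value": 40}',
    '{"value": 42, "field": "revenue"}',
    '{"field": "revenue", "value": 41}',
    '{"field": "revenue", "value": 42}',
]
best, agreement = majority_vote(candidates)
```

The gap between voting and the perfect-selection ceiling in the chart suggests that better aggregation than a plain vote, for example voting per field, could recover more of the available accuracy.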

What this teaches us is that with the right infrastructure, the quality gap between small and large models is not fixed. Small models orchestrated intelligently can deliver capability parity with significantly larger models. By writing custom kernels that compile directly onto the NeuronCore, we give small models access to these strategies that were previously uneconomical. This is what full control of the silicon makes possible. The model does not need to be smarter. It needs more shots, and the infrastructure needs to make them free. The workflow knows what format is needed. The compiler knows what tokens are valid. The kernel enforces the constraint on silicon.

This is ensemble inference for tool calling on custom silicon and only scratches the surface of what becomes possible when intelligence and silicon are co-evolved.

What comes next

Everything we've shown here is early evidence for a deeper thesis: as the problems we need to solve in the real world grow in difficulty and complexity, the systems that solve them must grow in diversity. Heterogeneous systems - diverse models on diverse hardware, co-evolved end-to-end - unlock scaling territory that homogeneous systems cannot access. In our next post, we formalise this into a theory.

The era of homogeneous scale delivered extraordinary progress. What comes next is Heterogeneous Intelligence - where models, workflows, and silicon co-evolve, and every new source of diversity makes the whole system smarter, faster, and cheaper. Every new dimension of heterogeneity we add compounds with every other. The configuration space is vast and we have only just begun to explore it.
