As a heavy user of coding agents, my mental model is that API token prices drift up over time, and it creeps up on you. It's arguable — competition is real — but it's what I was thinking when three things landed in a two-month window.
At some point I saw the release of MiniMax M2.5 and it scored 80.2% on SWE-bench Verified. Claude Opus 4.6 sits at 80.8%. On Multi-SWE-Bench — the multi-file engineering benchmark where real work happens — M2.5 is at 51.3% and Opus 4.6 is at 50.3%. M2.5 beats Opus on the harder benchmark. These are all vendor-reported numbers; independent leaderboard results sometimes differ. That blew my mind slightly.
Around the same time OpenClaw hit 247,000 GitHub stars in about 60 days. I read about public gatherings of fans of OpenClaw. Then on April 4, Anthropic cut Claude Pro and Max subscriptions from running inside OpenClaw and any third-party harness — users now pay API rates.
A friend who sells to enterprises was watching the OpenClaw movement too. His customers kept asking the same thing: how do we run agents without breaking data residency or contracts? Cost came up for bigger orgs but wasn't the main driver. Neither was latency. The data legally cannot leave the building or jurisdiction.
The question was narrower than "can local models code?" It was: can a small local box run the full agent loop — model, memory, tool use, conversation history — well enough to stay usable day after day? The NVIDIA inference ecosystem is familiar to me, and NVIDIA was pushing the Spark hard. I wanted to find out.
All of these seemed to converge toward running agentic loops locally. So I got a DGX Spark. What follows is my quest into this space. With under the hood details.
What is this?
The experience might look familiar: talk to AI via chat - in this case Discord - and it does some work, some thinking, comes back with answers. It does tool calls to look up info, has memory to recall our past conversations, edits files and even my calendar.
The agentic loop is the thing that makes a model do work: pick a tool, call it, read the result, decide what to do next, and remember it later. Wire that up to an interface — Discord, Slack, a terminal, a webhook — and you have an agent a person can actually use.
The five things that I needed to run locally to achieve my goal:
- The model weights
- The inference engine
- The agent runner (prompt loop, tool dispatch)
- The vector store for persistent memory
- The orchestration — systemd units, startup gating, health checks, restarts
All five are now on my machine. No cloud API keys for the reasoning path. No per-token metering on the brain. The agent can still reach out over the network when I want it to — web search, Discord, a calendar — but that is a tool call, not a dependency for the loop itself. If Anthropic's API is down, my agent is fine. If my internet is out, it can still run shell commands, work on local files, and answer questions from memory. The brain stays on the machine.
Here is what the stack looks like:

The interesting part, for me, is not any single component. It is that all five of the things that used to force you into someone else's data center are now on the same physical box, and the agent does useful work at a quality tier that was not viable six months ago.
The Agent Runner: HKUDS/nanobot
I did not write the agent. Let me say that clearly up front because I have seen people conflate "I am running X" with "I built X," and that is not this story.
The runner is HKUDS/nanobot — MIT-licensed, MCP-native, multi-provider (including MiniMax), and self-described as "the ultra-lightweight personal AI agent — 99% fewer lines of code" compared to OpenClaw. That framing fits. OpenClaw's growth is real, but most of it is scaffolding I don't need for a single-user personal agent — its repo is ~430k lines; nanobot is ~4k. Nanobot is closer to what a prompt loop actually is once you strip the multi-user and plugin-ecosystem layers off. It also supports skills, subagents, and cron scheduling — enough to spawn parallel work without inheriting a framework.
What I did was point it at a local model. The nanobot README lists MiniMax as a provider option for a remote API — I swapped that out for http://localhost:8001 pointing at llama-server. The agent code does not know or care. From its perspective, it is hitting an OpenAI-compatible endpoint.
Day-to-day I use it for chat-with-memory (ChromaDB pulls past conversations back into the prompt — not a toy RAG demo, how I retain context across days), shell commands on my machines via MCP (static deny-lists and owner-gating protect the obvious targets; no mid-turn confirmation yet — a gap I need to close before pointing it at anything destructive), web search when the model's training data is stale, and calendar via MCP.
The default is always the local model. There is a manual fallback to cloud APIs for the specific case where the local service is restarting or I genuinely want a different model's opinion, but it is not automatic — I have to opt in, and I can see which brain answered. In normal operation, nothing about the reasoning path leaves the box.
Memory-wise, nanobot itself is a rounding error: ~100MB RAM for the runner process, roughly 10× less than OpenClaw's ~1GB. On a 128GB box where ~95GB goes to weights and another ~25GB to KV cache, the runner is not the constraint. ChromaDB is the thing to watch as the corpus grows — vector DBs balloon.
The point of doing any of this on my own hardware is not speed. The cloud is faster. The point is that the reasoning path is mine — agent, memory, model — and I own every layer of it.
The Hardware: DGX Spark
The DGX Spark pairs a Grace ARM CPU with a Blackwell GB10 GPU sharing a single 128GB LPDDR5x memory pool over NVLink-C2C. No separate VRAM — CPU and GPU see the same physical memory at the same addresses. Which means you can load models that would never fit in a traditional GPU's dedicated memory, but you are bottlenecked by 273 GB/s bandwidth.
The specs, alongside the comparable Mac Studio configuration at the same memory tier:
| Spec | DGX Spark | Mac Studio M4 Max (128GB) |
|---|---|---|
| GPU | GB10 (Blackwell, SM121) | M4 Max, 40-core integrated |
| Memory | 128 GB LPDDR5x unified | 128 GB unified |
| Bandwidth | 273 GB/s | 546 GB/s |
| FP4 tensor compute | 192 tensor cores, 427 TFLOPS | Neural Engine, no FP4 tensor cores |
| CPU | Grace ARM, 20 cores | M4 Max, 16 cores |
| Price (128GB config) | $4,700 | $3,699 |
(Side note on Spark pricing: the announcement price was $3,000. Retail shipped at $4,700. Not sure what happened between those two numbers but it is worth knowing if you are budgeting for one.)
On paper the Mac Studio is the sharper machine for this workload — 2× the memory bandwidth and a few hundred dollars cheaper. For pure llama.cpp decode, it would probably be faster. I sized up the Mac Studio comparison only after the Spark was already on my desk, and by then I was committed. That is part of the honesty of this post: I'm running what I have. I do wish NVIDIA gave a 256GB option, or just faster memory, at least M4 Max bandwidth, would be a fabulous machine. Alas.
At roughly 30 tok/s for interactive chat it is fast enough for a personal agent. Not fast enough for production serving to hundreds of users — but that was never the goal.
The Model and Runtime: MiniMax M2.5 on llama.cpp
I chose M2.5 because I want to use it for coding work — ideally with me out of the loop, overnight, on a handful of ideas I have sketched out. It's not multimodal, which is an inconvenience for general purpose. The benchmark numbers I opened with (80.2% on SWE-bench Verified, ahead of Opus on Multi-SWE-Bench, and open-weight) made it the first open-weight model I thought was credibly in that tier.
The architecture worth knowing:
- 229B total parameters, ~10B active per token. MoE with 256 experts; the router selects a handful per token, so inference compute scales with active parameters, not total.
- 62 layers, 48 attention heads, 8 KV heads. GQA with 8 KV heads means the KV cache is small per token — matters enormously for context length.
- 1M token context. The architecture supports up to 1M; memory and bandwidth limit how much you can actually use in practice.
- Built-in MTP (Multi-Token Prediction). A small speculative-decoding head that proposes several tokens ahead, which the main model verifies in one pass. On bandwidth-bound hardware, whether that actually speeds anything up is its own question — answer in another post.
At Q3_K_XL quantization (~3 bits per weight) M2.5 fits in ~95GB. Quality cost is a few percent on perplexity — not the "ruined" territory you hit below 3 bits. That leaves 33GB of the 128GB pool for KV cache, llama-server, OS, and whatever else is running. Margins are thin but workable.
The inference engine is llama.cpp, served via llama-server as a systemd service. Three reasons I picked it:
- It works today. No conversion pipeline, no engine build step. Download the GGUF, point llama-server at it, get an OpenAI-compatible API on localhost:8001.
- It supports the quantization I need. Q3_K_XL at ~95GB fits with room for KV cache. TRT-LLM's NVFP4 format would be ~228GB for M2.5, which does not fit at all. And NVFP4 inference did not actually work on the Spark at the time either — there were bugs in the consumer-Blackwell path I ended up chasing and landing a PR for. More on that in a later post.
- It already uses tensor cores. This surprised me. I assumed llama.cpp was pure CUDA cores and the Spark's 192 tensor cores were idle. Wrong. During prefill, llama.cpp uses INT8 tensor core MMA for quantized weight GEMMs and FP16 tensor core MMA for flash attention — nsys shows 38.7% of prefill time in INT8 MMA kernels. Tensor cores are only idle during single-token decode, which is memory-bandwidth-bound regardless.
Production config:
# /home/mihai/.config/systemd/user/llama-minimax.service [Service] ExecStart=/home/mihai/workspace/llama.cpp/build/bin/llama-server \ --model /data/models/MiniMax-M2.5-UD-Q3_K_XL.gguf \ --host 0.0.0.0 --port 8001 \ --ctx-size 160000 \ --cache-type-k q4_0 --cache-type-v q4_0 \ --no-mmap \ --flash-attn on \ --n-gpu-layers 999 \ --threads 10
Runs 24/7. Systemd restarts on crash. A health-check gate in nanobot's startup sequence waits for /health to return 200 before it accepts Discord messages. Cold loading 95GB from NVMe into memory takes about 10 minutes — --no-mmap forces the whole model up front into a contiguous buffer instead of lazy-faulting pages on first access, which trades startup time for no page faults during inference.
Current measured, April 2026:
| Metric | Value |
|---|---|
| Model | MiniMax M2.5, Q3_K_XL (~95GB) |
| Context window | 160,000 tokens |
| KV cache format | q4_0 keys / q4_0 values (4.5 bpw, with upstream WHT rotation) |
| Decode speed (short context) | ~30 tok/s |
| Decode speed (10K context) | ~18 tok/s |
| Memory layout | model contiguous via --no-mmap |
| Uptime | 24/7 systemd service |
| Quality vs f16 KV cache | +0.64% perplexity on my eval corpus |
About 30 tok/s at short context, dropping to ~18 at 10K. The model types faster than I can read. For a single user and a single conversation at a time, that is plenty.
The Honest Part
Those numbers look fine. I want to front-load what they don't show — the things I have not done yet — because otherwise this post overclaims.
I have not run overnight autonomous coding work on this stack. The demo I just described is nanobot as a chat assistant — Discord, Telegram, Slack, shell tools, web search, memory. That is proof of life. It is not the destination.
The destination is this: kick off an agent, give it a real engineering task (blank repo, implement some ideas I have and get them to run. PoCs, essentially), walk away, wake up to a set of tested PRs worth reviewing. M2.5 is the first open-weight model that is credibly in that tier for autonomous work — 51.3% on Multi-SWE-Bench is not a toy number. But benchmark numbers and "it actually did the job overnight on my hardware" are not the same claim, and I have not made the second one yet. I went down the fascinating inference rabbit hole. I'll share more on what I dug into in this journey.
One more caveat on the benchmarks I opened with: Opus 4.7 is out, and M2.7 just dropped. The numbers I'm quoting are M2.5 vs Opus 4.6 because that is what I actually ran. The structural point holds — M2.7 should narrow the gap further, and upgrading the deployment is part of the experiment I'm running next.
For whatever it's worth, I am not the only person going in this direction. At GTC 2026 NVIDIA announced NemoClaw — an enterprise wrapper on OpenClaw with a privacy router that keeps sensitive data on local Nemotron models while routing complex reasoning to cloud. Same general shape, minus the commitment to keep the reasoning local too. That last part is the expensive shape at my scale, and it only works if the local model is good enough — which is why M2.5 is the first one I bothered with.
That is the next post. When I run that experiment honestly — including the failure modes, the context-length pain, the times the agent went off the rails — I will write it up.
What I notice
I don't know if this is a new category forming. What I notice is that five small frictions stopped being annoying: it's always there (overnight tasks never produce a billing surprise); the harness is mine (prompt loop, tool definitions, permission rules — all editable without waiting for someone else's release cycle); token counts stopped being a variable (cost is amortized into hardware I bought once); the data doesn't build up a profile on someone else's servers; and security is configurable on my terms.
None of those is a business model. Each is a small friction that disappeared. In aggregate that might matter. What I'm more confident about than a month ago: the point isn't that this beats the cloud at inference — it doesn't. It's that the full reasoning path sits on one box I own.
Mihai Chiorean is a software engineer in San Francisco. Previously CTO at Wendy Labs (edge OS on Yocto/Jetson), EM at Cash App (compliance rules engine, $100B+ txn volume), and engineer at Uber, Block/TBD, and InVision. He contributes to TensorRT-LLM and NemoClaw, and works on local AI systems on NVIDIA hardware.