Homelab for Autonomous Agents
Most "AI agents" content assumes the cloud. This is the other side: the $400 prosumer rig in my apartment that runs three autonomous bots 24/7 on local Ollama, plus a security NVR, plus monitoring, on a 2-core / 4-thread CPU and four mixed-generation GPUs.
The point isn't that this rig competes with an A100. It's that for the agent workloads I actually run, prosumer hardware is enough – and the constraints make you think harder about what you're really doing.
The hardware
| Part | What |
|------|------|
| Motherboard | ASUS ROG Strix B450-F Gaming II (AM4, B450) |
| CPU | AMD Athlon 3000G (Zen+, 2C / 4T, 3.5 GHz, no boost) |
| RAM | 16 GB DDR4-2133 (2× 8 GB Corsair, single rank) |
| Storage | 224 GB SSD (LVM) |
| GPU 0 | GTX 1080 (Pascal, 8 GB GDDR5X, SM 6.1) – PCIEX16_1 direct |
| GPU 1 | GTX 1660 Ti (Turing, 6 GB GDDR6, SM 7.5) – PCIEX1_1 riser |
| GPU 2 | GTX 1660 Ti (Turing, 6 GB GDDR6, SM 7.5) – PCIEX1_3 riser |
| GPU 3 | GTX 1660 Ti (Turing, 6 GB GDDR6, SM 7.5) – PCIEX1_2 riser |
| Total VRAM | 26 GB (8 + 6 + 6 + 6) |
| Total CUDA cores | 7,168 (2560 + 3× 1536) |
| OS | Ubuntu 24.04.4 LTS, kernel 6.8 |
| NVIDIA stack | Driver 580.126.09, CUDA 13.0, NVIDIA Container Toolkit 1.18 |
A B450 motherboard from 2018, an Athlon APU that was entry-level when it shipped, and a pile of 2016–2019 Pascal/Turing GPUs that nobody wants to game on anymore. About $400 in 2026 second-hand prices.
The "wait, that works?" moments
PCIe x1 risers don't matter for inference
The three 1660 Ti cards run on x1 Gen2 risers – about 500 MB/s of host-to-GPU bandwidth each. PCIe purists wince. But once a model is loaded into VRAM, inference runs at full GPU speed regardless of link width. PCIe bandwidth only matters for host-to-GPU data transfer, which happens once per model load. For an agent that keeps a model warm and answers requests, x1 is fine.
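One way to confirm what each card actually negotiated is nvidia-smi's PCIe query fields (the link can downshift at idle, so check under load):

```bash
# Show the negotiated PCIe generation and link width for every GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current \
  --format=csv
```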
The 2-core CPU is the actual bottleneck
The Athlon 3000G has 2 physical cores. Anything CPU-bound (tokenization, data preprocessing, CPU inference, large request fan-out) bottlenecks fast. The trick is to push everything to GPU and keep the CPU pipeline lean: one HTTP request in, GPU inference, one response out. No batch parallelization, no preprocessing pipelines, no host-side rerankers.
Frigate and Ollama can share a GPU politely
GPU 1 is reserved primarily for Frigate NVR. The Frigate ONNX detector + FFmpeg H.264 decode together use about 130–360 MiB. That leaves ~5.5 GB free on a 6 GB card – enough for a small inference model alongside, with the constraint that detection latency is the SLO, not inference. So GPU 1 is Frigate-first and only takes light inference; GPUs 2 and 3 are dedicated inference; GPU 0 (the 1080 with 8 GB) gets the largest models.
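A quick way to see exactly what Frigate is holding on the shared card is the per-process memory query:

```bash
# Per-process VRAM usage on the Frigate card (GPU index 1; -i also accepts a UUID)
nvidia-smi -i 1 --query-compute-apps=pid,process_name,used_memory --format=csv
```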
The Ollama layout
Each GPU runs its own Ollama instance, pinned by GPU UUID via CUDA_VISIBLE_DEVICES set in the systemd unit. This matters: if you let Ollama see all GPUs, it decides placement on its own and you have no idea which agent is running on which silicon. Pinning by UUID (not by index – indices can shift when devices re-enumerate after a driver reload or hardware change) gives you stable, debuggable per-GPU agents.
| Port | Service | GPU | Use case |
|------|---------|-----|----------|
| :11434 | ollama.service (default) | all GPUs (used by the vision pipeline) | gemma vision detection on RTSP frames |
| :11435 | ollama@gpu2.service | GPU 2 (1660 Ti, pinned by UUID) | Dedicated inference agent |
| :11436 | ollama@gpu3.service | GPU 3 (1660 Ti, pinned by UUID) | Dedicated inference agent |
| :11437 | ollama@gpu1.service | GPU 1 (Frigate co-tenant) | Light inference only |
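A minimal sketch of the per-instance pinning, assuming a templated ollama@.service plus a drop-in per instance; the UUID below is a placeholder (real ones come from nvidia-smi -L), and OLLAMA_HOST sets the bind address and port:

```ini
# /etc/systemd/system/ollama@gpu2.service.d/override.conf (sketch; UUID is a placeholder)
[Service]
# Pin this instance to one physical card by UUID, never by index
Environment="CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
# Give each instance its own bind address and port
Environment="OLLAMA_HOST=0.0.0.0:11435"
```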
This is the inference substrate behind Autonomous Agent Arena – three bots competing 24/7 on arenabot.io, each calling its own dedicated Ollama port.
What runs there
| Service | Container | Purpose |
|---------|-----------|---------|
| Frigate NVR | ghcr.io/blakeblackshear/frigate:stable-tensorrt | Object detection on RTSP camera, GPU 1 |
| frigate-notify | custom build | Telegram alerts on detection events |
| gemma-vision-notify | custom Python container | Pulls RTSP frames, sends to local gemma model on :11434, posts WhatsApp alerts ("FedEx truck pulled up") – sketched below |
| Netdata | netdata/netdata | System monitoring, OTLP receiver on :4317 |
| Dockge | louislam/dockge | Lightweight Docker compose UI; manages the other stacks |
| Autonomous Agent Arena bots | three custom containers | The actual agent workload, ~21 MB RAM each, under 1% CPU |
Total memory pressure with all of this running: comfortably under 12 GB of the 16 GB available. Most of it is Frigate.
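For flavor, the gemma-vision-notify loop is roughly this shape. This is a hedged sketch, not the production container: the RTSP URL, model tag, and prompt are placeholders, and the WhatsApp step is left out.

```python
import base64

import cv2        # opencv-python, to grab a frame from the RTSP stream
import requests   # to call the local Ollama API

RTSP_URL = "rtsp://camera.local:8554/front_door"    # placeholder
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default instance from the table above


def describe_current_frame() -> str:
    # Grab a single frame from the camera
    cap = cv2.VideoCapture(RTSP_URL)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("could not read a frame from the RTSP stream")

    # JPEG-encode and base64 the frame, which is how Ollama expects images
    ok, jpg = cv2.imencode(".jpg", frame)
    image_b64 = base64.b64encode(jpg.tobytes()).decode()

    # Ask the local vision model what it sees
    resp = requests.post(OLLAMA_URL, json={
        "model": "gemma3:4b",  # placeholder model tag
        "prompt": "Describe any vehicles or people in this frame in one sentence.",
        "images": [image_b64],
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]


if __name__ == "__main__":
    # The real container posts this to WhatsApp instead of printing it
    print(describe_current_frame())
```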
The networking choice: Tailscale + Cloudflared, no exposed ports
LAN-only services (Frigate UI, Netdata, Dockge) are reachable via Tailscale. Anything that genuinely needs public access goes through a Cloudflared tunnel – outbound TCP only, no port forwarding on the home router, no inbound firewall holes. The home router doesn't see this server as "exposed" at all.
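The tunnel side is a plain cloudflared config. A sketch with placeholder tunnel ID, credentials path, and hostnames: only services that genuinely need to be public get an ingress rule, and everything else falls through to a 404.

```yaml
# ~/.cloudflared/config.yml (sketch; tunnel ID, credentials path, and hostnames are placeholders)
tunnel: 00000000-0000-0000-0000-000000000000
credentials-file: /home/user/.cloudflared/00000000-0000-0000-0000-000000000000.json

ingress:
  # Public-facing service goes out through the tunnel, outbound-only
  - hostname: bots.example.com
    service: http://localhost:8080
  # Everything else: no exposure
  - service: http_status:404
```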
The netplan config pins the network interface name to a MAC address. This sounds paranoid until you swap a GPU, Linux renumbers the PCIe bus IDs, your Ethernet interface comes up as enp7s0 instead of enp4s0, and your whole network stack breaks. MAC pinning prevents that – the interface name follows the hardware MAC, not the bus position.
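A sketch of the relevant netplan stanza (MAC and interface name are placeholders; addressing details omitted):

```yaml
# /etc/netplan/01-lan.yaml (sketch; MAC and name are placeholders)
network:
  version: 2
  ethernets:
    lan0:
      match:
        macaddress: "aa:bb:cc:dd:ee:ff"  # the interface name follows this MAC
      set-name: lan0
      dhcp4: true                        # just to keep the sketch complete
```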
What models fit where
| GPU | VRAM | Comfortable model | Tok/s observed |
|-----|------|-------------------|----------------|
| GPU 0 (1080, 8 GB) | 8 GB | Llama 3.1 8B Q4 | ~25–35 tok/s |
| GPU 2 (1660 Ti, 6 GB) | 5.4 GB usable | Mistral 7B Q4, Qwen 2.5 3B | ~26–40 tok/s |
| GPU 3 (1660 Ti, 6 GB) | 5.4 GB usable | DeepSeek-R1 1.5B, LLaMA 3.2 3B | ~30–38 tok/s |
| GPU 1 (1660 Ti, 6 GB) | ~5.4 GB minus Frigate's 360 MiB | Light models only | not benchmarked under load |
The 1080 has 8 GB but it's Pascal: no Tensor cores, and FP16 throughput on GP104 is deliberately crippled to a small fraction of FP32 rate. The 1660 Tis are Turing (TU116) – no Tensor cores either, but they run FP16 at double rate and have solid INT8 support. For inference of ≤7B-parameter models they're actually a better fit than the 1080. The 1080 wins only when you need the extra 2 GB to load an 8B model at all.
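The tok/s figures come straight from Ollama's own response counters. A sketch of how to reproduce the measurement against one pinned instance, with the port and model tag as placeholders for whatever that instance serves:

```python
import requests

# One of the pinned instances from the table above, e.g. the GPU 2 agent
OLLAMA_URL = "http://127.0.0.1:11435/api/generate"

resp = requests.post(OLLAMA_URL, json={
    "model": "mistral:7b-instruct-q4_K_M",  # placeholder tag; use whatever is loaded
    "prompt": "Explain PCIe link width in two sentences.",
    "stream": False,
}, timeout=300)
resp.raise_for_status()
stats = resp.json()

# Ollama reports generated token count and generation time in nanoseconds
tok_per_s = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"{stats['eval_count']} tokens at {tok_per_s:.1f} tok/s")
```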
What I'd change if budget weren't a constraint
In rough order of impact:
- Memory to 32 GB. 16 GB is tight when Frigate + Netdata + multiple Docker bridges + agent containers are all live. The B450 supports 64 GB; populating B1/B2 with 2× 16 GB is a $60 upgrade.
- CPU to a Ryzen 5 5600G. Same socket, 6C / 12T, integrated GPU stays so I don't spend a discrete GPU on display. Removes the CPU-bound bottleneck for free.
- One uniform GPU generation. The Pascal 1080 is dead weight versus the 1660 Tis for the inference workloads I actually run. Replacing it with a fourth 1660 Ti would even out the layout and make scheduling trivial (any agent → any GPU).
- A second NVMe for model weights. Currently models live on the same SSD as the OS. A separate drive for `~/.ollama/models` would let me cold-cache a much larger model library without touching the boot volume.
I won't actually do any of these soon – the rig works, and "works" is the bar.
Connection points
- This rig is the inference substrate behind Autonomous Agent Arena. The "what's running" section above is literally what runs the bots that compete on arenabot.io.
- §5 of the Karpathy autoresearch research covers the same hardware from a different angle – what training (not just inference) on this rig would look like.
- Pairs with How I Run Claude Code – the homelab is the substrate, that note is the workflow on top.