The landscape beyond one tool
OpenClaw is one option for self-hosted inference, but it is not the only one. The local LLM ecosystem has grown quickly, and several tools solve the same problem with different trade-offs. Which one fits depends heavily on your hardware, your use case, and how much operational overhead you are willing to accept.
This article covers the main alternatives, their strengths and weaknesses, and how to run them usefully on a small or constrained server – a VPS, a small colo machine, or repurposed office hardware.
The main contenders
Ollama
Ollama is probably the most widely used option for local inference right now. It wraps llama.cpp in a clean daemon with a REST API and a CLI, handles model downloads, and manages VRAM/RAM allocation automatically.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b
ollama serveThe API is OpenAI-compatible, so any client that speaks to the OpenAI API will work with Ollama with minimal changes. It runs on Linux, macOS, and Windows.
Strengths: Easy setup, automatic GPU detection, broad model support, active community.
Weaknesses: Limited configuration depth, no built-in multi-user access control, not designed for high-concurrency production workloads.
llama.cpp
llama.cpp is the underlying inference engine that most other tools build on. Running it directly gives you more control and lower overhead.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)Start the HTTP server:
./build/bin/llama-server --model /models/llama3-8b.Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 4096 --n-gpu-layers 99Strengths: Minimal dependencies, maximum control, lowest memory footprint, runs entirely on CPU if needed.
Weaknesses: No model management, no API key support out of the box, manual setup required.
LocalAI
LocalAI aims to be a drop-in replacement for the OpenAI API across multiple backends (llama.cpp, whisper, stable-diffusion, and more). It is the right choice when you need more than just text inference from a single endpoint.
docker run -p 8080:8080 -v /models:/build/models localai/localai:latestStrengths: Multi-modal (text, speech, image), full OpenAI API compatibility, Docker-native.
Weaknesses: Higher overhead than a bare llama.cpp process, more moving parts to maintain.
vLLM
vLLM is designed for throughput. It uses PagedAttention to serve many requests concurrently with much higher efficiency than standard inference. If you are building a service that many users or processes hit simultaneously, vLLM is the right foundation.
pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-8B-Instruct --host 0.0.0.0 --port 8000Strengths: Best concurrency performance, OpenAI-compatible, supports tensor parallelism across multiple GPUs.
Weaknesses: Requires more VRAM than llama.cpp, Python dependency stack, overkill for single-user setups.
Running on a small server: what actually works
Not everyone has a server with 24 GB VRAM. The realistic small-server scenario is one of:
- –A VPS with no GPU (4–16 GB RAM)
- –A small colo machine with an older GPU (6–8 GB VRAM)
- –Repurposed desktop hardware with a mid-range consumer GPU
Here is what you can actually do in each case.
CPU-only VPS (no GPU)
CPU inference is slow but functional for low-traffic use cases: a private assistant, a summarization endpoint called a few times a day, or a development API.
Use llama.cpp or Ollama with quantized models. The key is choosing the right quantization level:
| Model size | Quantization | RAM needed | Speed (4-core VPS) |
|-----------|-------------|------------|-------------------|
| 7B | Q4_K_M | ~5 GB | ~5–10 tokens/sec |
| 7B | Q2_K | ~3 GB | ~7–12 tokens/sec |
| 13B | Q4_K_M | ~9 GB | ~3–6 tokens/sec |
For a CPU-only VPS, a 7B model at Q4_K_M quantization is the practical sweet spot. Larger models become unusably slow for interactive use.
Enable all available cores:
ollama serve &
OLLAMA_NUM_PARALLEL=1 OLLAMA_MAX_LOADED_MODELS=1 ollama run llama3:8bOr with llama.cpp:
./llama-server --model llama3-8b.Q4_K_M.gguf --threads $(nproc) --ctx-size 2048 --n-gpu-layers 0Small GPU (6–8 GB VRAM)
An RTX 3060 (12 GB), RTX 4060 (8 GB), or older GTX 1080 (8 GB) opens up GPU inference, which is 5–10x faster than CPU for most models.
Fit the model into VRAM by adjusting the number of GPU layers. If the model does not fully fit, llama.cpp offloads the remainder to CPU (hybrid inference):
# Try full GPU offload first
./llama-server --model llama3-8b.Q4_K_M.gguf --n-gpu-layers 99 --ctx-size 4096
# If VRAM is tight, reduce layers
./llama-server --model llama3-8b.Q4_K_M.gguf --n-gpu-layers 20 --ctx-size 2048Ollama handles this automatically and reports how many layers it offloaded:
ollama run llama3:8b
# Output includes: "loaded X layers on GPU"For 6–8 GB VRAM, practical model choices:
- –**Llama 3 8B Q4_K_M** (~5 GB VRAM) – good general model, fits comfortably
- –**Mistral 7B Q4_K_M** (~5 GB VRAM) – strong instruction following
- –**Gemma 2 9B Q4_K_M** (~6 GB VRAM) – excellent quality-per-size ratio
- –**Qwen2.5 7B Q4_K_M** (~5 GB VRAM) – strong for code and structured tasks
Serving it safely
Regardless of the tool, the same hardening applies as with any internal service:
- –Bind to localhost or a VPN interface only, never 0.0.0.0 on a public IP without authentication
- –Put a reverse proxy (nginx or Caddy) in front with TLS and basic auth or mTLS
- –Rate-limit the inference endpoint to prevent accidental or intentional resource exhaustion
A minimal Caddy configuration:
ai.internal.example.com {
basicauth {
user $2a$14$...
}
reverse_proxy localhost:11434
}Choosing the right tool
| Scenario | Recommended tool |
|---------|-----------------|
| Personal assistant, easy setup | Ollama |
| Low RAM, maximum control | llama.cpp direct |
| Multi-modal (text + speech + image) | LocalAI |
| High concurrency, team usage | vLLM |
| Embedded in Python application | llama-cpp-python |
Conclusion
The self-hosted inference ecosystem has matured enough that you no longer need specialized hardware or complex infrastructure to run a useful local model. A modest server with 8–16 GB RAM can handle CPU inference for low-traffic workloads. A consumer GPU from the last two generations covers most team-scale use cases.
The choice of tool matters less than the choice of model and quantization level. Start with Ollama for its ease of use, drop down to llama.cpp when you need more control or lower overhead.
If you want to set up local inference in your environment or need help choosing the right hardware – get in touch.