← Back to Blog
·10 min read

Local LLM alternatives to OpenClaw – and how to run them on a small server

AISelf-HostedLLMOllamallama.cppServer

The landscape beyond one tool

OpenClaw is one option for self-hosted inference, but it is not the only one. The local LLM ecosystem has grown quickly, and several tools solve the same problem with different trade-offs. Which one fits depends heavily on your hardware, your use case, and how much operational overhead you are willing to accept.

This article covers the main alternatives, their strengths and weaknesses, and how to run them usefully on a small or constrained server – a VPS, a small colo machine, or repurposed office hardware.

The main contenders

Ollama

Ollama is probably the most widely used option for local inference right now. It wraps llama.cpp in a clean daemon with a REST API and a CLI, handles model downloads, and manages VRAM/RAM allocation automatically.

bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b
ollama serve

The API is OpenAI-compatible, so any client that speaks to the OpenAI API will work with Ollama with minimal changes. It runs on Linux, macOS, and Windows.

Strengths: Easy setup, automatic GPU detection, broad model support, active community.

Weaknesses: Limited configuration depth, no built-in multi-user access control, not designed for high-concurrency production workloads.

llama.cpp

llama.cpp is the underlying inference engine that most other tools build on. Running it directly gives you more control and lower overhead.

bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Start the HTTP server:

bash
./build/bin/llama-server   --model /models/llama3-8b.Q4_K_M.gguf   --host 0.0.0.0   --port 8080   --ctx-size 4096   --n-gpu-layers 99

Strengths: Minimal dependencies, maximum control, lowest memory footprint, runs entirely on CPU if needed.

Weaknesses: No model management, no API key support out of the box, manual setup required.

LocalAI

LocalAI aims to be a drop-in replacement for the OpenAI API across multiple backends (llama.cpp, whisper, stable-diffusion, and more). It is the right choice when you need more than just text inference from a single endpoint.

bash
docker run -p 8080:8080   -v /models:/build/models   localai/localai:latest

Strengths: Multi-modal (text, speech, image), full OpenAI API compatibility, Docker-native.

Weaknesses: Higher overhead than a bare llama.cpp process, more moving parts to maintain.

vLLM

vLLM is designed for throughput. It uses PagedAttention to serve many requests concurrently with much higher efficiency than standard inference. If you are building a service that many users or processes hit simultaneously, vLLM is the right foundation.

bash
pip install vllm
python -m vllm.entrypoints.openai.api_server   --model meta-llama/Llama-3-8B-Instruct   --host 0.0.0.0   --port 8000

Strengths: Best concurrency performance, OpenAI-compatible, supports tensor parallelism across multiple GPUs.

Weaknesses: Requires more VRAM than llama.cpp, Python dependency stack, overkill for single-user setups.

Running on a small server: what actually works

Not everyone has a server with 24 GB VRAM. The realistic small-server scenario is one of:

  • A VPS with no GPU (4–16 GB RAM)
  • A small colo machine with an older GPU (6–8 GB VRAM)
  • Repurposed desktop hardware with a mid-range consumer GPU

Here is what you can actually do in each case.

CPU-only VPS (no GPU)

CPU inference is slow but functional for low-traffic use cases: a private assistant, a summarization endpoint called a few times a day, or a development API.

Use llama.cpp or Ollama with quantized models. The key is choosing the right quantization level:

| Model size | Quantization | RAM needed | Speed (4-core VPS) |

|-----------|-------------|------------|-------------------|

| 7B | Q4_K_M | ~5 GB | ~5–10 tokens/sec |

| 7B | Q2_K | ~3 GB | ~7–12 tokens/sec |

| 13B | Q4_K_M | ~9 GB | ~3–6 tokens/sec |

For a CPU-only VPS, a 7B model at Q4_K_M quantization is the practical sweet spot. Larger models become unusably slow for interactive use.

Enable all available cores:

bash
ollama serve &
OLLAMA_NUM_PARALLEL=1 OLLAMA_MAX_LOADED_MODELS=1 ollama run llama3:8b

Or with llama.cpp:

bash
./llama-server --model llama3-8b.Q4_K_M.gguf   --threads $(nproc)   --ctx-size 2048   --n-gpu-layers 0

Small GPU (6–8 GB VRAM)

An RTX 3060 (12 GB), RTX 4060 (8 GB), or older GTX 1080 (8 GB) opens up GPU inference, which is 5–10x faster than CPU for most models.

Fit the model into VRAM by adjusting the number of GPU layers. If the model does not fully fit, llama.cpp offloads the remainder to CPU (hybrid inference):

bash
# Try full GPU offload first
./llama-server --model llama3-8b.Q4_K_M.gguf   --n-gpu-layers 99   --ctx-size 4096

# If VRAM is tight, reduce layers
./llama-server --model llama3-8b.Q4_K_M.gguf   --n-gpu-layers 20   --ctx-size 2048

Ollama handles this automatically and reports how many layers it offloaded:

bash
ollama run llama3:8b
# Output includes: "loaded X layers on GPU"

For 6–8 GB VRAM, practical model choices:

  • **Llama 3 8B Q4_K_M** (~5 GB VRAM) – good general model, fits comfortably
  • **Mistral 7B Q4_K_M** (~5 GB VRAM) – strong instruction following
  • **Gemma 2 9B Q4_K_M** (~6 GB VRAM) – excellent quality-per-size ratio
  • **Qwen2.5 7B Q4_K_M** (~5 GB VRAM) – strong for code and structured tasks

Serving it safely

Regardless of the tool, the same hardening applies as with any internal service:

  • Bind to localhost or a VPN interface only, never 0.0.0.0 on a public IP without authentication
  • Put a reverse proxy (nginx or Caddy) in front with TLS and basic auth or mTLS
  • Rate-limit the inference endpoint to prevent accidental or intentional resource exhaustion

A minimal Caddy configuration:

ai.internal.example.com {
    basicauth {
        user $2a$14$...
    }
    reverse_proxy localhost:11434
}

Choosing the right tool

| Scenario | Recommended tool |

|---------|-----------------|

| Personal assistant, easy setup | Ollama |

| Low RAM, maximum control | llama.cpp direct |

| Multi-modal (text + speech + image) | LocalAI |

| High concurrency, team usage | vLLM |

| Embedded in Python application | llama-cpp-python |

Conclusion

The self-hosted inference ecosystem has matured enough that you no longer need specialized hardware or complex infrastructure to run a useful local model. A modest server with 8–16 GB RAM can handle CPU inference for low-traffic workloads. A consumer GPU from the last two generations covers most team-scale use cases.

The choice of tool matters less than the choice of model and quantization level. Start with Ollama for its ease of use, drop down to llama.cpp when you need more control or lower overhead.

If you want to set up local inference in your environment or need help choosing the right hardware – get in touch.

Questions or feedback regarding this article?

Send Message