
OpenClaw self-hosted: AI in the datacenter and on the GPU at the office

AI · Self-Hosted · OpenClaw · GPU · Privacy

Why self-host an AI model at all?

Most teams discovering AI tooling start with a SaaS API: paste a key, make a request, done. For prototyping, that is fine. For production use inside a company, it often is not.

The core problem is data. Every prompt sent to an external provider leaves the perimeter. For internal documentation queries, code review, log analysis, or anything touching customer data, that is a compliance and confidentiality risk many organizations cannot accept. NIS2 and GDPR make this more concrete – not less.

Self-hosting solves this. The model runs on your hardware, the data never leaves your network, and you control the version, the configuration, and the access.

The architecture: two tiers

The setup I have been running combines two deployment targets:

Tier 1 – Datacenter server (always-on inference endpoint)

A dedicated server in a colocation datacenter running OpenClaw as the primary inference backend. This machine is always reachable, handles concurrent requests from multiple users or services, and runs models that fit in its VRAM or in quantized form in system RAM.

Tier 2 – Office GPU workstation (high-performance local node)

A workstation with a consumer or prosumer GPU in the office. More raw compute than the datacenter server for large models or batch jobs. Connected to the datacenter server via WireGuard VPN. Not always-on, but available during working hours and for scheduled overnight jobs.

Both tiers speak the same API. Clients do not need to know which backend handles a request.
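Because both tiers answer the same API, a client can be written against nothing but a base URL. A minimal sketch in Python – the `/api/generate` path and the JSON field names are illustrative assumptions, not a documented OpenClaw contract:

```python
import json
from urllib import request


def build_generate_request(base_url: str, model: str, prompt: str):
    """Build URL and payload for an inference request.

    The client only needs a base URL; it does not care whether the
    datacenter server or the office workstation answers. The endpoint
    path and payload fields are assumptions for illustration.
    """
    url = f"{base_url.rstrip('/')}/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return url, payload


def generate(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    url, payload = build_generate_request(base_url, model, prompt)
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp).get("response", "")
```

Switching backends is then just a matter of changing `base_url`, for example from `http://10.10.0.1:11434` to `http://10.10.0.2:11434`.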

Datacenter server setup

Hardware requirements depend heavily on which models you want to run. A practical starting point for a team of 5–15:

  • A server with 32–64 GB RAM
  • One GPU with 16–24 GB VRAM (e.g. NVIDIA RTX 3090, A4000, or equivalent)
  • Fast NVMe storage for model weights
  • Stable 1 Gbit uplink

OpenClaw installation on Debian/AlmaLinux:

```bash
curl -fsSL https://openclaw.ai/install.sh | sh
systemctl enable --now openclaw
```

For production, run it behind a reverse proxy (nginx or Caddy) with TLS and basic authentication. Expose it only to your VPN subnet, not to the public internet.

An example nginx block:

```nginx
server {
    listen 443 ssl;
    server_name ai.internal.example.com;

    ssl_certificate     /etc/ssl/certs/internal.crt;
    ssl_certificate_key /etc/ssl/private/internal.key;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}
```

VPN connection: WireGuard between datacenter and office

WireGuard is the right tool here. It is fast, simple to configure, and integrates cleanly with systemd on both ends.

Datacenter server (wg0.conf):

```ini
[Interface]
Address = 10.10.0.1/24
ListenPort = 51820
PrivateKey = <datacenter-private-key>

[Peer]
PublicKey = <office-public-key>
AllowedIPs = 10.10.0.2/32
```

Office workstation (wg0.conf):

```ini
[Interface]
Address = 10.10.0.2/24
PrivateKey = <office-private-key>

[Peer]
PublicKey = <datacenter-public-key>
Endpoint = <datacenter-public-ip>:51820
AllowedIPs = 10.10.0.0/24
PersistentKeepalive = 25
```

With this in place, the office workstation can reach the datacenter server at 10.10.0.1, and vice versa. Both OpenClaw instances can see each other over the private subnet.
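A quick way to confirm that each side can actually reach the other's API over the tunnel is a plain TCP probe against the peer's inference port (11434 in the proxy config above). A small sketch:

```python
import socket


def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example: from either end of the tunnel, probe both WireGuard addresses.
# for host in ("10.10.0.1", "10.10.0.2"):
#     print(host, reachable(host, 11434))
```

The same helper is useful later as the availability check when deciding which backend should handle a request.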

Office GPU workstation: the heavy compute node

The office machine typically has more VRAM than the datacenter server – a workstation with an RTX 4090 (24 GB) or two GPUs can run large models at full precision that would need heavy quantization elsewhere.

Run a second OpenClaw instance here, pointing at the local GPU:

```bash
OPENCLAW_HOST=10.10.0.2 openclaw serve --gpu
```

For routing decisions between the two backends, a lightweight proxy layer (or a simple config in your tooling) can direct large model requests to the office machine when it is available and fall back to the datacenter server otherwise.
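That routing logic can be expressed in a few lines. A minimal sketch, assuming each backend advertises which models it serves and exposes some availability check; the model names and URLs are the ones used elsewhere in this article, but the structure is purely illustrative:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Backend:
    name: str
    base_url: str
    models: set            # models this backend can serve
    is_up: Callable[[], bool]  # availability check, e.g. a TCP probe


def pick_backend(model: str, backends: List[Backend]) -> Optional[Backend]:
    """Return the first available backend that serves `model`.

    Backends are tried in priority order: list the office GPU first for
    large models, with the datacenter server as the always-on fallback.
    Returns None if no backend can take the request.
    """
    for b in backends:
        if model in b.models and b.is_up():
            return b
    return None
```

When the office machine is offline, requests for models both tiers can serve simply fall through to the datacenter server; requests for office-only models fail explicitly rather than silently degrading.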

Model management

A practical model split:

  • **Datacenter server:** Smaller, quantized models (7B–13B range) for fast responses, always available, handles most everyday tasks.
  • **Office GPU:** Full-precision or lightly quantized large models (30B–70B range) for complex reasoning, code generation, document analysis. Used on demand.

Pull models with:

```bash
openclaw pull mixtral:8x7b
openclaw pull codestral:22b
```

Models are stored locally. A 7B model at Q4 quantization is roughly 4 GB; a 70B model at Q4 is roughly 40 GB. Plan your storage accordingly.
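The storage math generalizes. Using the figures above, Q4 quantization works out to roughly 0.57 bytes per parameter once scales, embeddings, and format overhead are included; that constant is a rough planning estimate, not a format specification:

```python
def q4_size_gb(params_billion: float, bytes_per_param: float = 0.57) -> float:
    """Approximate on-disk size of a Q4-quantized model in GB.

    4-bit weights alone are 0.5 bytes/param; the extra ~0.07 covers
    quantization scales and format overhead. A planning estimate only.
    """
    return params_billion * bytes_per_param  # 1e9 params x bytes -> GB
```

This reproduces the rule-of-thumb numbers: about 4 GB for a 7B model and about 40 GB for a 70B model.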

Access control and audit logging

For a team setup, access should be controlled and auditable. A few practices that work well:

  • Issue per-user or per-service API keys, not a shared credential
  • Log all requests at the proxy layer with user identifier, model, and timestamp
  • Rotate keys on a schedule and immediately on team changes
  • Keep model weights on encrypted storage
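For the proxy-layer logging, a structured JSON-lines format keeps the audit trail machine-readable. A sketch of one log entry; the field names are a suggestion, not a standard:

```python
import json
from datetime import datetime, timezone


def audit_entry(user: str, model: str, status: int, key_id: str) -> str:
    """Format one request as a JSON log line: who, which model, when, outcome.

    `key_id` identifies the API key used; never log the key itself.
    """
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "key_id": key_id,
        "model": model,
        "status": status,
    }, sort_keys=True)
```

One line per request is enough to answer the questions that matter later: who queried which model, when, and whether the request succeeded.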

If you are using OpenClaw inside a Kubernetes cluster, deploy it as a normal workload with a service, and control access through network policies and an ingress with authentication middleware.

What this setup costs

Rough numbers for a small team setup:

  • Colocation + server hardware (used): 150–300 EUR/month all-in
  • Office GPU workstation (one-time): 2,000–4,000 EUR
  • Electricity (office workstation, when running): ~50–80 EUR/month

Compare that to API costs at scale: at 10 million tokens per month across a team, SaaS API costs easily reach 100–500 EUR/month depending on the model – and scale linearly. The self-hosted setup has higher upfront cost but a clear cost ceiling and no per-token billing.
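The break-even point can be estimated from the figures above. A sketch using midpoints of the quoted ranges, with the workstation amortized over three years; the 30 EUR per million tokens is an illustrative price within the 100–500 EUR band at 10M tokens, not a quote:

```python
def monthly_self_hosted(colo_eur: float = 225.0,
                        workstation_eur: float = 3000.0,
                        amortize_months: int = 36,
                        power_eur: float = 65.0) -> float:
    """Monthly cost of the two-tier setup, workstation amortized over 3 years.

    Defaults are midpoints of the ranges quoted in this article.
    """
    return colo_eur + workstation_eur / amortize_months + power_eur


def monthly_api(tokens_millions: float, eur_per_million: float = 30.0) -> float:
    """SaaS API cost at a given volume; the per-token price is illustrative."""
    return tokens_millions * eur_per_million
```

With these assumptions the lines cross around 12–13 million tokens per month; past that, the self-hosted cost stays flat while the API bill keeps scaling linearly.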

For teams doing significant AI workloads and handling sensitive data, the economics and the compliance picture both favor self-hosting.

Practical recommendations

  • Start with the datacenter server only. Get one model working reliably end-to-end before adding the GPU node.
  • Use WireGuard for the VPN – not OpenVPN, not a cloud NAT gateway. It is simpler and faster.
  • Do not expose the inference endpoint to the public internet. VPN-only access is the right default.
  • Monitor GPU utilization and temperature, especially on the office workstation running continuous jobs.
  • Test your fallback behavior: what happens to user requests when the office GPU is offline?

Conclusion

Self-hosting AI inference is not exotic anymore. The tooling is mature, the hardware is accessible, and the operational overhead is comparable to running any other stateful service. For teams with data residency requirements or meaningful AI usage volumes, it is the more sensible choice.

The two-tier setup – always-on datacenter for everyday tasks, local GPU for heavy lifting – gives you both availability and performance without committing to one machine doing everything.

Questions about setting this up in your environment? Get in touch.
