
Container

We can also use containers to run AI models.

A popular choice is llama.cpp. Its built-in server provides an OpenAI-compatible API, so any OpenAI client can interact with the model (see the request example after the compose file).

In the Docker Compose file below, a one-shot model-downloader service first downloads the model file of Qwen3-0.6B from Hugging Face; once the download completes successfully, the llama.cpp server starts and serves the model.

Compose file to run AI models
services:
  # llama.cpp server; starts only after the model file has been downloaded
  model-runner:
    image: ghcr.io/ggml-org/llama.cpp:server
    volumes:
      - model-files:/models
    command:
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8080"
      - "-n"                # maximum number of tokens to predict per request
      - "512"
      - "-m"                # model file to load, read from the shared volume
      - "/models/Qwen3-0.6B-Q8_0.gguf"
    ports:
      - "8180:8080"
    depends_on:
      model-downloader:
        condition: service_completed_successfully

  # one-shot job that downloads the GGUF model file from Hugging Face
  model-downloader:
    image: ghcr.io/alexcheng1982/model-downloader
    restart: "no"
    volumes:
      - model-files:/models
    command:
      - "hf"
      - "download"
      - "unsloth/Qwen3-0.6B-GGUF"
      - "Qwen3-0.6B-Q8_0.gguf"
      - "--local-dir"
      - "/models"

# named volume shared by both services to hold the model file
volumes:
  model-files:
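
After starting the stack with docker compose up, the server can be queried like any OpenAI endpoint. Below is a minimal sketch using the official openai Python client; the base URL and port come from the compose file above, while the model name and prompt are only illustrative (the server loads a single model, so the name sent by the client is informational).

Python example to query the model
# Minimal sketch: query the llama.cpp server through its
# OpenAI-compatible API. Assumes the compose stack above is running
# and the openai Python package is installed (pip install openai).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",  # host port published by model-runner
    api_key="none",                       # llama.cpp requires no API key by default
)

response = client.chat.completions.create(
    model="Qwen3-0.6B",  # informational; the server serves the single loaded model
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response.choices[0].message.content)

The same endpoint also works with curl or any other OpenAI-compatible SDK.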

See the GitHub repo below for more details about running models in a container.