Container
We can also use containers to run AI models.
A popular choice is llama.cpp, which provides an
OpenAI-compatible API for interacting with the model it serves.
In the Docker Compose file below, the model file for Qwen3-0.6B
is first downloaded from Hugging Face, and llama.cpp
is then started to serve this model.
Compose file to run AI models
services:
  model-runner:
    image: ghcr.io/ggml-org/llama.cpp:server
    volumes:
      - model-files:/models
    command:
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8080"
      - "-n"
      - "512"
      - "-m"
      - "/models/Qwen3-0.6B-Q8_0.gguf"
    ports:
      - "8180:8080"
    depends_on:
      model-downloader:
        condition: service_completed_successfully
  model-downloader:
    image: ghcr.io/alexcheng1982/model-downloader
    restart: "no"
    volumes:
      - model-files:/models
    command:
      - "hf"
      - "download"
      - "unsloth/Qwen3-0.6B-GGUF"
      - "Qwen3-0.6B-Q8_0.gguf"
      - "--local-dir"
      - "/models"
volumes:
  model-files:
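Once both services are running, the model can be queried through the published port 8180 with any OpenAI-compatible client. Below is a minimal sketch using the Python openai package, which must be installed separately; llama.cpp does not validate the API key, so any placeholder value works, and the model name is only informational since the server loads the single model passed with -m.
Python client for the OpenAI-compatible API
from openai import OpenAI

# Point the client at the llama.cpp server started by Docker Compose.
# llama.cpp does not check the API key, so any non-empty value works.
client = OpenAI(base_url="http://localhost:8180/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-0.6B-Q8_0.gguf",  # informational; the server serves one model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)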
See the GitHub repo below for more details about running models in containers.