
Container

We can also use containers to run AI models.

A popular choice is llama.cpp. Its built-in server provides an OpenAI-compatible API, so any OpenAI client can interact with the model (see the request example after the compose file).

In the Docker Compose file below, a one-shot model-downloader service first downloads the model file of Qwen3-0.6B from Hugging Face; once the download completes successfully, the llama.cpp server starts and serves the model.

Compose file to run AI models
services:
  # llama.cpp server; starts only after the model file has been downloaded
  model-runner:
    image: ghcr.io/ggml-org/llama.cpp:server
    volumes:
      - model-files:/models
    command:
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8080"
      - "-n"                # maximum number of tokens to predict per request
      - "512"
      - "-m"                # model file to load, read from the shared volume
      - "/models/Qwen3-0.6B-Q8_0.gguf"
    ports:
      - "8180:8080"
    depends_on:
      model-downloader:
        condition: service_completed_successfully

  # one-shot job that downloads the GGUF model file from Hugging Face
  model-downloader:
    image: ghcr.io/alexcheng1982/model-downloader
    restart: "no"
    volumes:
      - model-files:/models
    command:
      - "hf"
      - "download"
      - "unsloth/Qwen3-0.6B-GGUF"
      - "Qwen3-0.6B-Q8_0.gguf"
      - "--local-dir"
      - "/models"

# named volume shared by both services to hold the model file
volumes:
  model-files:
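
After starting the stack with docker compose up, the server can be queried like any OpenAI endpoint. Below is a minimal sketch using the official openai Python client; the base URL and port come from the compose file above, while the model name and prompt are only illustrative (the server loads a single model, so the name sent by the client is informational).

Python example to query the model
# Minimal sketch: query the llama.cpp server through its
# OpenAI-compatible API. Assumes the compose stack above is running
# and the openai Python package is installed (pip install openai).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",  # host port published by model-runner
    api_key="none",                       # llama.cpp requires no API key by default
)

response = client.chat.completions.create(
    model="Qwen3-0.6B",  # informational; the server serves the single loaded model
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response.choices[0].message.content)

The same endpoint also works with curl or any other OpenAI-compatible SDK.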

See the GitHub repo below for more details about running models in a container.