Self-hosting Llama 3 on a home server

7 min readApr 20, 2024

Self-hosting Llama 3 as your own ChatGPT replacement service using a 10 year old graphics card and open source components.

Last week Meta launched Llama 3, the latest in their open source LLM series. Llama 3 is particularly interesting because the 8 billion parameter model, which is small enough to run on a laptop, performs as well as models 10x bigger than it. The responses it provides are as good as GPT-4 for many use cases.

I finally decided that this was motivation enough to dig out my old Nvidia Titan X card from the loft and slot it into my home server so that I could stand up a ChatGPT clone on my home network. In this post I explain some of the pros and cons of self-hosting llama 3 and provide configuration and resources to help you do it too.

How it works

The model is served by Ollama which is a GPU-enabled open source service for running LLMs as a service. Ollama makes heavy use of llama.cpp, t he same tech that I used to build turbopilot around 1 year ago. The frontend is powered by OpenWebUI which provides a ChatGPT-like user experience for interacting with Ollama models.

I use docker compose to run the two services and wire them together and I’ve got a Caddy web server set up to let in traffic from the outside world.

Hardware

My setup is running on a cheap and cheerful AMD CPU and Motherboard package and a 10 year old Nvidia Titan X card (much better GPUS are available on Ebay for around £150. The RTX 3060 with 12GB VRAM would be a great choice). My server has 32GB RAM but this software combo uses a lot less than that. You could probably get away with 16GB and run it smoothly or possibly even 8GB at a push.

You could buy this bundle and a used RTX3060 on Ebay or a brand new one for around £250 and have a functional ChatGPT replacement in your house for less than £500.

Pros and Cons of Llama 3

Llama 3 8B truly is a huge step forward for open source alternatives to relying on APIS from OpenAI, Anthropic and their peers. I am still in the early stages of working with my self-hosted Llama 3 instance but so far I’m finding that it is just as capable as GPT-4 in many arenas.

Pro: Price

Self-hosting Llama 3 with Ollama and OpenWebUI is free-ish except for any initial investment you need to make for hardware and then electricity consumption. ChatGPT plus is currently $20/month but techies are likely also burning a similar amount in API calls too. I already had all the components for this build lying around the house but if I bought them 2nd hand it would take around 1 year for them to pay for themselves. That said, I could massively increase my API consumption through my self-hosted models since it’s effectively “free”.

Pro: Privacy

A huge advantage of this approach is that you’re not sending your data to an external company to be mined. The consumer version of ChatGPT that most people use is heavily data mined to improve OpenAI’s models and anything that you type in may end up in their corpus. Ollama runs entirely on your machine and never sends data back to any third party company.

Pro: Energy Consumption and Carbon Footprint

Another advantage is that since Llama 3:8B is small and it runs on a single GPU it uses a lot less energy to run than an average query to ChatGPT. My Titan X card consumes about 250 watts at max load but RTX 3060 cards only require 170 watts to run. Again, I had all the components lying around so I didn’t buy anything new to make this server and indeed it means I won’t be throwing away components that would otherwise become e-waste.

Con: Speed on old hardware

Self-hosting Llama 3 8B on a Titan X is a little slower than ChatGPT but is still perfectly serviceable. It would almost certainly be faster on RTX 3 and 4 series cards.

Con: Multimodal Performance

The biggest missing feature for me is currently multi-modal support. I use GPT-4 to do handwriting recognition and transcription for me and current gen open source models aren’t quite up to this yet. However, given the superb quality of Llama 3, I have no doubt that a similarly brilliant open multi-modal model is just around the corner.

Con: Training Transparency

Although Llama 3’s weights are free to download, the training corpus content is unknown. The model was built by Meta and thus is likely to have been trained on a large amount of user generated content and copyrighted content. Hosted third party models like ChatGPT are likely to be equally problematic in this regard but.

Setting up Llama 3 with Ollama and OpenWebUI

Once you have the hardware assembled and the operating system installed, the fiddliest part is configuring Docker and Nvidia correctly.

Ubuntu

If you’re on Ubuntu, you’ll need to install docker first. I recommend using the guide from Docker themselves which installs the latest and greatest packages. Then follow this guide to install the nvidia runtime. Then you will want to verify that it’s all set up using the checking step below.

Unraid

I actually run Unraid on my home server rather than Ubuntu. To get things running there, simply install the unraid nvidia plugin through the community apps page and make sure to stop and start docker before trying out the step below.

Checking the Docker and Nvidia Setup (All OSes)

To make sure that Docker and Nvidia are installed properly and able to talk to each other you can run:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This runs the nvidia-smi status utility which should show what your GPU is currently doing but crucially it’s doing so from inside docker which means that nvidia’s container runtime is all set up to pass through the nvidia drivers to whatever you’re running inside your container. You should see something like this:

Installing Ollama

Create a new directory and a new empty text file called docker-compose.yml. Into that file paste the following:

version: "3.0"
services:

  ui:
    image: ghcr.io/open-webui/open-webui:main
    restart: always
    ports:
      - 3011:8080
    volumes:
      - ./open-webui:/app/backend/data
    environment:
      # - "ENABLE_SIGNUP=false"
      - "OLLAMA_BASE_URL=http://ollama:11434"


  ollama:
    image: ollama/ollama
    restart: always
    ports:
      - 11434:11434
    volumes:
      - ./ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

We define the two services and we provide both with volume mounts to enable them to persist data to disk (such as models you downloaded and your chat history).

For now we leave ENABLE_SIGNUP commented out so that you can create an account in the web ui but later we can come back and turn that off so that internet denizens can’t sign up to use your chat.

Turn on Ollama

First we will turn on ollama and test it. Start by running docker-compose up -d ollama. (Depending on which version of docker you are running you might need to run docker compose rather than docker-compose). This will start just the ollama model server. We can interact with the model server by running an interactive chat session and downloading the model:

docker-compose exec ollama ollama run llama3:8b

In this command the first ollama refers to the container and ollama run llama3:8b is the command that will be executed inside the container. If all goes well you will see the server burst into action and download the llama3 model if this is the first time you've run it. You'll then be presented with an interactive prompt where you'll be able to chat to the model.

You can press CTRL+D to quit and move on to the next step.

Turn on the Web UI

Now we will start up the web ui. Run docker-compose up -d ui. Now open up your browser and go to http://localhost:3011/ to see the web ui. You will need to register for an account and log in. After which you will be able to interact with the model like so:

(Optional) Configure Outside Access

If you want to be able to chat to your models from the outside world you might want to stand up a reverse proxy to your server. If you’re new to self hosting and you’re not sure about how to do this, a safer option is probably to use Tailscale to build a VPN which you can use to securely connect to your home network without risking exposing your systems to the public and/or hackers.

Conclusion

Llama 3 is a tremendously powerful model that is useful for a whole bunch of use cases including summarisation, creative brainstorming, code copiloting and more. The quality of the responses are in line with GPT-4 and it runs on much older, smaller hardware. Self-hosting Llama 3 won’t be for everyone and it’s quite technically involved. However, for AI geeks like me, running my own ChatGPT clone at home for next-to-nothing was too good an experiment to miss out on.

Originally published at https://brainsteam.co.uk on April 20, 2024.