Introduction

Deploying Large Language Models (LLMs) locally can be challenging, but with DeepSeek-r1, it's now more accessible than ever.

In this blog, I'll guide you through deploying a DeepSeek-r1 model on your local CPU/GPU system. We'll focus on deploying both 1.5B and 7B parameter models, depending on your system's capabilities. We'll also optimize the Ollama server to handle multiple concurrent requests and improve its performance. Finally, we'll set up the Open WebUI for better model interaction and management.

Prerequisites and System Requirements

Before diving in, make sure you have the following:

Hardware:

  • A modern multi-core CPU; a GPU is optional but will significantly speed up inference.
  • Enough free RAM for the model you choose: the 1.5B model is lightweight, while the 7B model is best run with roughly 8 GB of RAM or more.
  • A few gigabytes of free disk space for the model weights.

Software:

  • A Linux system with a terminal and curl available.
  • Docker, for running Open WebUI.
  • A web browser to access Open WebUI.

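If you're unsure what your machine has, two standard Linux utilities (not specific to this setup) give a quick picture of your CPU core count and available memory:

nproc
free -h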
Let's start by installing Ollama.

1. Install Ollama

To install Ollama, execute the following command in your terminal to download and run the official installation script:

curl -fsSL https://ollama.com/install.sh | sh
Fig: Installing Ollama

⚠️ Since we are running the model on a CPU-only system, you may see a warning that no compatible GPU was detected. This is normal and can be safely ignored.

After running the above script, Ollama will be installed on your system. Verify the installation by checking the version:

ollama --version
Fig: Ollama version
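The install script also sets Ollama up as a background service on most Linux systems. As a quick sanity check, assuming the server is listening on its default port 11434, you can hit its root endpoint, which should reply with "Ollama is running":

curl http://localhost:11434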

2. Pull the DeepSeek-r1 Models

Now that Ollama is installed, let's pull the DeepSeek-r1 models. You can download the 1.5B and 7B versions by specifying their tags explicitly with the Ollama CLI (the bare deepseek-r1 name pulls the library's default tag, which is not the 1.5B model):

ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:7b

To verify that both models were pulled successfully, we can run:

ollama list
Fig: Listing Ollama models
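Optionally, you can inspect a pulled model's metadata (architecture, parameter count, context length, quantization) with the show subcommand before running it:

ollama show deepseek-r1:7b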

3. Run the DeepSeek-r1 Models Locally [CLI]

After pulling the models, you can run them locally with Ollama. You can start each model using the following commands:

ollama run deepseek-r1:1.5b
ollama run deepseek-r1:7b
Fig: Running Ollama CLI

Voila! The model is working. We can also send a query to the model to test if it responds properly:

Fig: Testing Ollama endpoint
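If you'd rather test over HTTP instead of the interactive CLI, here's a minimal sketch against Ollama's generate endpoint (assuming the server's default port 11434):

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'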

4. Set Up Docker

To simplify interaction with your models, you can deploy a web-based interface using Docker. This allows you to manage and query your models through a clean, browser-based UI.

First, make sure that Docker is present on your system. If it isn't, install it by following the official guide: Docker Installation
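You can confirm Docker is installed and accessible from your terminal by printing its version:

docker --version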

Fig: Docker Version

5. Deploy the Open WebUI Using Docker

For more efficient model interaction, we'll deploy Open WebUI (formerly known as Ollama Web UI). The following Docker command will map container port 8080 to host port 3000 and ensure the container can communicate with the Ollama server running on your host machine.

Run the following Docker command:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
Fig: Running the Open WebUI Docker container
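To confirm the container came up correctly, the standard Docker commands work here; check its status and read its logs if anything looks off:

docker ps --filter name=open-webui
docker logs open-webui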

Now the WebUI has been successfully deployed and can be accessed at http://localhost:3000

Fig: Open WebUI

6. Advanced Ollama Performance Tuning [ Optional ]

If you want to squeeze out extra performance and customize how Ollama handles model inference and resource management, you can adjust several environment variables before starting the Ollama server:

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_KEEP_ALIVE="-1"
export OLLAMA_MAX_LOADED_MODELS=3
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_QUEUE=1024
  • OLLAMA_FLASH_ATTENTION: Enables flash attention, which can speed up inference and significantly reduce memory usage as the context size grows.
  • OLLAMA_KV_CACHE_TYPE: Sets the quantization type for the key-value cache; q8_0 roughly halves the KV cache's memory footprint compared to the default f16 with minimal quality loss (it requires flash attention to be enabled).
  • OLLAMA_KEEP_ALIVE: Controls how long a model stays loaded in memory after a request; setting it to -1 keeps models loaded indefinitely instead of unloading them after the default five minutes.
  • OLLAMA_MAX_LOADED_MODELS: Adjust this based on your system's memory capacity to control how many models are loaded concurrently.
  • OLLAMA_NUM_PARALLEL: Sets how many requests each loaded model can process in parallel; increasing it can improve throughput under concurrent load, at the cost of additional memory per model.
  • OLLAMA_MAX_QUEUE: This defines the maximum number of queued requests, useful for high-concurrency environments.
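Note that if Ollama was installed with the official script on Linux, the server usually runs as a systemd service, so variables exported in an interactive shell won't reach it. A minimal sketch for making the settings persistent, assuming the default ollama.service unit created by the installer, is to add them to a systemd override:

sudo systemctl edit ollama.service

Then add your variables under the [Service] section, for example:

[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_NUM_PARALLEL=8"

Finally, reload systemd and restart Ollama so the settings take effect:

sudo systemctl daemon-reload
sudo systemctl restart ollama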

Conclusion

By following this blog post, you can seamlessly set up your local environment to deploy and interact with the DeepSeek-r1 model using Ollama and Open WebUI. You'll be able to harness both the 1.5B and 7B parameter models to their full potential.

Happy deploying and experimenting with your AI models!

Written By

Ujwal Budha

Ujwal Budha is a passionate Cloud & DevOps Engineer with hands-on experience in AWS, Terraform, Ansible, Docker, and CI/CD pipelines. Currently working as a Jr. Cloud Engineer at Adex International Pvt. Ltd., he specializes in building scalable cloud infrastructure and automating deployment workflows. An AWS Certified Solutions Architect Associate, Ujwal enjoys sharing his knowledge through technical blogs and helping others navigate their cloud journey.