Run DeepSeek-R1-Distill-Llama-70B

Learn to run DeepSeek-R1-Distill-Llama-70B using text-generation-inference (TGI) on Intel XPUs. You may also experiment with smaller DeepSeek-R1-Distill models.

The DeepSeek-R1-Distill-Llama-70B model is a distilled version of the DeepSeek-R1 model, derived from the Llama-3.3-70B-Instruct architecture. It’s designed to emulate the reasoning capabilities of the original 671 billion parameter model while being more efficient. Balancing performance and efficiency, it’s a great choice for complex tasks with reduced computational requirements.
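
The same workflow also works with the smaller distills mentioned above; they require less memory and download time. The repository names below were published alongside DeepSeek-R1, but verify them on Hugging Face before substituting one for MODEL_ID in the launch step later in this guide.

# Smaller DeepSeek-R1 distills that can stand in for MODEL_ID in the launch step
# below (verify the exact repository names on Hugging Face):
#   deepseek-ai/DeepSeek-R1-Distill-Llama-8B
#   deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
#   deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-14B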

Prerequisites

Compute Instance

Recommended Compute Instances

| Instance Type | Processor Model                        | Card Quantity | Disk  | Memory | Link       |
|---------------|----------------------------------------|---------------|-------|--------|------------|
| Bare Metal    | Intel® Data Center GPU Max Series 1550 | 8             | 2TB   | 2TB    | Go to 1550 |
| Bare Metal    | Intel® Data Center GPU Max Series 1100 | 8             | 960GB | 1TB    | Go to 1100 |

Launch Instance

  1. Visit the Intel® Tiber™ AI Cloud console home page.

  2. Log into your account.

  3. Click Catalog -> Hardware from the menu at left.

  4. Click the filter GPU.

  5. Select the instance type: Intel® Data Center GPU Max Series

  6. Complete the instance configuration, using one of the options from the Recommended Compute Instances table above.

    1. Use one of these configurations: Intel® Data Center GPU Max Series 1550 BM or Intel® Data Center GPU Max Series 1100 BM.

    2. For Instance type, choose one prefixed with bare metal, or BM.

    3. For Machine image, use default.

  7. Add Instance name.

  8. Choose an option to connect.

    1. One-Click connection (recommended)

    2. Public Keys

  9. Click Launch to launch your instance.

Tip

See also Manage Instance.
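
If you chose the Public Keys option, you can reach the instance over SSH once it shows as Running. The values below are placeholders; copy the exact command from your instance's How to Connect page in the console.

# Placeholder key path, user, and address; use the exact command shown in the console.
ssh -i ~/.ssh/<your-private-key> <user>@<instance-address>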

Launch Container

Next, access your instance and launch the TGI container. We use the Intel® Data Center GPU Max Series 1550 as an example.

Note

For details on the Intel® Max Series product family, see GPU Instances.
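
The docker command below mounts a Hugging Face cache directory from the host so downloaded weights persist across container restarts. HF_CACHE_DIR is only a shell variable consumed by that command; the path here is an example, but make sure the location has enough free space for the 70B weights (roughly 140 GB in bfloat16).

# Optional: point the Hugging Face cache at a disk with ample free space.
# The path is an example; the command below falls back to $HOME/.cache/huggingface
# if HF_CACHE_DIR is not set.
export HF_CACHE_DIR=/scratch/huggingface
mkdir -p "$HF_CACHE_DIR"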

docker run -it --rm \
--privileged \
--device=/dev/dri \
--ipc=host \
--ulimit memlock=-1 \
--shm-size=1g \
--cap-add=sys_nice \
--cap-add=IPC_LOCK \
-v ${HF_CACHE_DIR:-$HOME/.cache/huggingface}:/root/.cache/huggingface:rw \
-e HF_HOME=/root/.cache/huggingface \
-p 80:80 \
--entrypoint /bin/bash \
ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu
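
Once the container shell opens, you can confirm that the GPU device nodes were passed through; each Max Series card typically exposes a card/render node pair under /dev/dri.

# Inside the container: the --device=/dev/dri mapping should expose the GPUs
# as cardN/renderDN device nodes.
ls /dev/dri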

Start Model Server

  1. To expose the two tiles of each Intel® Data Center GPU Max Series 1550 card as a single device, set this in the container shell:

    export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
    
  2. In the container terminal, launch the model:

    MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
    text-generation-launcher \
    --model-id ${MODEL_ID} \
    --dtype bfloat16 \
    --max-concurrent-requests 128 \
    --max-batch-size 128 \
    --max-total-tokens 4096 \
    --max-input-length 2048 \
    --max-waiting-tokens 10 \
    --cuda-graphs 0 \
    --num-shard=4 \
    --port 80 \
    --json-output
    
  3. Wait for the model to fully load. You should see the message below; once it appears, you can send a test request (see the example after these steps):

    {"timestamp":"2025-01-30T20:05:22.031688Z","level":"INFO","message":"Connected"}
    

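With the server connected, you can send a quick request from a second terminal on the host to confirm it is serving. The prompt and parameters below are arbitrary examples for TGI's /generate endpoint, which is reachable on port 80 as mapped by the docker command above.

curl http://localhost:80/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is 2 + 2?", "parameters": {"max_new_tokens": 64}}'
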
Benchmarking

To benchmark the model, open a new terminal on the host and follow these steps.

  1. Get the running container ID:

    docker ps --filter ancestor=ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu --format "{{.ID}}"
    
  2. Connect to the container, using the output from the previous step.

    docker exec -it <CONTAINER_ID> bash
    
  3. Run the benchmark:

    MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
    text-generation-benchmark --tokenizer-name $MODEL_ID
    

The benchmark runs through a range of configurations and displays performance metrics when complete.
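
To focus on a specific workload rather than the default sweep, the benchmark tool accepts additional options. The flags below are a sketch; confirm their names and availability with text-generation-benchmark --help in your TGI version.

# A sketch of a more targeted run; confirm flag names with
# `text-generation-benchmark --help` for your TGI release.
MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
text-generation-benchmark \
    --tokenizer-name $MODEL_ID \
    --sequence-length 512 \
    --decode-length 512 \
    --batch-size 8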