Run DeepSeek-R1-Distill-Llama-70B¶
Learn to run DeepSeek-R1-Distill-Llama-70B using text-generation-inference (TGI) on Intel XPUs. You may also experiment with smaller DeepSeek-R1-Distill models.
The DeepSeek-R1-Distill-Llama-70B model is a distilled version of DeepSeek-R1, derived from the Llama-3.3-70B-Instruct architecture. It is designed to emulate the reasoning capabilities of the original 671-billion-parameter model while being far more efficient, making it a good choice for complex reasoning tasks with reduced computational requirements.
Prerequisites¶
Complete Get Started
Review and agree to DeepSeek-R1 License
Compute Instance¶
| Instance Type | Processor Model | Card Quantity | Disk | Memory |
|---|---|---|---|---|
| Bare Metal | Intel® Data Center GPU Max Series 1550 | 8 | 2TB | 2TB |
| Bare Metal | Intel® Data Center GPU Max Series 1100 | 8 | 960GB | 1TB |
Launch Instance¶
1. Visit the Intel® Tiber™ AI Cloud console home page.
2. Log into your account.
3. Click Catalog -> Hardware from the menu at left.
4. Click the filter GPU.
5. Select the instance type: Intel® Data Center GPU Max Series.
6. Complete the instance configuration, using one of the examples from Compute Instance above.
   * Use one of these configurations: Intel® Data Center GPU Max Series 1550 BM or Intel® Data Center GPU Max Series 1100 BM.
   * For Instance type, choose one prefixed with bare metal, or BM.
   * For Machine image, use the default.
7. Add an Instance name.
8. Choose an option to connect:
   * One-Click connection (recommended)
   * Public Keys
9. Click Launch to launch your instance.
Tip
See also Manage Instance.
Launch Container¶
Next, access your instance and launch the TGI container. We use the Intel® Data Center GPU Max Series 1550 as an example.
Note
For details on the Intel® Max Series product family, see GPU Instances.
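If you connect with Public Keys rather than the One-Click connection, open an SSH session to the instance first. The key path, user name, and address below are placeholders; use the connection details shown for your instance in the console:

```bash
# Placeholder values: substitute the private key, user name, and address
# from your instance's connection details in the console.
ssh -i ~/.ssh/<your-private-key> <user>@<instance-address>
```

Once you have a shell on the instance, start the TGI container: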
```bash
# Start the TGI XPU container with an interactive shell. The host Hugging Face
# cache is mounted read-write so downloaded model weights persist across runs.
docker run -it --rm \
  --privileged \
  --device=/dev/dri \
  --ipc=host \
  --ulimit memlock=-1 \
  --shm-size=1g \
  --cap-add=sys_nice \
  --cap-add=IPC_LOCK \
  -v ${HF_CACHE_DIR:-$HOME/.cache/huggingface}:/root/.cache/huggingface:rw \
  -e HF_HOME=/root/.cache/huggingface \
  -p 80:80 \
  --entrypoint /bin/bash \
  ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu
```
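text-generation-launcher downloads the weights automatically on first start, but the 70B checkpoint is large. If you prefer to fetch it up front into the mounted cache, here is a minimal sketch to run from inside the container (this assumes the huggingface-cli tool that ships with huggingface_hub is available in the image):

```bash
# Optional: pre-fetch the model into the cache mounted at /root/.cache/huggingface
# so the launcher can start without waiting on the download.
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B
```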
Start Model Server¶
To expose the two stacks of each Intel® Data Center GPU Max Series 1550 card as a single device, set this in the container shell:

```bash
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
```
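To confirm the setting took effect, you can list the visible devices; with the COMPOSITE hierarchy each 1550 card should appear as a single GPU device (8 on this instance) instead of one device per stack. This assumes the sycl-ls utility is present in the container image:

```bash
# Each Max 1550 card should now show up as a single Level Zero GPU device.
sycl-ls
```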
In the container terminal, launch the model:
```bash
MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B

text-generation-launcher \
  --model-id ${MODEL_ID} \
  --dtype bfloat16 \
  --max-concurrent-requests 128 \
  --max-batch-size 128 \
  --max-total-tokens 4096 \
  --max-input-length 2048 \
  --max-waiting-tokens 10 \
  --cuda-graphs 0 \
  --num-shard=4 \
  --port 80 \
  --json-output
```
Wait for the model to be fully loaded. You should see this message:
{"timestamp":"2025-01-30T20:05:22.031688Z","level":"INFO","message":"Connected"}
Benchmarking¶
To benchmark the model, open a new terminal on the instance and follow these steps.
Get the running container ID:
```bash
docker ps --filter ancestor=ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu --format "{{.ID}}"
```
Connect to the container, using the output from the previous step.
```bash
docker exec -it <CONTAINER_ID> bash
```
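If you prefer, the two steps above can be combined into a single command:

```bash
# Look up the TGI container ID and open a shell in it in one step.
docker exec -it $(docker ps --filter ancestor=ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu --format "{{.ID}}") bash
```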
Run the benchmark:
```bash
MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B text-generation-benchmark --tokenizer-name $MODEL_ID
```
The benchmark runs through several configurations and displays performance metrics when complete.
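If you want to control the sweep (for example, fixed batch sizes or sequence lengths), the benchmark tool accepts additional options; the exact flag names can vary between TGI releases, so list what this image supports:

```bash
# Show the options accepted by the benchmark tool bundled in this image.
text-generation-benchmark --help
```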