Run DeepSeek-R1-Distill-Llama-70B¶
Learn to run DeepSeek-R1-Distill-Llama-70B using text-generation-inference (TGI) on Intel XPUs. You may also experiment with smaller DeepSeek-R1-Distill models.
The DeepSeek-R1-Distill-Llama-70B model is a distilled version of DeepSeek-R1, derived from the Llama-3.3-70B-Instruct architecture. It is designed to emulate the reasoning capabilities of the original 671-billion-parameter model while being far more efficient, making it a good choice for complex reasoning tasks with reduced computational requirements.
Prerequisites¶
Complete Get Started
Review and agree to DeepSeek-R1 License
Compute Instance¶
| Instance Type | Processor Model | Card Quantity | Disk | Memory |
|---|---|---|---|---|
| Bare Metal | Intel® Data Center GPU Max Series 1550 | 8 | 2TB | 2TB |
| Bare Metal | Intel® Data Center GPU Max Series 1100 | 8 | 960GB | 1TB |
Launch Instance¶
1. Visit the Intel® Tiber™ AI Cloud console home page.
2. Log into your account.
3. Click Catalog -> Hardware from the menu at left.
4. Click the filter GPU.
5. Select the instance type: Intel® Data Center GPU Max Series.
6. Complete the instance configuration, using one of the examples from Compute Instance above.
   * Use one of these configurations: Intel® Data Center GPU Max Series 1550 BM or Intel® Data Center GPU Max Series 1100 BM.
   * For Instance type, choose one prefixed with bare metal, or BM.
   * For Machine image, use the default.
7. Add an Instance name.
8. Choose an option to connect:
   * One-Click connection (recommended)
   * Public Keys
9. Click Launch to launch your instance.
Tip
See also Manage Instance.
Launch Container¶
Next, access your instance and launch the TGI container. We use the Intel® Data Center GPU Max Series 1550 as an example.
Note
For details on the Intel® Max Series product family, see GPU Instances.
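If you connect with Public Keys rather than the One-Click connection, open an SSH session to the instance first. The key path, user name, and address below are placeholders; use the connection details shown for your instance in the console:

```bash
# Placeholder values: substitute the private key, user name, and address
# from your instance's connection details in the console.
ssh -i ~/.ssh/<your-private-key> <user>@<instance-address>
```

Once you have a shell on the instance, start the TGI container: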
```bash
# Start the TGI XPU container with an interactive shell. The host Hugging Face
# cache is mounted read-write so downloaded model weights persist across runs.
docker run -it --rm \
  --privileged \
  --device=/dev/dri \
  --ipc=host \
  --ulimit memlock=-1 \
  --shm-size=1g \
  --cap-add=sys_nice \
  --cap-add=IPC_LOCK \
  -v ${HF_CACHE_DIR:-$HOME/.cache/huggingface}:/root/.cache/huggingface:rw \
  -e HF_HOME=/root/.cache/huggingface \
  -p 80:80 \
  --entrypoint /bin/bash \
  ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu
```
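text-generation-launcher downloads the weights automatically on first start, but the 70B checkpoint is large. If you prefer to fetch it up front into the mounted cache, here is a minimal sketch to run from inside the container (this assumes the huggingface-cli tool that ships with huggingface_hub is available in the image):

```bash
# Optional: pre-fetch the model into the cache mounted at /root/.cache/huggingface
# so the launcher can start without waiting on the download.
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B
```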
Start Model Server¶
To expose the two stacks of each Intel® Data Center GPU Max Series 1550 card as a single device, set this in the container shell:

```bash
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
```
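To confirm the setting took effect, you can list the visible devices; with the COMPOSITE hierarchy each 1550 card should appear as a single GPU device (8 on this instance) instead of one device per stack. This assumes the sycl-ls utility is present in the container image:

```bash
# Each Max 1550 card should now show up as a single Level Zero GPU device.
sycl-ls
```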
In the container terminal, launch the model:
```bash
MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B

text-generation-launcher \
  --model-id ${MODEL_ID} \
  --dtype bfloat16 \
  --max-concurrent-requests 128 \
  --max-batch-size 128 \
  --max-total-tokens 4096 \
  --max-input-length 2048 \
  --max-waiting-tokens 10 \
  --cuda-graphs 0 \
  --num-shard=4 \
  --port 80 \
  --json-output
```
Wait for the model to be fully loaded. You should see this message:
{"timestamp":"2025-01-30T20:05:22.031688Z","level":"INFO","message":"Connected"}
Benchmarking¶
To benchmark the model, open a new terminal on the instance and follow these steps.
Get the running container ID:
```bash
docker ps --filter ancestor=ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu --format "{{.ID}}"
```
Connect to the container, using the output from the previous step.
```bash
docker exec -it <CONTAINER_ID> bash
```
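If you prefer, the two steps above can be combined into a single command:

```bash
# Look up the TGI container ID and open a shell in it in one step.
docker exec -it $(docker ps --filter ancestor=ghcr.io/huggingface/text-generation-inference:3.0.2-intel-xpu --format "{{.ID}}") bash
```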
Run the benchmark:
```bash
MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B text-generation-benchmark --tokenizer-name $MODEL_ID
```
The benchmark runs through several configurations and displays performance metrics when complete.
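If you want to control the sweep (for example, fixed batch sizes or sequence lengths), the benchmark tool accepts additional options; the exact flag names can vary between TGI releases, so list what this image supports:

```bash
# Show the options accepted by the benchmark tool bundled in this image.
text-generation-benchmark --help
```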