This is a hands-on tutorial for deploying DeepSeek-R1 (671B parameters) as a distributed inference workload on NVIDIA Run:ai. Two nodes, 16 H100 GPUs, one API endpoint.
The tutorial follows the official Run:ai workflow: create access credentials, set up model caching, deploy the distributed workload, and test the endpoint. By the end, you will have a production-ready DeepSeek-R1 serving endpoint with authenticated access.
Prerequisites
Before starting, ensure:
- NVIDIA Run:ai platform is installed (self-hosted or SaaS)
- LeaderWorkerSet (LWS) controller is installed on the cluster
- 2+ nodes with 8× NVIDIA H100 80GB GPUs each
- InfiniBand networking between nodes (HDR minimum, NDR recommended)
- NGC account with an active API key
- A Run:ai project assigned to you by your administrator
- External access configured if you need to reach the endpoint from outside the cluster
Architecture
The deployment creates two pods forming a Leader-Worker Set:
Client Request
      │
      ▼
Load Balancer → NGINX Ingress
      │
      ▼
┌──────────────────────────────┐
│ Leader Pod (Node 0)          │
│ NIM_LEADER_ROLE=1            │
│ NIM_NODE_RANK=0              │
│ 8× H100 (TP=8)               │
│ Pipeline stage 0             │
│ Port 8000 (API endpoint)     │
└──────────────┬───────────────┘
               │ NCCL / InfiniBand
┌──────────────▼───────────────┐
│ Worker Pod (Node 1)          │
│ NIM_LEADER_ROLE=0            │
│ NIM_NODE_RANK=1              │
│ 8× H100 (TP=8)               │
│ Pipeline stage 1             │
└──────────────────────────────┘
Total: TP=8, PP=2 → 16 GPUs
Model Profile: sglang-h100-bf16-tp8-pp2
Step 1: Create a User Access Key
Access keys authenticate API calls to the Run:ai platform.
In the Run:ai UI:
- Click your user avatar (top right) → Settings
- Click +ACCESS KEY
- Enter a name (e.g., inference-deploy-key) and click CREATE
- Copy the Client ID and Client Secret and store them securely
Get an API token:
# Replace <COMPANY_URL> with your Run:ai URL
# SaaS: <tenant-name>.run.ai
# Self-hosted: your Run:ai UI URL
curl -X POST 'https://<COMPANY_URL>/api/v1/token' \
-H 'Content-Type: application/json' \
-d '{
"grantType": "client_credentials",
"clientId": "<CLIENT_ID>",
"clientSecret": "<CLIENT_SECRET>"
}'
Save the returned token. You will use it in every subsequent API call.
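To script this step instead of copying the token by hand, the same request can be built in Python. This is a minimal sketch using only the standard library. The response field name accessToken is an assumption about the token response and may differ in your Run:ai version, and build_token_request, extract_token, and fetch_token are hypothetical helper names.

```python
import json
import urllib.request


def build_token_request(company_url: str, client_id: str,
                        client_secret: str) -> urllib.request.Request:
    """Build the same POST /api/v1/token request as the curl command."""
    body = json.dumps({
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{company_url}/api/v1/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(response_body: str, field: str = "accessToken") -> str:
    """Pull the bearer token out of the JSON response.

    ASSUMPTION: the token is returned under "accessToken". Pass a
    different `field` if your Run:ai version names it differently.
    """
    token = json.loads(response_body).get(field)
    if not token:
        raise ValueError(f"no {field!r} field in token response")
    return token


def fetch_token(company_url: str, client_id: str, client_secret: str,
                timeout: float = 30.0) -> str:
    """Send the request and return the token (performs a network call)."""
    request = build_token_request(company_url, client_id, client_secret)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_token(response.read().decode("utf-8"))
```

Reading the client secret from an environment variable rather than hard-coding it keeps the credential out of scripts and shell history.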
Step 2: Create an NGC Credential
Store your NGC API key as a Run:ai user credential so it can be injected into workload containers.
In the Run:ai UI:
- Click your user avatar → Settings
- Click +CREDENTIAL → select NGC API key
- Enter a name (e.g., ngc-key)
- Paste your NGC API key
- Click CREATE CREDENTIAL
This credential will be referenced as <ngc-credential-name> in the workload configuration.
Step 3: Create a PVC Data Source
A Persistent Volume Claim (PVC) caches downloaded model weights. DeepSeek-R1 is ~650 GB, so you do not want to re-download it every time a pod restarts.
Via UI
- Go to Workload Manager → Data Sources
- Click +NEW DATA SOURCE → PVC
- Set the scope (cluster or department level)
- Name: nim-model-cache
- Select New PVC:
  - Storage class: your preferred class
  - Access mode: Read-write by many nodes (required for leader + worker to share)
  - Claim size: 2 TB (allows caching multiple models)
  - Volume mode: Filesystem
- Container path: /opt/nim/.cache
- Click CREATE DATA SOURCE
Via API
curl -L 'https://<COMPANY_URL>/api/v1/asset/datasource/pvc' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"meta": {
"name": "nim-model-cache",
"scope": "<scope>"
},
"spec": {
"path": "/opt/nim/.cache",
"existingPvc": false,
"claimInfo": {
"size": "2TB",
"storageClass": "<your-storage-class>",
"accessModes": {
"readWriteMany": true
},
"volumeMode": "Filesystem"
}
}
}'
Wait for the PVC to be provisioned. Note the claim name, which you will need in Step 4.
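If you create data sources from automation, building the request body in code avoids shell-quoting mistakes. The sketch below mirrors the JSON body of the API call above. build_pvc_payload is a hypothetical helper name, and the check that the claim is larger than the model is an added safety assumption, not something the Run:ai API enforces.

```python
import json

MODEL_SIZE_GB = 650  # approximate DeepSeek-R1 weight size quoted in this tutorial


def build_pvc_payload(name: str, scope: str, size_tb: float, storage_class: str,
                      path: str = "/opt/nim/.cache") -> dict:
    """Return the request body for the PVC data source call."""
    # Added sanity check (assumption): the cache must hold at least one model.
    if size_tb * 1000 <= MODEL_SIZE_GB:
        raise ValueError("claim is too small to hold the model weights")
    return {
        "meta": {"name": name, "scope": scope},
        "spec": {
            "path": path,
            "existingPvc": False,
            "claimInfo": {
                "size": f"{size_tb:g}TB",
                "storageClass": storage_class,
                # Leader and worker pods mount the same volume.
                "accessModes": {"readWriteMany": True},
                "volumeMode": "Filesystem",
            },
        },
    }


def to_json(payload: dict) -> str:
    """Serialize the payload for use as the HTTP request body."""
    return json.dumps(payload, indent=2)
```

Passing the output of to_json(build_pvc_payload("nim-model-cache", "<scope>", 2, "<your-storage-class>")) as the request body reproduces the curl example.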
Step 4: Deploy the Distributed Inference Workload
This is the core step. The API call creates a Leader-Worker Set with the full NIM configuration.
Understanding the Environment Variables
| Variable | Leader | Worker | Purpose |
|---|---|---|---|
| NIM_LEADER_ROLE | 1 | 0 | Identifies the leader vs worker |
| NIM_MULTI_NODE | 1 | 1 | Enables multi-node mode |
| NIM_NODE_RANK | Auto | Auto | Assigned from LWS worker-index label |
| NIM_TENSOR_PARALLEL_SIZE | 8 | 8 | Split layers across 8 GPUs per node |
| NIM_PIPELINE_PARALLEL_SIZE | 2 | 2 | Split model into 2 pipeline stages |
| NIM_NUM_COMPUTE_NODES | 2 | 2 | Total nodes in the deployment |
| NIM_MODEL_PROFILE | sglang-h100-bf16-tp8-pp2 | same | Optimized config for H100 |
| NIM_USE_SGLANG | 1 | 1 | Use SGLang inference runtime |
| NIM_TRUST_CUSTOM_CODE | 1 | 1 | Allow custom Python kernels |
| NGC_API_KEY | credential | credential | NGC authentication |
The NIM_NODE_RANK is automatically populated from the Kubernetes label leaderworkerset.sigs.k8s.io/worker-index, so you do not hard-code it.
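Leader and worker share almost every variable and differ mainly in NIM_LEADER_ROLE, so one way to avoid copy-paste drift between the two blocks of the REST payload is to generate both lists from a shared base. This is a minimal Python sketch: build_env, SHARED_ENV, and LEADER_ONLY_ENV are hypothetical names, and placing the two OMPI_MCA_* variables on the leader only is an assumption taken from the REST payload in this tutorial.

```python
from typing import Dict, List

# Values identical on leader and worker (taken from the table above).
SHARED_ENV: Dict[str, str] = {
    "NIM_USE_SGLANG": "1",
    "NIM_MULTI_NODE": "1",
    "NIM_TENSOR_PARALLEL_SIZE": "8",
    "NIM_PIPELINE_PARALLEL_SIZE": "2",
    "NIM_TRUST_CUSTOM_CODE": "1",
    "NIM_MODEL_PROFILE": "sglang-h100-bf16-tp8-pp2",
    "NIM_NUM_COMPUTE_NODES": "2",
}

# ASSUMPTION: these appear only in the leader block of the REST payload.
LEADER_ONLY_ENV: Dict[str, str] = {
    "OMPI_MCA_orte_keep_fqdn_hostnames": "1",
    "OMPI_MCA_plm_rsh_args": "-o ConnectionAttempts=20",
}

RANK_LABEL = "leaderworkerset.sigs.k8s.io/worker-index"


def build_env(is_leader: bool, ngc_credential: str) -> List[dict]:
    """Return the environmentVariables list for one role."""
    env: List[dict] = [
        {"name": "NGC_API_KEY",
         "userCredential": {"name": ngc_credential, "key": "NGC_API_KEY"}},
        {"name": "NIM_LEADER_ROLE", "value": "1" if is_leader else "0"},
    ]
    if is_leader:
        env += [{"name": k, "value": v} for k, v in LEADER_ONLY_ENV.items()]
    env += [{"name": k, "value": v} for k, v in SHARED_ENV.items()]
    # The rank is not hard-coded: it is read from the pod's LWS label.
    env.append({
        "name": "NIM_NODE_RANK",
        "podFieldRef": {"path": f"metadata.labels['{RANK_LABEL}']"},
    })
    return env
```

Calling build_env(True, "<ngc-credential-name>") and build_env(False, "<ngc-credential-name>") yields the leader and worker lists, so a change to the model profile or parallelism settings is made in one place.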
Via CLI
runai inference distributed submit deepseek-r1 \
-p <project-id> \
-i nvcr.io/nim/deepseek-ai/deepseek-r1:latest \
--workers 1 \
--serving-port "container=8000,authorization-type=authenticatedUsers" \
-g 8 \
--existing-pvc claimname=<pvc-claim-name>,path=/opt/nim/.cache \
--env-secret NGC_API_KEY=<ngc-credential-name>,key=NGC_API_KEY \
--environment NIM_NUM_COMPUTE_NODES=2 \
--environment NIM_LEADER_ROLE=1 \
--environment OMPI_MCA_orte_keep_fqdn_hostnames=1 \
--environment "OMPI_MCA_plm_rsh_args=-o ConnectionAttempts=20" \
--environment NIM_USE_SGLANG=1 \
--environment NIM_MULTI_NODE=1 \
--environment NIM_TENSOR_PARALLEL_SIZE=8 \
--environment NIM_PIPELINE_PARALLEL_SIZE=2 \
--environment NIM_TRUST_CUSTOM_CODE=1 \
--environment NIM_MODEL_PROFILE=sglang-h100-bf16-tp8-pp2 \
--env-pod-field-ref "NIM_NODE_RANK=metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"
Via REST API
curl -L 'https://<COMPANY_URL>/api/v1/workloads/distributed-inferences' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"name": "deepseek-r1",
"projectId": "<PROJECT_ID>",
"clusterId": "<CLUSTER_UUID>",
"spec": {
"workers": 1,
"servingPort": {
"port": 8000,
"authorizationType": "authenticatedUsers"
},
"leader": {
"image": "nvcr.io/nim/deepseek-ai/deepseek-r1:latest",
"environmentVariables": [
{
"name": "NGC_API_KEY",
"userCredential": {
"name": "<ngc-credential-name>",
"key": "NGC_API_KEY"
}
},
{ "name": "NIM_LEADER_ROLE", "value": "1" },
{ "name": "OMPI_MCA_orte_keep_fqdn_hostnames", "value": "1" },
{ "name": "OMPI_MCA_plm_rsh_args", "value": "-o ConnectionAttempts=20" },
{ "name": "NIM_USE_SGLANG", "value": "1" },
{ "name": "NIM_MULTI_NODE", "value": "1" },
{ "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
{ "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
{ "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
{ "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
{
"name": "NIM_NODE_RANK",
"podFieldRef": {
"path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
}
},
{ "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
],
"imagePullSecrets": [
{ "name": "<ngc-credential-name>", "userCredential": true }
],
"storage": {
"pvc": [{
"path": "/opt/nim/.cache",
"existingPvc": true,
"claimName": "<pvc-claim-name>"
}]
},
"compute": { "gpuDevicesRequest": 8 },
"security": {
"runAsUid": 1000,
"runAsGid": 1000,
"runAsNonRoot": true
}
},
"worker": {
"image": "nvcr.io/nim/deepseek-ai/deepseek-r1:latest",
"environmentVariables": [
{
"name": "NGC_API_KEY",
"userCredential": {
"name": "<ngc-credential-name>",
"key": "NGC_API_KEY"
}
},
{ "name": "NIM_LEADER_ROLE", "value": "0" },
{ "name": "NIM_USE_SGLANG", "value": "1" },
{ "name": "NIM_MULTI_NODE", "value": "1" },
{ "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
{ "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
{ "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
{ "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
{
"name": "NIM_NODE_RANK",
"podFieldRef": {
"path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
}
},
{ "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
],
"imagePullSecrets": [
{ "name": "<ngc-credential-name>", "userCredential": true }
],
"storage": {
"pvc": [{
"path": "/opt/nim/.cache",
"existingPvc": true,
"claimName": "<pvc-claim-name>"
}]
},
"compute": { "gpuDevicesRequest": 8 },
"security": {
"runAsUid": 1000,
"runAsGid": 1000,
"runAsNonRoot": true
}
}
}
}'
What Happens After Submission
- Run:ai creates a LeaderWorkerSet with 2 pods
- The scheduler places pods on nodes with available H100 GPUs, preferring same-rack placement
- PVC is mounted at /opt/nim/.cache on both pods
- First run: NIM downloads DeepSeek-R1 weights from NGC (~650 GB). This typically takes 15-30 minutes, and longer on slower links
- Subsequent runs: Model loads from PVC cache, with startup in 5-10 minutes
- Leader establishes NCCL communication with worker via InfiniBand
- SGLang runtime initializes with TP=8, PP=2 configuration
- Leader pod starts the OpenAI-compatible API server on port 8000
- Run:ai configures ingress routing to the leader pod
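Because this startup sequence can take anywhere from a few minutes to more than half an hour, scripts should wait for readiness rather than send requests immediately. A minimal polling sketch follows. It assumes the NIM container exposes a readiness route at /v1/health/ready and that the route is reachable through the same authenticated endpoint; verify both for your deployment. http_probe and wait_until_ready are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, token: str, timeout: float = 10.0) -> bool:
    """Return True when the URL answers with HTTP 200, False otherwise."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool],
                     max_wait_s: float = 3600.0,
                     interval_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Call `probe` every `interval_s` seconds until it succeeds or time runs out."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() + interval_s > deadline:
            return False  # another wait would exceed the budget
        sleep(interval_s)


# Example (performs network calls, so it is left commented out):
# ready = wait_until_ready(
#     lambda: http_probe("https://<ENDPOINT_URL>/v1/health/ready", "<RUNAI_TOKEN>"))
```

The probe is injected as a callable so the waiting logic can be tested without a cluster and reused with a different health check.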
Step 5: Test the Endpoint
Get the Endpoint URL
The inference endpoint is available through Run:ai's ingress. The URL format depends on your setup:
- Internal: http://deepseek-r1.runai-<project>.svc.cluster.local:8000
- External: https://<inference-endpoint>.<company-url>
Check the workload status in the Run:ai UI or via API to get the exact endpoint URL.
Send a Request
Since we configured authenticatedUsers access, include the Run:ai bearer token:
curl -X POST 'https://<ENDPOINT_URL>/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <RUNAI_TOKEN>' \
-d '{
"model": "deepseek-ai/deepseek-r1",
"messages": [
{
"role": "user",
"content": "Explain the difference between tensor parallelism and pipeline parallelism for LLM inference."
}
],
"max_tokens": 512,
"temperature": 0.7
}'
The response follows the OpenAI chat completions format:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "deepseek-ai/deepseek-r1",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Tensor parallelism splits individual layers across GPUs..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 256,
"total_tokens": 280
}
}
Streaming Responses
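With "stream": true, an OpenAI-compatible server returns server-sent events instead of a single JSON document: each event is a line starting with "data: " that carries a JSON chunk, and the stream ends with "data: [DONE]". The sketch below reassembles the generated text from such lines. It assumes the endpoint follows that standard chunk shape (choices[].delta.content); iter_stream_text is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and SSE comments carry no payload
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Offline demonstration with hand-written sample events.
sample_events = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"index": 0, "delta": {"content": "Nodes "}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "hum"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_text(sample_events)))
```

Any iterable of decoded lines works as input, for example the lines of an HTTP response body read incrementally. The raw request that produces such a stream looks like this: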
curl -X POST 'https://<ENDPOINT_URL>/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <RUNAI_TOKEN>' \
-d '{
"model": "deepseek-ai/deepseek-r1",
"messages": [{"role": "user", "content": "Write a haiku about distributed computing"}],
"max_tokens": 64,
"stream": true
}'
OpenShift Considerations
If running on OpenShift, the security context is mandatory:
"security": {
"runAsUid": 1000,
"runAsGid": 1000,
"runAsNonRoot": true
}
Without this, pods fail during model download because the PVC mount at /opt/nim/.cache defaults to root ownership, and OpenShift's restricted SCC blocks root execution.
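Since the security block has to be present on both the leader and the worker, a small pre-submission check can catch an omission before the pods are scheduled. This is a sketch only: missing_non_root is a hypothetical helper, and it inspects the "spec" object of the REST payload from Step 4 rather than querying OpenShift.

```python
from typing import List, Sequence


def missing_non_root(spec: dict,
                     roles: Sequence[str] = ("leader", "worker")) -> List[str]:
    """Return the roles whose security block is absent or would run as root."""
    problems: List[str] = []
    for role in roles:
        security = spec.get(role, {}).get("security", {})
        uid = security.get("runAsUid")
        # A missing UID, UID 0, or runAsNonRoot not set to true all count as a problem.
        if not security.get("runAsNonRoot") or uid in (None, 0):
            problems.append(role)
    return problems
```

An empty list means both roles carry a non-root security context; any role name in the result should be fixed before submitting the workload.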
Troubleshooting
Pods Stuck in Pending
# Check if GPUs are available
runai list nodes --gpu
# Check if LWS controller is running
kubectl get pods -n lws-system
Common causes:
- Not enough H100 nodes with 8 free GPUs
- LWS controller not installed
- Scheduling conflict with training workloads
NCCL Communication Failures
# Check pod logs for NCCL errors
runai logs deepseek-r1 --leader
runai logs deepseek-r1 --worker 0
Common causes:
- InfiniBand not configured or unavailable
- Nodes in different network segments
- Missing RDMA device plugin
Model Download Timeout
First download of DeepSeek-R1 (~650 GB) can take 30+ minutes. If it times out:
- Increase the pod startup timeout in Run:ai settings
- Pre-populate the PVC by running a data loading job first
- Verify NGC API key has access to the DeepSeek-R1 container
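Whether the download fits inside the startup timeout is simple arithmetic on model size and sustained bandwidth. The sketch below is a rough estimate under stated assumptions: it uses decimal units (1 GB = 8 Gbit), treats bandwidth as constant, and ignores storage write speed. download_minutes and required_gbit_s are hypothetical helper names.

```python
def download_minutes(size_gb: float, bandwidth_gbit_s: float,
                     efficiency: float = 1.0) -> float:
    """Estimate transfer time in minutes for `size_gb` gigabytes.

    `efficiency` is the fraction of nominal bandwidth actually achieved
    (an assumption you supply; 1.0 means the full line rate).
    """
    if bandwidth_gbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    seconds = size_gb * 8 / (bandwidth_gbit_s * efficiency)
    return seconds / 60


def required_gbit_s(size_gb: float, minutes: float) -> float:
    """Sustained bandwidth needed to move `size_gb` gigabytes in `minutes`."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return size_gb * 8 / (minutes * 60)


# ~650 GB over a sustained 10 Gbit/s link: about 8.7 minutes.
print(round(download_minutes(650, 10), 1))
# Finishing ~650 GB in 30 minutes needs about 2.9 Gbit/s sustained.
print(round(required_gbit_s(650, 30), 1))
```

If the measured throughput from NGC is well below the required figure, pre-populating the PVC or raising the timeout is the more reliable fix.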
Worker Cannot Connect to Leader
The OMPI_MCA_plm_rsh_args=-o ConnectionAttempts=20 environment variable passes ConnectionAttempts=20 to the SSH launcher that Open MPI uses, allowing up to 20 connection attempts while the other pod is still starting. If it still fails:
- Verify OMPI_MCA_orte_keep_fqdn_hostnames=1 is set on the leader
- Check that both pods are in the same Kubernetes namespace
- Verify no NetworkPolicy is blocking inter-pod traffic
Adapting for Other Models
This tutorial uses DeepSeek-R1, but the pattern works for any model that supports NIM multinode:
| Model | Image | Profile | TP | PP | Nodes |
|---|---|---|---|---|---|
| DeepSeek-R1 | deepseek-ai/deepseek-r1 | sglang-h100-bf16-tp8-pp2 | 8 | 2 | 2 |
| Llama 3.1 405B | meta/llama-3.1-405b-instruct | Check NIM docs | 8 | 2 | 2 |
| Nemotron 340B | nvidia/nemotron-340b | Check NIM docs | 8 | 2 | 2 |
Update the image, model profile, and parallelism settings for your target model. Check the NIM support matrix for hardware-specific profiles.
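When changing the parallelism settings, two quick checks help: the product TP × PP should equal the number of GPUs the deployment actually requests, and the weights per GPU should leave room in an 80 GB card. The sketch below does both. It assumes weights are sharded evenly across all GPUs and ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound. ParallelLayout and weights_per_gpu_gb are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    tensor_parallel: int      # NIM_TENSOR_PARALLEL_SIZE
    pipeline_parallel: int    # NIM_PIPELINE_PARALLEL_SIZE
    nodes: int                # NIM_NUM_COMPUTE_NODES
    gpus_per_node: int = 8

    @property
    def gpus_required(self) -> int:
        """GPUs implied by the parallelism settings."""
        return self.tensor_parallel * self.pipeline_parallel

    @property
    def gpus_available(self) -> int:
        """GPUs requested across all pods."""
        return self.nodes * self.gpus_per_node

    def validate(self) -> None:
        if self.gpus_required != self.gpus_available:
            raise ValueError(
                f"TP x PP = {self.gpus_required}, but the deployment "
                f"requests {self.gpus_available} GPUs")


def weights_per_gpu_gb(weights_gb: float, layout: ParallelLayout) -> float:
    """Even-sharding estimate of weight memory per GPU, in GB."""
    layout.validate()
    return weights_gb / layout.gpus_required


deepseek = ParallelLayout(tensor_parallel=8, pipeline_parallel=2, nodes=2)
# ~650 GB of weights over 16 GPUs: 40.625 GB per GPU before any runtime overhead.
print(weights_per_gpu_gb(650, deepseek))
```

For the DeepSeek-R1 row this gives roughly 40 GB of weights per 80 GB H100, which is why the remaining memory is available for KV cache and activations.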
Related Resources
- NVIDIA Run:ai Distributed Inference Overview
- NVIDIA NIM Multinode Inference
- NVIDIA GPU Operator on Kubernetes
- Multi-Tenant GPUs on Bare Metal
- The Inference Gold Rush
- FinOps for AI: GPU Cost Optimization
- Autoscaling AI Inference
- Official Run:ai Tutorial
About the Author
I am Luca Berton, AI and Cloud Advisor. I design GPU inference platforms for enterprises deploying large language models.