
Deploy DeepSeek-R1 671B Distributed with Run:ai + NIM

Step-by-step tutorial for deploying DeepSeek-R1 671B across 2 nodes with NVIDIA Run:ai: Leader-Worker Sets, NIM profiles, SGLang runtime, and PVC caching.

Luca Berton
· 5 min read

This is a hands-on tutorial for deploying DeepSeek-R1 (671B parameters) as a distributed inference workload on NVIDIA Run:ai. Two nodes, 16 H100 GPUs, one API endpoint.

The tutorial follows the official Run:ai workflow: create access credentials, set up model caching, deploy the distributed workload, and test the endpoint. By the end, you will have a production-ready DeepSeek-R1 serving endpoint with authenticated access.

Prerequisites

Before starting, ensure:

  • NVIDIA Run:ai platform is installed (self-hosted or SaaS)
  • LeaderWorkerSet (LWS) controller is installed on the cluster
  • 2+ nodes with 8× NVIDIA H100 80GB GPUs each
  • InfiniBand networking between nodes (HDR minimum, NDR recommended)
  • NGC account with an active API key
  • A Run:ai project assigned to you by your administrator
  • External access configured if you need to reach the endpoint from outside the cluster

Architecture

The deployment creates two pods forming a Leader-Worker Set:

Client Request
     │
     ▼
Load Balancer → NGINX Ingress
     │
     ▼
┌─────────────────────────────┐
│  Leader Pod (Node 0)        │
│  NIM_LEADER_ROLE=1          │
│  NIM_NODE_RANK=0            │
│  8× H100 (TP=8)             │
│  Pipeline stage 0           │
│  Port 8000 (API endpoint)   │
└──────────┬──────────────────┘
           │ NCCL / InfiniBand
┌──────────▼──────────────────┐
│  Worker Pod (Node 1)        │
│  NIM_LEADER_ROLE=0          │
│  NIM_NODE_RANK=1            │
│  8× H100 (TP=8)             │
│  Pipeline stage 1           │
└─────────────────────────────┘

Total: TP=8, PP=2 → 16 GPUs
Model Profile: sglang-h100-bf16-tp8-pp2
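The profile name encodes the layout shown in the diagram. As a rough illustration, the sketch below splits it into its parts and derives the GPU count. The helper name `parse_profile` is hypothetical, and the five-field naming pattern is assumed from this one profile; other NIM profiles may be named differently.

```python
import re


def parse_profile(profile: str) -> dict:
    """Split a profile name like 'sglang-h100-bf16-tp8-pp2' into its fields.

    Assumes the runtime-gpu-precision-tpN-ppM layout used in this tutorial.
    """
    match = re.fullmatch(r"(\w+)-(\w+)-(\w+)-tp(\d+)-pp(\d+)", profile)
    if match is None:
        raise ValueError(f"unrecognised profile name: {profile!r}")
    runtime, gpu, precision, tp, pp = match.groups()
    return {
        "runtime": runtime,
        "gpu": gpu,
        "precision": precision,
        "tp": int(tp),
        "pp": int(pp),
        # Each pipeline stage holds one tensor-parallel group.
        "total_gpus": int(tp) * int(pp),
    }


info = parse_profile("sglang-h100-bf16-tp8-pp2")
print(info["total_gpus"])  # 16
```

With TP=8 and PP=2 the product is 16, which matches two nodes of 8 GPUs each.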

Step 1: Create a User Access Key

Access keys authenticate API calls to the Run:ai platform.

In the Run:ai UI:

  1. Click your user avatar (top right) → Settings
  2. Click +ACCESS KEY
  3. Enter a name (e.g., inference-deploy-key) and click CREATE
  4. Copy the Client ID and Client Secret and store them securely

Get an API token:

# Replace <COMPANY_URL> with your Run:ai URL
# SaaS: <tenant-name>.run.ai
# Self-hosted: your Run:ai UI URL

curl -X POST 'https://<COMPANY_URL>/api/v1/token' \
  -H 'Content-Type: application/json' \
  -d '{
    "grantType": "client_credentials",
    "clientId": "<CLIENT_ID>",
    "clientSecret": "<CLIENT_SECRET>"
  }'

Save the returned token; you will use it in every subsequent API call.
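If you prefer to script the token request, the same call can be sketched in Python with only the standard library. The host `tenant.example.com` and the function names are placeholders, and the response field name `accessToken` is an assumption: check the JSON your Run:ai version actually returns before relying on it.

```python
import json
import urllib.request


def build_token_request(company_url: str, client_id: str,
                        client_secret: str) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    body = json.dumps({
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret,
    }).encode("utf-8")
    return urllib.request.Request(
        f"https://{company_url}/api/v1/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request: urllib.request.Request,
                field: str = "accessToken") -> str:
    """Send the request and return the token.

    The field name is an assumption, not confirmed by this tutorial.
    """
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)[field]


# Placeholder values; nothing is sent until fetch_token(req) is called.
req = build_token_request("tenant.example.com", "<CLIENT_ID>", "<CLIENT_SECRET>")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from the network call makes it easy to inspect the payload before sending it.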

Step 2: Create an NGC Credential

Store your NGC API key as a Run:ai user credential so it can be injected into workload containers.

In the Run:ai UI:

  1. Click user avatar → Settings
  2. Click +CREDENTIAL → select NGC API key
  3. Enter a name (e.g., ngc-key)
  4. Paste your NGC API key
  5. Click CREATE CREDENTIAL

This credential will be referenced as <ngc-credential-name> in the workload configuration.

Step 3: Create a PVC Data Source

A Persistent Volume Claim caches downloaded model weights. DeepSeek-R1 is ~650 GB, so you do not want to re-download it every time a pod restarts.

Via UI

  1. Go to Workload Manager → Data Sources
  2. Click +NEW DATA SOURCE → PVC
  3. Set the scope (cluster or department level)
  4. Name: nim-model-cache
  5. Select New PVC:
    • Storage class: your preferred class
    • Access mode: Read-write by many nodes (required for leader + worker to share)
    • Claim size: 2 TB (allows caching multiple models)
    • Volume mode: Filesystem
    • Container path: /opt/nim/.cache
  6. Click CREATE DATA SOURCE

Via API

curl -L 'https://<COMPANY_URL>/api/v1/asset/datasource/pvc' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "meta": {
      "name": "nim-model-cache",
      "scope": "<scope>"
    },
    "spec": {
      "path": "/opt/nim/.cache",
      "existingPvc": false,
      "claimInfo": {
        "size": "2TB",
        "storageClass": "<your-storage-class>",
        "accessModes": {
          "readWriteMany": true
        },
        "volumeMode": "Filesystem"
      }
    }
  }'

Wait for the PVC to be provisioned. Note the claim name; you will need it in Step 4.
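When you create data sources for several projects, it can help to generate the request body instead of editing JSON by hand. The sketch below mirrors the payload from the curl call above; the function name `pvc_payload` is hypothetical, and the placeholder strings must be replaced with your own scope and storage class.

```python
import json


def pvc_payload(name: str, scope: str, storage_class: str,
                size: str = "2TB", path: str = "/opt/nim/.cache") -> dict:
    """Return the request body for the PVC data source shown above."""
    return {
        "meta": {"name": name, "scope": scope},
        "spec": {
            "path": path,
            "existingPvc": False,  # ask Run:ai to create a new claim
            "claimInfo": {
                "size": size,
                "storageClass": storage_class,
                # Leader and worker pods mount the same cache.
                "accessModes": {"readWriteMany": True},
                "volumeMode": "Filesystem",
            },
        },
    }


payload = pvc_payload("nim-model-cache", "<scope>", "<your-storage-class>")
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to curl with `-d @file.json` or sent with any HTTP client.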

Step 4: Deploy the Distributed Inference Workload

This is the core step. The API call creates a Leader-Worker Set with the full NIM configuration.

Understanding the Environment Variables

| Variable | Leader | Worker | Purpose |
|---|---|---|---|
| NIM_LEADER_ROLE | 1 | 0 | Identifies the leader vs worker |
| NIM_MULTI_NODE | 1 | 1 | Enables multi-node mode |
| NIM_NODE_RANK | Auto | Auto | Assigned from LWS worker-index label |
| NIM_TENSOR_PARALLEL_SIZE | 8 | 8 | Split layers across 8 GPUs per node |
| NIM_PIPELINE_PARALLEL_SIZE | 2 | 2 | Split model into 2 pipeline stages |
| NIM_NUM_COMPUTE_NODES | 2 | 2 | Total nodes in the deployment |
| NIM_MODEL_PROFILE | sglang-h100-bf16-tp8-pp2 | same | Optimized config for H100 |
| NIM_USE_SGLANG | 1 | 1 | Use SGLang inference runtime |
| NIM_TRUST_CUSTOM_CODE | 1 | 1 | Allow custom Python kernels |
| NGC_API_KEY | credential | credential | NGC authentication |

The NIM_NODE_RANK is automatically populated from the Kubernetes label leaderworkerset.sigs.k8s.io/worker-index, so you do not hard-code it.
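To make the leader/worker split concrete, here is a minimal sketch that generates the literal-valued variables for each role. The function name `nim_env` is hypothetical; NIM_NODE_RANK and NGC_API_KEY are left out because they are injected by reference (pod label and credential) rather than set as plain values.

```python
def nim_env(role: str, tp: int = 8, pp: int = 2, nodes: int = 2,
            profile: str = "sglang-h100-bf16-tp8-pp2") -> dict:
    """Return the literal NIM environment variables for one role."""
    if role not in ("leader", "worker"):
        raise ValueError(f"role must be 'leader' or 'worker', got {role!r}")
    return {
        "NIM_LEADER_ROLE": "1" if role == "leader" else "0",
        "NIM_MULTI_NODE": "1",
        "NIM_TENSOR_PARALLEL_SIZE": str(tp),
        "NIM_PIPELINE_PARALLEL_SIZE": str(pp),
        "NIM_NUM_COMPUTE_NODES": str(nodes),
        "NIM_MODEL_PROFILE": profile,
        "NIM_USE_SGLANG": "1",
        "NIM_TRUST_CUSTOM_CODE": "1",
    }


leader = nim_env("leader")
worker = nim_env("worker")
# Names whose values differ between the two roles.
differing = sorted(name for name in leader if leader[name] != worker[name])
print(differing)  # ['NIM_LEADER_ROLE']
```

Among the NIM variables in the table, only NIM_LEADER_ROLE differs between the two pods.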

Via CLI

runai inference distributed submit deepseek-r1 \
  -p <project-id> \
  -i nvcr.io/nim/deepseek-ai/deepseek-r1:latest \
  --workers 1 \
  --serving-port "container=8000,authorization-type=authenticatedUsers" \
  -g 8 \
  --existing-pvc claimname=<pvc-claim-name>,path=/opt/nim/.cache \
  --env-secret NGC_API_KEY=<ngc-credential-name>,key=NGC_API_KEY \
  --environment NIM_NUM_COMPUTE_NODES=2 \
  --environment NIM_LEADER_ROLE=1 \
  --environment OMPI_MCA_orte_keep_fqdn_hostnames=1 \
  --environment "OMPI_MCA_plm_rsh_args=-o ConnectionAttempts=20" \
  --environment NIM_USE_SGLANG=1 \
  --environment NIM_MULTI_NODE=1 \
  --environment NIM_TENSOR_PARALLEL_SIZE=8 \
  --environment NIM_PIPELINE_PARALLEL_SIZE=2 \
  --environment NIM_TRUST_CUSTOM_CODE=1 \
  --environment NIM_MODEL_PROFILE=sglang-h100-bf16-tp8-pp2 \
  --env-pod-field-ref "NIM_NODE_RANK=metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"

Via REST API

curl -L 'https://<COMPANY_URL>/api/v1/workloads/distributed-inferences' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "deepseek-r1",
    "projectId": "<PROJECT_ID>",
    "clusterId": "<CLUSTER_UUID>",
    "spec": {
      "workers": 1,
      "servingPort": {
        "port": 8000,
        "authorizationType": "authenticatedUsers"
      },
      "leader": {
        "image": "nvcr.io/nim/deepseek-ai/deepseek-r1:latest",
        "environmentVariables": [
          {
            "name": "NGC_API_KEY",
            "userCredential": {
              "name": "<ngc-credential-name>",
              "key": "NGC_API_KEY"
            }
          },
          { "name": "NIM_LEADER_ROLE", "value": "1" },
          { "name": "OMPI_MCA_orte_keep_fqdn_hostnames", "value": "1" },
          { "name": "OMPI_MCA_plm_rsh_args", "value": "-o ConnectionAttempts=20" },
          { "name": "NIM_USE_SGLANG", "value": "1" },
          { "name": "NIM_MULTI_NODE", "value": "1" },
          { "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
          { "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
          { "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
          { "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
          {
            "name": "NIM_NODE_RANK",
            "podFieldRef": {
              "path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
            }
          },
          { "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
        ],
        "imagePullSecrets": [
          { "name": "<ngc-credential-name>", "userCredential": true }
        ],
        "storage": {
          "pvc": [{
            "path": "/opt/nim/.cache",
            "existingPvc": true,
            "claimName": "<pvc-claim-name>"
          }]
        },
        "compute": { "gpuDevicesRequest": 8 },
        "security": {
          "runAsUid": 1000,
          "runAsGid": 1000,
          "runAsNonRoot": true
        }
      },
      "worker": {
        "image": "nvcr.io/nim/deepseek-ai/deepseek-r1:latest",
        "environmentVariables": [
          {
            "name": "NGC_API_KEY",
            "userCredential": {
              "name": "<ngc-credential-name>",
              "key": "NGC_API_KEY"
            }
          },
          { "name": "NIM_LEADER_ROLE", "value": "0" },
          { "name": "NIM_USE_SGLANG", "value": "1" },
          { "name": "NIM_MULTI_NODE", "value": "1" },
          { "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
          { "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
          { "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
          { "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
          {
            "name": "NIM_NODE_RANK",
            "podFieldRef": {
              "path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
            }
          },
          { "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
        ],
        "imagePullSecrets": [
          { "name": "<ngc-credential-name>", "userCredential": true }
        ],
        "storage": {
          "pvc": [{
            "path": "/opt/nim/.cache",
            "existingPvc": true,
            "claimName": "<pvc-claim-name>"
          }]
        },
        "compute": { "gpuDevicesRequest": 8 },
        "security": {
          "runAsUid": 1000,
          "runAsGid": 1000,
          "runAsNonRoot": true
        }
      }
    }
  }'
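The leader and worker sections of this payload are nearly identical, so a copy-paste slip is easy to miss. As a sanity check before submitting, a small sketch like the following lists the literal environment variables that differ between the two roles. The helper names are hypothetical, and the sample spec is trimmed to a few variables for brevity.

```python
def literal_env(role_spec: dict) -> dict:
    """Map name -> value for entries that carry a literal 'value' field."""
    return {
        entry["name"]: entry["value"]
        for entry in role_spec["environmentVariables"]
        if "value" in entry
    }


def env_differences(spec: dict) -> dict:
    """Return {name: (leader_value, worker_value)} where the roles differ."""
    leader = literal_env(spec["leader"])
    worker = literal_env(spec["worker"])
    names = sorted(set(leader) | set(worker))
    return {
        name: (leader.get(name), worker.get(name))
        for name in names
        if leader.get(name) != worker.get(name)
    }


# Trimmed-down version of the "spec" object from the request above.
sample_spec = {
    "leader": {"environmentVariables": [
        {"name": "NIM_LEADER_ROLE", "value": "1"},
        {"name": "OMPI_MCA_orte_keep_fqdn_hostnames", "value": "1"},
        {"name": "NIM_MULTI_NODE", "value": "1"},
    ]},
    "worker": {"environmentVariables": [
        {"name": "NIM_LEADER_ROLE", "value": "0"},
        {"name": "NIM_MULTI_NODE", "value": "1"},
    ]},
}
diff = env_differences(sample_spec)
print(diff)
```

For the full payload, the expected differences are NIM_LEADER_ROLE and the two OMPI_MCA_* variables, which are set only on the leader. Anything else in the output deserves a second look.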

What Happens After Submission

  1. Run:ai creates a LeaderWorkerSet with 2 pods
  2. The scheduler places pods on nodes with available H100 GPUs, preferring same-rack placement
  3. PVC is mounted at /opt/nim/.cache on both pods
  4. First run: NIM downloads DeepSeek-R1 weights from NGC (~650 GB). This takes 15-30 minutes depending on bandwidth
  5. Subsequent runs: Model loads from PVC cache, with startup in 5-10 minutes
  6. Leader establishes NCCL communication with worker via InfiniBand
  7. SGLang runtime initializes with TP=8, PP=2 configuration
  8. Leader pod starts the OpenAI-compatible API server on port 8000
  9. Run:ai configures ingress routing to the leader pod
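The first-run download time in step 4 follows directly from the model size and your sustained throughput to NGC. A quick back-of-the-envelope sketch, assuming decimal units (1 GB = 8 Gbit) and ignoring protocol overhead and storage write speed:

```python
def download_minutes(size_gb: float, throughput_gbit_s: float) -> float:
    """Estimate download time in minutes for a given sustained throughput."""
    seconds = size_gb * 8 / throughput_gbit_s  # GB -> Gbit, then divide by rate
    return seconds / 60


# The throughput values are illustrative, not measurements.
for rate in (3, 6, 10):
    print(f"{rate} Gbit/s -> {download_minutes(650, rate):.0f} min")
```

Under these assumptions, the 15-30 minute range corresponds to roughly 3-6 Gbit/s of sustained download throughput.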

Step 5: Test the Endpoint

Get the Endpoint URL

The inference endpoint is available through Run:ai's ingress. The URL format depends on your setup:

  • Internal: http://deepseek-r1.runai-<project>.svc.cluster.local:8000
  • External: https://<inference-endpoint>.<company-url>

Check the workload status in the Run:ai UI or via API to get the exact endpoint URL.

Send a Request

Since we configured authenticatedUsers access, include the Run:ai bearer token:

curl -X POST 'https://<ENDPOINT_URL>/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <RUNAI_TOKEN>' \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [
      {
        "role": "user",
        "content": "Explain the difference between tensor parallelism and pipeline parallelism for LLM inference."
      }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

The response follows the OpenAI chat completions format:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "deepseek-ai/deepseek-r1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Tensor parallelism splits individual layers across GPUs..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 256,
    "total_tokens": 280
  }
}
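In a script you usually want just the generated text and the token counts. A minimal sketch that extracts them from a response shaped like the one above (the function name `summarize_completion` is hypothetical):

```python
import json


def summarize_completion(raw: str) -> dict:
    """Pull the reply text, finish reason, and token usage from a response."""
    data = json.loads(raw)
    choice = data["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": data["usage"]["total_tokens"],
    }


# Same shape as the example response above.
sample = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "deepseek-ai/deepseek-r1",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Tensor parallelism splits individual layers across GPUs..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 256, "total_tokens": 280}
}
"""
result = summarize_completion(sample)
print(result["finish_reason"], result["total_tokens"])  # stop 280
```

A finish_reason of "length" instead of "stop" would indicate the reply was cut off by max_tokens.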

Streaming Responses

curl -X POST 'https://<ENDPOINT_URL>/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <RUNAI_TOKEN>' \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [{"role": "user", "content": "Write a haiku about distributed computing"}],
    "max_tokens": 64,
    "stream": true
  }'
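With "stream": true, OpenAI-compatible servers send server-sent events: one `data:` line per chunk, ending with `data: [DONE]`. The sketch below reassembles the text from such lines. It assumes the standard OpenAI chunk shape (`choices[].delta.content`); the sample lines are made up for illustration and are not captured NIM output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Illustrative input only.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Nodes "}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "whisper"}}]}',
    "data: [DONE]",
]
streamed = "".join(iter_stream_text(sample_lines))
print(streamed)  # Nodes whisper
```

In practice you would feed this generator the decoded lines of the HTTP response body as they arrive.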

OpenShift Considerations

If running on OpenShift, the security context is mandatory:

"security": {
  "runAsUid": 1000,
  "runAsGid": 1000,
  "runAsNonRoot": true
}

Without this, pods fail during model download because the PVC mount at /opt/nim/.cache defaults to root ownership, and OpenShift's restricted SCC blocks root execution.

Troubleshooting

Pods Stuck in Pending

# Check if GPUs are available
runai list nodes --gpu

# Check if LWS controller is running
kubectl get pods -n lws-system

Common causes:

  • Not enough H100 nodes with 8 free GPUs
  • LWS controller not installed
  • Scheduling conflict with training workloads

NCCL Communication Failures

# Check pod logs for NCCL errors
runai logs deepseek-r1 --leader
runai logs deepseek-r1 --worker 0

Common causes:

  • InfiniBand not configured or unavailable
  • Nodes in different network segments
  • Missing RDMA device plugin

Model Download Timeout

The first download of DeepSeek-R1 (~650 GB) can take 30 minutes or longer on slower links. If it times out:

  • Increase the pod startup timeout in Run:ai settings
  • Pre-populate the PVC by running a data loading job first
  • Verify NGC API key has access to the DeepSeek-R1 container

Worker Cannot Connect to Leader

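The OMPI_MCA_plm_rsh_args=-o ConnectionAttempts=20 environment variable passes -o ConnectionAttempts=20 to the SSH command that Open MPI uses to launch remote processes, so the connection between leader and worker is attempted up to 20 times. If it still fails: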

  • Verify OMPI_MCA_orte_keep_fqdn_hostnames=1 is set on the leader
  • Check that both pods run in the same Kubernetes namespace
  • Verify no NetworkPolicy is blocking inter-pod traffic

Adapting for Other Models

This tutorial uses DeepSeek-R1, but the pattern works for any model that supports NIM multinode:

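| Model | Image | Profile | TP | PP | Nodes |
|---|---|---|---|---|---|
| DeepSeek-R1 | deepseek-ai/deepseek-r1 | sglang-h100-bf16-tp8-pp2 | 8 | 2 | 2 |
| Llama 3.1 405B | meta/llama-3.1-405b-instruct | Check NIM docs | 8 | 2 | 2 |
| Nemotron 340B | nvidia/nemotron-340b | Check NIM docs | 8 | 2 | 2 |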

Update the image, model profile, and parallelism settings for your target model. Check the NIM support matrix for hardware-specific profiles.
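When you change the parallelism settings, the node count and the `--workers` value follow from simple arithmetic. A minimal sketch, assuming 8 GPUs per node as in this tutorial and one leader pod with all remaining pods as workers (the function names are hypothetical):

```python
import math


def nodes_required(tp: int, pp: int, gpus_per_node: int = 8) -> int:
    """Nodes needed to host tp * pp GPUs."""
    return math.ceil(tp * pp / gpus_per_node)


def workers_value(tp: int, pp: int, gpus_per_node: int = 8) -> int:
    """Worker count: every node except the leader."""
    return nodes_required(tp, pp, gpus_per_node) - 1


print(nodes_required(8, 2), workers_value(8, 2))  # 2 1
```

For the TP=8, PP=2 rows in the table this gives 2 nodes and 1 worker, matching NIM_NUM_COMPUTE_NODES=2 and `--workers 1` in Step 4.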

About the Author

I am Luca Berton, AI and Cloud Advisor. I design GPU inference platforms for enterprises deploying large language models. Book a consultation to discuss your distributed inference architecture.
