Deploy DeepSeek-R1 671B Distributed with Run:ai + NIM

This is a hands-on tutorial for deploying DeepSeek-R1 (671B parameters) as a distributed inference workload on NVIDIA Run:ai. Two nodes, 16 H100 GPUs, one API endpoint.

The tutorial follows the official Run:ai workflow: create access credentials, set up model caching, deploy the distributed workload, and test the endpoint. By the end, you will have a production-ready DeepSeek-R1 serving endpoint with authenticated access.

Prerequisites

Before starting, ensure:

- NVIDIA Run:ai platform is installed (self-hosted or SaaS)
- LeaderWorkerSet (LWS) controller is installed on the cluster
- 2+ nodes with 8× NVIDIA H100 80GB GPUs each
- InfiniBand networking between nodes (HDR minimum, NDR recommended)
- NGC account with an active API key
- A Run:ai project assigned to you by your administrator
- External access configured if you need to reach the endpoint from outside the cluster

Architecture

The deployment creates two pods forming a Leader-Worker Set:

Client Request
     │
     ▼
Load Balancer → NGINX Ingress
     │
     ▼
┌─────────────────────────────┐
│  Leader Pod (Node 0)        │
│  NIM_LEADER_ROLE=1          │
│  NIM_NODE_RANK=0            │
│  8× H100 (TP=8)             │
│  Pipeline stage 1 of 2      │
│  Port 8000 (API endpoint)   │
└──────────┬──────────────────┘
           │ NCCL / InfiniBand
┌──────────▼──────────────────┐
│  Worker Pod (Node 1)        │
│  NIM_LEADER_ROLE=0          │
│  NIM_NODE_RANK=1            │
│  8× H100 (TP=8)             │
│  Pipeline stage 2 of 2      │
└─────────────────────────────┘

Total: TP=8, PP=2 → 16 GPUs
Model Profile: sglang-h100-bf16-tp8-pp2
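
The GPU count follows directly from the two parallelism degrees. As a quick sanity check, here is a minimal sketch of that arithmetic. It uses the 80 GB per-GPU capacity from the prerequisites and the approximate 650 GB weight size quoted in this tutorial; the variable names are illustrative only.

```python
# Sanity-check the parallelism layout shown in the diagram.
TENSOR_PARALLEL = 8      # GPUs sharing each layer within a node
PIPELINE_PARALLEL = 2    # pipeline stages, one per node
GPU_MEMORY_GB = 80       # H100 80GB
WEIGHTS_GB = 650         # approximate size of the DeepSeek-R1 weights

total_gpus = TENSOR_PARALLEL * PIPELINE_PARALLEL
aggregate_memory_gb = total_gpus * GPU_MEMORY_GB
headroom_gb = aggregate_memory_gb - WEIGHTS_GB

print(f"GPUs: {total_gpus}")
print(f"Aggregate GPU memory: {aggregate_memory_gb} GB")
print(f"Left after weights: ~{headroom_gb} GB")
```

Whatever remains after the weights is shared by the KV cache, activations, and runtime overhead, so treat the last figure as a rough upper bound rather than usable cache space.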

Step 1: Create a User Access Key

Access keys authenticate API calls to the Run:ai platform.

In the Run:ai UI:

1. Click your user avatar (top right) → Settings
2. Click +ACCESS KEY
3. Enter a name (e.g., inference-deploy-key) and click CREATE
4. Copy the Client ID and Client Secret and store them securely

Get an API token:

# Replace <COMPANY_URL> with your Run:ai URL
# SaaS: <tenant-name>.run.ai
# Self-hosted: your Run:ai UI URL

curl -X POST 'https://<COMPANY_URL>/api/v1/token' \
  -H 'Content-Type: application/json' \
  -d '{
    "grantType": "client_credentials",
    "clientId": "<CLIENT_ID>",
    "clientSecret": "<CLIENT_SECRET>"
  }'

Save the returned token — you will use it in every subsequent API call.
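
If you prefer to script this step, the same request can be built in Python. The sketch below is one possible approach using only the standard library. The helper names are hypothetical, and the `accessToken` response field is an assumption: inspect the real response from your Run:ai version and adjust it if needed.

```python
import json
import urllib.request


def build_token_request(company_url: str, client_id: str,
                        client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Run:ai API token."""
    body = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret,
    }
    return urllib.request.Request(
        url=f"https://{company_url}/api/v1/token",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(response_body: str, field: str = "accessToken") -> str:
    """Pull the token out of the JSON response.

    The default field name is an assumption, not confirmed by this tutorial.
    """
    return json.loads(response_body)[field]


# Usage (requires network access and real credentials):
# req = build_token_request("<COMPANY_URL>", "<CLIENT_ID>", "<CLIENT_SECRET>")
# with urllib.request.urlopen(req) as resp:
#     token = extract_token(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the body before any credentials leave your machine.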

Step 2: Create an NGC Credential

Store your NGC API key as a Run:ai user credential so it can be injected into workload containers.

In the Run:ai UI:

1. Click your user avatar → Settings
2. Click +CREDENTIAL → select NGC API key
3. Enter a name (e.g., ngc-key)
4. Paste your NGC API key
5. Click CREATE CREDENTIAL

This credential will be referenced as <ngc-credential-name> in the workload configuration.

Step 3: Create a PVC Data Source

A Persistent Volume Claim caches downloaded model weights. DeepSeek-R1 is ~650 GB — you do not want to re-download it every time a pod restarts.

Via UI

1. Go to Workload Manager → Data Sources
2. Click +NEW DATA SOURCE → PVC
3. Set the scope (cluster or department level)
4. Enter the name: nim-model-cache
5. Select New PVC:
   - Storage class: your preferred class
   - Access mode: Read-write by many nodes (required for leader + worker to share)
   - Claim size: 2 TB (allows caching multiple models)
   - Volume mode: Filesystem
   - Container path: /opt/nim/.cache
6. Click CREATE DATA SOURCE

Via API

curl -L 'https://<COMPANY_URL>/api/v1/asset/datasource/pvc' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "meta": {
      "name": "nim-model-cache",
      "scope": "<scope>"
    },
    "spec": {
      "path": "/opt/nim/.cache",
      "existingPvc": false,
      "claimInfo": {
        "size": "2TB",
        "storageClass": "<your-storage-class>",
        "accessModes": {
          "readWriteMany": true
        },
        "volumeMode": "Filesystem"
      }
    }
  }'

Wait for the PVC to be provisioned. Note the claim name — you will need it in Step 4.
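
To see what "multiple models" means for a 2 TB claim, the following sketch works through the sizing arithmetic. The 10% slack reserved for temporary files is a hypothetical figure chosen for illustration, not a Run:ai or NIM requirement.

```python
# How many ~650 GB model caches fit in the claim, leaving some slack?
CLAIM_SIZE_GB = 2000       # 2 TB claim from this step
MODEL_SIZE_GB = 650        # approximate DeepSeek-R1 download size
SLACK_FRACTION = 0.10      # hypothetical reserve for temp files and metadata

usable_gb = CLAIM_SIZE_GB * (1 - SLACK_FRACTION)
models_that_fit = int(usable_gb // MODEL_SIZE_GB)

print(f"Usable: {usable_gb:.0f} GB, room for {models_that_fit} model(s) of this size")
```

If you only ever plan to serve DeepSeek-R1, a smaller claim would do. The larger size mainly buys room for a second model or a new image version that re-downloads weights.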

Step 4: Deploy the Distributed Inference Workload

This is the core step. The API call creates a Leader-Worker Set with the full NIM configuration.

Understanding the Environment Variables

Variable	Leader	Worker	Purpose
`NIM_LEADER_ROLE`	`1`	`0`	Identifies the leader vs worker
`NIM_MULTI_NODE`	`1`	`1`	Enables multi-node mode
`NIM_NODE_RANK`	`0` (auto)	`1` (auto)	Assigned from the LWS worker-index label
`NIM_TENSOR_PARALLEL_SIZE`	`8`	`8`	Shards each layer across the 8 GPUs in a node
`NIM_PIPELINE_PARALLEL_SIZE`	`2`	`2`	Splits the model into 2 pipeline stages, one per node
`NIM_NUM_COMPUTE_NODES`	`2`	`2`	Total nodes in the deployment
`NIM_MODEL_PROFILE`	`sglang-h100-bf16-tp8-pp2`	`sglang-h100-bf16-tp8-pp2`	Optimized configuration for H100
`NIM_USE_SGLANG`	`1`	`1`	Use SGLang inference runtime
`NIM_TRUST_CUSTOM_CODE`	`1`	`1`	Allow custom Python kernels
`NGC_API_KEY`	credential	credential	NGC authentication

NIM_NODE_RANK is populated automatically from the pod label leaderworkerset.sigs.k8s.io/worker-index, which LWS sets to 0 on the leader and 1 on the worker, so you do not hard-code it.
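
The relationship between the LWS worker index and the two role-specific variables can be sketched as follows. This only illustrates the mapping described in the table; `role_env` is a hypothetical helper, and in the real deployment Kubernetes injects the rank through the pod field reference.

```python
def role_env(worker_index: int) -> dict[str, str]:
    """Return the role-specific variables for a pod in the group.

    LeaderWorkerSet labels the leader pod with worker-index 0 and the
    workers with 1, 2, ... so the index doubles as the node rank.
    """
    if worker_index < 0:
        raise ValueError("worker_index must be non-negative")
    return {
        "NIM_LEADER_ROLE": "1" if worker_index == 0 else "0",
        "NIM_NODE_RANK": str(worker_index),
    }


for index in range(2):  # two pods: leader + one worker
    print(index, role_env(index))
```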

Via CLI

runai inference distributed submit deepseek-r1 \
  -p <project-name> \
  -i nvcr.io/nim/deepseek-ai/deepseek-r1:latest \
  --workers 1 \
  --serving-port "container=8000,authorization-type=authenticatedUsers" \
  -g 8 \
  --existing-pvc claimname=<pvc-claim-name>,path=/opt/nim/.cache \
  --env-secret NGC_API_KEY=<ngc-credential-name>,key=NGC_API_KEY \
  --environment NIM_NUM_COMPUTE_NODES=2 \
  --environment NIM_LEADER_ROLE=1 \
  --environment OMPI_MCA_orte_keep_fqdn_hostnames=1 \
  --environment "OMPI_MCA_plm_rsh_args=-o ConnectionAttempts=20" \
  --environment NIM_USE_SGLANG=1 \
  --environment NIM_MULTI_NODE=1 \
  --environment NIM_TENSOR_PARALLEL_SIZE=8 \
  --environment NIM_PIPELINE_PARALLEL_SIZE=2 \
  --environment NIM_TRUST_CUSTOM_CODE=1 \
  --environment NIM_MODEL_PROFILE=sglang-h100-bf16-tp8-pp2 \
  --env-pod-field-ref "NIM_NODE_RANK=metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"

Via REST API

curl -L 'https://<COMPANY_URL>/api/v1/workloads/distributed-inferences' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "deepseek-r1",
    "projectId": "<PROJECT_ID>",
    "clusterId": "<CLUSTER_UUID>",
    "spec": {
      "workers": 1,
      "servingPort": {
        "port": 8000,
        "authorizationType": "authenticatedUsers"
      },
      "leader": {
        "image": "nvcr.io/nim/deepseek-ai/deepseek-r1:latest",
        "environmentVariables": [
          {
            "name": "NGC_API_KEY",
            "userCredential": {
              "name": "<ngc-credential-name>",
              "key": "NGC_API_KEY"
            }
          },
          { "name": "NIM_LEADER_ROLE", "value": "1" },
          { "name": "OMPI_MCA_orte_keep_fqdn_hostnames", "value": "1" },
          { "name": "OMPI_MCA_plm_rsh_args", "value": "-o ConnectionAttempts=20" },
          { "name": "NIM_USE_SGLANG", "value": "1" },
          { "name": "NIM_MULTI_NODE", "value": "1" },
          { "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
          { "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
          { "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
          { "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
          {
            "name": "NIM_NODE_RANK",
            "podFieldRef": {
              "path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
            }
          },
          { "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
        ],
        "imagePullSecrets": [
          { "name": "<ngc-credential-name>", "userCredential": true }
        ],
        "storage": {
          "pvc": [{
            "path": "/opt/nim/.cache",
            "existingPvc": true,
            "claimName": "<pvc-claim-name>"
          }]
        },
        "compute": { "gpuDevicesRequest": 8 },
        "security": {
          "runAsUid": 1000,
          "runAsGid": 1000,
          "runAsNonRoot": true
        }
      },
      "worker": {
        "image": "nvcr.io/nim/deepseek-ai/deepseek-r1:latest",
        "environmentVariables": [
          {
            "name": "NGC_API_KEY",
            "userCredential": {
              "name": "<ngc-credential-name>",
              "key": "NGC_API_KEY"
            }
          },
          { "name": "NIM_LEADER_ROLE", "value": "0" },
          { "name": "NIM_USE_SGLANG", "value": "1" },
          { "name": "NIM_MULTI_NODE", "value": "1" },
          { "name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8" },
          { "name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2" },
          { "name": "NIM_TRUST_CUSTOM_CODE", "value": "1" },
          { "name": "NIM_MODEL_PROFILE", "value": "sglang-h100-bf16-tp8-pp2" },
          {
            "name": "NIM_NODE_RANK",
            "podFieldRef": {
              "path": "metadata.labels['"'"'leaderworkerset.sigs.k8s.io/worker-index'"'"']"
            }
          },
          { "name": "NIM_NUM_COMPUTE_NODES", "value": "2" }
        ],
        "imagePullSecrets": [
          { "name": "<ngc-credential-name>", "userCredential": true }
        ],
        "storage": {
          "pvc": [{
            "path": "/opt/nim/.cache",
            "existingPvc": true,
            "claimName": "<pvc-claim-name>"
          }]
        },
        "compute": { "gpuDevicesRequest": 8 },
        "security": {
          "runAsUid": 1000,
          "runAsGid": 1000,
          "runAsNonRoot": true
        }
      }
    }
  }'
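
The leader and worker blocks in the request differ only in NIM_LEADER_ROLE and the two OMPI_MCA_* variables. If you generate the body programmatically, building both blocks from one template reduces copy-paste drift. The sketch below is one way to do that. The function names are hypothetical, and the field names simply mirror the JSON in this section.

```python
import json

SHARED_ENV = {
    "NIM_USE_SGLANG": "1",
    "NIM_MULTI_NODE": "1",
    "NIM_TENSOR_PARALLEL_SIZE": "8",
    "NIM_PIPELINE_PARALLEL_SIZE": "2",
    "NIM_TRUST_CUSTOM_CODE": "1",
    "NIM_MODEL_PROFILE": "sglang-h100-bf16-tp8-pp2",
    "NIM_NUM_COMPUTE_NODES": "2",
}
LEADER_ONLY_ENV = {
    "OMPI_MCA_orte_keep_fqdn_hostnames": "1",
    "OMPI_MCA_plm_rsh_args": "-o ConnectionAttempts=20",
}
RANK_LABEL = "metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']"


def role_spec(is_leader: bool, image: str, credential: str, claim: str) -> dict:
    """Build the leader or worker block of the request body."""
    env = {"NIM_LEADER_ROLE": "1" if is_leader else "0", **SHARED_ENV}
    if is_leader:
        env.update(LEADER_ONLY_ENV)
    variables = [{"name": "NGC_API_KEY",
                  "userCredential": {"name": credential, "key": "NGC_API_KEY"}}]
    variables += [{"name": k, "value": v} for k, v in env.items()]
    variables.append({"name": "NIM_NODE_RANK", "podFieldRef": {"path": RANK_LABEL}})
    return {
        "image": image,
        "environmentVariables": variables,
        "imagePullSecrets": [{"name": credential, "userCredential": True}],
        "storage": {"pvc": [{"path": "/opt/nim/.cache",
                             "existingPvc": True, "claimName": claim}]},
        "compute": {"gpuDevicesRequest": 8},
        "security": {"runAsUid": 1000, "runAsGid": 1000, "runAsNonRoot": True},
    }


def build_payload(project_id: str, cluster_id: str,
                  credential: str, claim: str) -> dict:
    """Assemble the full request body for the distributed inference workload."""
    image = "nvcr.io/nim/deepseek-ai/deepseek-r1:latest"
    return {
        "name": "deepseek-r1",
        "projectId": project_id,
        "clusterId": cluster_id,
        "spec": {
            "workers": 1,
            "servingPort": {"port": 8000, "authorizationType": "authenticatedUsers"},
            "leader": role_spec(True, image, credential, claim),
            "worker": role_spec(False, image, credential, claim),
        },
    }


payload = build_payload("<PROJECT_ID>", "<CLUSTER_UUID>",
                        "<ngc-credential-name>", "<pvc-claim-name>")
print(json.dumps(payload, indent=2))
```

Writing the output to a file and passing it to curl with `-d @payload.json` also sidesteps the shell quoting needed around the label path.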

What Happens After Submission

1. Run:ai creates a LeaderWorkerSet with 2 pods (one leader, one worker)
2. The scheduler places the pods on nodes with 8 free H100 GPUs each, preferring same-rack placement
3. The PVC is mounted at /opt/nim/.cache on both pods
4. First run: NIM downloads the DeepSeek-R1 weights from NGC (~650 GB). Expect 15-30 minutes at a sustained 3-6 Gbit/s, and longer on slower links
5. Subsequent runs: the model loads from the PVC cache, with startup in 5-10 minutes
6. The leader establishes NCCL communication with the worker over InfiniBand
7. The SGLang runtime initializes with the TP=8, PP=2 configuration
8. The leader pod starts the OpenAI-compatible API server on port 8000
9. Run:ai configures ingress routing to the leader pod
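
The first-run download time is mostly a function of sustained bandwidth to NGC. A rough estimate, ignoring protocol overhead and storage write speed, can be worked out as follows (`download_minutes` is an illustrative helper):

```python
def download_minutes(size_gb: float, throughput_gbit_s: float) -> float:
    """Estimate transfer time for size_gb gigabytes at a sustained rate in Gbit/s."""
    if throughput_gbit_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_gb * 8 / throughput_gbit_s   # GB -> Gbit, then divide by rate
    return seconds / 60


for rate in (1, 3, 6, 10):
    print(f"{rate:>2} Gbit/s -> {download_minutes(650, rate):5.1f} min")
```

On a 1 Gbit/s link the same download takes well over an hour, which is worth knowing before you set startup timeouts.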

Step 5: Test the Endpoint

Get the Endpoint URL

The inference endpoint is available through Run:ai’s ingress. The URL format depends on your setup:

Internal: http://deepseek-r1.runai-<project>.svc.cluster.local:8000
External: https://<inference-endpoint>.<company-url>

Check the workload status in the Run:ai UI or via API to get the exact endpoint URL.
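
Because startup can take many minutes, it helps to poll for readiness before sending real traffic. The sketch below assumes the NIM container exposes a readiness route at /v1/health/ready and that the same bearer token is accepted for it. Both are assumptions to verify against your deployment. The helper names are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, token: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Usage (needs a reachable endpoint and a valid token):
# ready = wait_until_ready(
#     lambda: http_probe("https://<ENDPOINT_URL>/v1/health/ready", "<RUNAI_TOKEN>"))
```

The defaults (60 attempts, 30 seconds apart) give roughly a 30-minute window, matching the upper end of the first-run download estimate.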

Send a Request

Since we configured authenticatedUsers access, include the Run:ai bearer token:

curl -X POST 'https://<ENDPOINT_URL>/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <RUNAI_TOKEN>' \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [
      {
        "role": "user",
        "content": "Explain the difference between tensor parallelism and pipeline parallelism for LLM inference."
      }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

The response follows the OpenAI chat completions format:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "deepseek-ai/deepseek-r1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Tensor parallelism splits individual layers across GPUs..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 256,
    "total_tokens": 280
  }
}
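
When calling the endpoint from code, you typically need only a few fields from this response. The following sketch parses the sample above. `summarize` is an illustrative helper, and the check for a "length" finish reason follows the OpenAI chat completions convention for replies cut off by max_tokens.

```python
import json

SAMPLE = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "deepseek-ai/deepseek-r1",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "Tensor parallelism splits individual layers across GPUs..."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 256, "total_tokens": 280}
}
"""


def summarize(response_text: str) -> dict:
    """Extract the reply text, truncation flag, and token usage."""
    data = json.loads(response_text)
    choice = data["choices"][0]
    return {
        "text": choice["message"]["content"],
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": data["usage"]["total_tokens"],
    }


result = summarize(SAMPLE)
print(result["text"])
print("truncated:", result["truncated"], "| tokens:", result["total_tokens"])
```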

Streaming Responses

curl -X POST 'https://<ENDPOINT_URL>/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <RUNAI_TOKEN>' \
  -d '{
    "model": "deepseek-ai/deepseek-r1",
    "messages": [{"role": "user", "content": "Write a haiku about distributed computing"}],
    "max_tokens": 64,
    "stream": true
  }'
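
With "stream": true, the server sends the reply as server-sent events, one JSON chunk per `data:` line, ending with `data: [DONE]`. A minimal sketch of reassembling the text, assuming the OpenAI-style chunk layout, is shown below. The sample chunks are hypothetical and shortened.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the content deltas from an OpenAI-style SSE stream."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
    return "".join(parts)


# Hypothetical chunks, shortened for illustration.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"index": 0, "delta": {"content": "Nodes "}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "whisper"}}]}',
    'data: [DONE]',
]
print(collect_stream(sample))
```

In a real client you would feed this function the decoded lines of the HTTP response as they arrive instead of a prepared list.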

OpenShift Considerations

If running on OpenShift, the security context is mandatory:

"security": {
  "runAsUid": 1000,
  "runAsGid": 1000,
  "runAsNonRoot": true
}

Without this, pods fail during model download because the PVC mount at /opt/nim/.cache defaults to root ownership, and OpenShift’s restricted SCC blocks root execution.
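
A small check before submission can catch a role that is missing the block. The sketch below is illustrative only: `roles_missing_security` is a hypothetical helper, and it treats the UID and GID values used in this tutorial as the expected ones.

```python
REQUIRED = {"runAsUid": 1000, "runAsGid": 1000, "runAsNonRoot": True}


def roles_missing_security(spec: dict) -> list[str]:
    """Return the roles whose security block differs from REQUIRED."""
    return [role for role in ("leader", "worker")
            if spec.get(role, {}).get("security") != REQUIRED]


spec = {
    "leader": {"security": {"runAsUid": 1000, "runAsGid": 1000, "runAsNonRoot": True}},
    "worker": {},  # security block forgotten
}
print(roles_missing_security(spec))
```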

Troubleshooting

Pods Stuck in Pending

# Check if GPUs are available
runai list nodes --gpu

# Check if LWS controller is running
kubectl get pods -n lws-system

Common causes:

- Not enough H100 nodes with 8 free GPUs
- LWS controller not installed
- Scheduling conflict with training workloads

NCCL Communication Failures

# Check pod logs for NCCL errors
runai logs deepseek-r1 --leader
runai logs deepseek-r1 --worker 0

Common causes:

- InfiniBand not configured or unavailable
- Nodes in different network segments
- Missing RDMA device plugin

Model Download Timeout

The first download of DeepSeek-R1 (~650 GB) can take 30 minutes or more on slower links. If it times out:

- Increase the pod startup timeout in Run:ai settings
- Pre-populate the PVC by running a data loading job first
- Verify that the NGC API key has access to the DeepSeek-R1 container

Worker Cannot Connect to Leader

Setting OMPI_MCA_plm_rsh_args=-o ConnectionAttempts=20 on the leader passes that option to the SSH launcher used by Open MPI, so the connection between leader and worker is retried up to 20 times while the worker pod is still starting. If it still fails:

- Verify OMPI_MCA_orte_keep_fqdn_hostnames=1 is set on the leader
- Check that both pods are running in the same Kubernetes namespace
- Verify no NetworkPolicy is blocking inter-pod traffic

Adapting for Other Models

This tutorial uses DeepSeek-R1, but the pattern works for any model that supports NIM multinode:

Model	Image	Profile	TP	PP	Nodes
DeepSeek-R1	`deepseek-ai/deepseek-r1`	`sglang-h100-bf16-tp8-pp2`	8	2	2
Llama 3.1 405B	`meta/llama-3.1-405b-instruct`	Check NIM docs	8	2	2
Nemotron-4 340B	`nvidia/nemotron-4-340b-instruct`	Check NIM docs	8	2	2

Update the image, model profile, and parallelism settings for your target model. Check the NIM support matrix for hardware-specific profiles.
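
When changing the parallelism settings, the node count has to stay consistent with them. The sketch below derives it from TP and PP, assuming one pod per node and 8 GPUs per node as in this tutorial. `parallelism_env` is a hypothetical helper.

```python
def parallelism_env(tp: int, pp: int, gpus_per_node: int = 8) -> dict[str, str]:
    """Derive the node count and NIM parallelism variables from TP and PP."""
    total_gpus = tp * pp
    if total_gpus % gpus_per_node != 0:
        raise ValueError("TP x PP must be a multiple of the GPUs per node")
    return {
        "NIM_TENSOR_PARALLEL_SIZE": str(tp),
        "NIM_PIPELINE_PARALLEL_SIZE": str(pp),
        "NIM_NUM_COMPUTE_NODES": str(total_gpus // gpus_per_node),
    }


print(parallelism_env(tp=8, pp=2))
```

The workers value in the submission is then the node count minus one, since the leader occupies the first node.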

About the Author

I am Luca Berton, AI and Cloud Advisor. I design GPU inference platforms for enterprises deploying large language models. Book a consultation to discuss your distributed inference architecture.
