Training RetinaNet with DDP on Run:ai: Multi-Node Object

While LLM fine-tuning dominates the conversation, computer vision training workloads remain critical in enterprise AI. Object detection models like RetinaNet power manufacturing inspection, autonomous systems, and medical imaging. Here’s how to train RetinaNet at scale using PyTorch DDP on Run:ai with OpenShift.

Architecture: DDP vs FSDP for Vision Models

For computer vision models (typically 30-100M parameters), you don’t need FSDP’s weight sharding — the model fits comfortably on a single GPU. Instead, Distributed Data Parallel (DDP) replicates the full model on each GPU and synchronizes gradients:

┌────────────────────┐     ┌────────────────────┐
│  Node 0            │     │  Node 1            │
│  Full model copy   │     │  Full model copy   │
│  Batch shard A     │────▶│  Batch shard B     │
│  Compute gradients │◀────│  Compute gradients │
│                    │     │                    │
│  AllReduce grads   │     │  AllReduce grads   │
│  Update weights    │     │  Update weights    │
└────────────────────┘     └────────────────────┘
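The replicate-and-average cycle in the diagram can be tried without any GPU. What follows is a toy simulation in plain Python, not PyTorch's implementation: the one-parameter model, the data, and every function name are made up for illustration. Each "replica" starts from the same weight, computes a gradient on its own shard, and applies the averaged gradient, so the copies never drift apart.

```python
# Toy simulation of DDP: each replica holds a full copy of the weight,
# computes a gradient on its own batch shard, then all replicas average
# their gradients (the effect of AllReduce) and apply the same update.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for AllReduce: every replica receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def ddp_step(weights, shards, lr):
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]

# Hypothetical data following y = 3x, split into two shards (one per node).
batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [batch[0::2], batch[1::2]]
weights = [0.0, 0.0]  # identical initial copy on each replica

for _ in range(50):
    weights = ddp_step(weights, shards, lr=0.05)

print(weights)  # both replicas hold the same value, close to 3.0
```

Because every replica applies the same averaged gradient to the same starting weight, only gradients need to cross the network.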

Aspect	DDP (Vision)	FSDP (LLMs)
Model size	30-100M params	7B-405B params
Memory per GPU	Full model (~200MB)	Sharded (fraction)
Communication	Gradient AllReduce only	Params + Grads + Optimizer
Use case	ResNet, RetinaNet, YOLO	Llama, Mistral, GPT
Complexity	Low	High
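The memory row of the table can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: FP32 values, activations ignored, one optimizer slot for SGD momentum and two for an Adam-style optimizer, and FSDP sharding everything evenly across GPUs. The function name and the parameter counts passed in are illustrative.

```python
def training_state_mb(num_params, world_size, sharded,
                      bytes_per_value=4, optimizer_slots=1):
    """Rough per-GPU memory (MB) for weights + gradients + optimizer state.

    Activations, buffers, and framework overhead are not included.
    """
    copies = 2 + optimizer_slots  # weights, gradients, optimizer slots
    total_bytes = num_params * bytes_per_value * copies
    if sharded:
        total_bytes /= world_size  # assumes an even split across GPUs
    return total_bytes / 1e6

# ~34M-parameter detector under DDP: every GPU holds everything.
ddp_mb = training_state_mb(34_000_000, world_size=2, sharded=False)

# 7B-parameter LLM under full sharding with an Adam-style optimizer.
fsdp_mb = training_state_mb(7_000_000_000, world_size=2, sharded=True,
                            optimizer_slots=2)

print(ddp_mb)   # 408.0
print(fsdp_mb)  # 56000.0
```

Even with gradients and momentum included, the vision model's training state stays well under 1 GB per GPU, which is why replication is affordable.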

Open Images Dataset V7

Open Images V7 is one of the largest annotated image datasets available:

9 million images with image-level labels
16 million bounding boxes across 600 categories
2.5 million instance segmentations
Train/Validation/Test splits

Downloading the Dataset

# Download the official downloader script
wget https://raw.githubusercontent.com/openimages/dataset/master/downloader.py

# Create a text file with image IDs to download
# Format: $SPLIT/$IMAGE_ID
# Example:
# train/f9e0434389a1d4dd
# train/1a007563ebc18664
# test/ea8bfd4e765304db

# Download with parallel processes
python downloader.py $IMAGE_LIST_FILE \
  --download_folder=$DOWNLOAD_FOLDER \
  --num_processes=5
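The image list file consumed by the downloader can be generated rather than written by hand. This is a minimal sketch: the helper names are hypothetical, the image IDs are the ones from the example above, and the lower-case split names `train`, `validation`, and `test` are assumed to match what the downloader accepts.

```python
from pathlib import Path

VALID_SPLITS = {"train", "validation", "test"}  # assumed split names

def format_entries(ids_by_split):
    """Return one '$SPLIT/$IMAGE_ID' line per image."""
    lines = []
    for split, image_ids in ids_by_split.items():
        if split not in VALID_SPLITS:
            raise ValueError(f"unknown split: {split!r}")
        lines.extend(f"{split}/{image_id}" for image_id in image_ids)
    return lines

def write_image_list(path, ids_by_split):
    """Write the list file and return the number of entries written."""
    lines = format_entries(ids_by_split)
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)

if __name__ == "__main__":
    count = write_image_list(
        "image_list.txt",
        {
            "train": ["f9e0434389a1d4dd", "1a007563ebc18664"],
            "test": ["ea8bfd4e765304db"],
        },
    )
    print(f"wrote {count} entries")
```

The resulting `image_list.txt` is what `$IMAGE_LIST_FILE` would point to in the download command.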

For enterprise environments, pre-download the dataset to a shared PVC (Persistent Volume Claim) accessible by all training nodes.

Run:ai Job Submission

The training job submission uses Run:ai’s PyTorch distributed training support:

#!/bin/bash
set -euo pipefail

export MSYS_NO_PATHCONV=1
export MSYS2_ARG_CONV_EXCL="*"

IMAGE="nvcr.io/nvidia/pytorch:26.02-py3"
JOB_NAME="retinanet-cpu-ddp"

runai training pytorch submit "${JOB_NAME}" \
  --image "${IMAGE}" \
  --annotation "k8s.v1.cni.cncf.io/networks=ssa" \
  --extended-resource "openshift.io/mellanoxnics=1" \
  --large-shm \
  --workers 2 \
  --gpu-devices-request 1 \
  --cpu-memory-request 3846 \
  --cpu-memory-limit 8996 \
  --run-as-uid 2000 \
  --run-as-gid 2000 \
  --working-dir /data/scripts/ia-gen-bench/llm/finetune-peft \
  --environment-variable PYTORCH_ALLOC_CONF=expandable_segments:True \
  --environment-variable NCCL_DEBUG="INFO" \
  --environment-variable NCCL_IB_DISABLE=1 \
  --environment-variable NCCL_SOCKET_NTHREADS=2 \
  --environment-variable NCCL_NSOCKS_PERTHREAD=2 \
  --environment-variable NCCL_SOCKET_IFNAME="net1" \
  --environment-variable CUDA_VISIBLE_DEVICES=0 \
  --existing-pvc claimname=project-001,path=/data \
  --command -- /data/scripts/train-retinanet-ddp.sh

Key Differences from LLM Training

NCCL_IB_DISABLE=1 — For DDP with smaller vision models, gradient AllReduce is lightweight enough that InfiniBand RDMA isn’t needed. TCP over the secondary network interface suffices, simplifying the setup.

No FSDP configuration — The model is small enough to fit entirely on each GPU. DDP only synchronizes gradients (a few hundred MB) rather than sharding the entire model.

Same infrastructure — Despite the simpler training approach, the job still runs on the same OpenShift cluster with Run:ai scheduling, shared storage, and network policies.
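The claim that gradient sync is light enough for TCP can be sanity-checked with simple arithmetic. The sketch below assumes a ring AllReduce, where each rank sends `2 * (N - 1) / N` times the payload, FP32 gradients for a model of about 34M parameters, and a hypothetical 10 Gbit/s secondary interface. NCCL may pick a different algorithm, and real throughput over sockets is usually below line rate, so treat the result as a lower bound.

```python
def ring_allreduce_bytes_per_rank(payload_bytes, world_size):
    """Bytes each rank sends in a ring AllReduce: 2 * (N - 1) / N * payload."""
    return 2 * (world_size - 1) / world_size * payload_bytes

def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal transfer time at line rate, ignoring latency and protocol overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)

grad_bytes = 34_000_000 * 4  # ~34M FP32 gradients, 136 MB
sent = ring_allreduce_bytes_per_rank(grad_bytes, world_size=2)

print(sent / 1e6)                  # 136.0 (MB sent per rank per step)
print(transfer_seconds(sent, 10))  # roughly 0.11 s at an assumed 10 Gbit/s
```

If a training step takes noticeably longer than this transfer, and DDP overlaps communication with the backward pass, the network is unlikely to be the bottleneck for a two-worker job.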

The Training Script: Interactive Menu

Enterprise training workflows often include management scripts with interactive menus for handling the job lifecycle:

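#!/bin/bash
set -euo pipefail

# Job settings, matching the submission script above
IMAGE="nvcr.io/nvidia/pytorch:26.02-py3"
JOB_NAME="retinanet-cpu-ddp"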

# Color definitions for terminal output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
CYAN='\033[0;36m'
BOLD='\033[1m'
NC='\033[0m'

print_header() {
    clear
    echo -e "${BOLD}${BLUE}═══════════════════════════════════════${NC}"
    echo -e "${BOLD}${BLUE}    🎯 Model Training Management       ${NC}"
    echo -e "${BOLD}${BLUE}═══════════════════════════════════════${NC}"
    echo ""
}

print_status() {
    echo -e "${CYAN}⏳ Checking job status '${JOB_NAME}'...${NC}"
    # Query the PyTorch workload type, the same type used by `submit` below
    status=$(runai training pytorch describe "${JOB_NAME}" 2>/dev/null \
      | grep -i "Status" || echo "Job not submitted")
    echo -e " Job : ${BOLD}${JOB_NAME}${NC}"
    echo -e " Status: ${YELLOW}${status}${NC}"
    echo ""
}

action_submit() {
    echo -e "${GREEN}▶ Submitting job...${NC}"
    echo ""

    runai training pytorch submit "${JOB_NAME}" \
      --image "${IMAGE}" \
      --annotation "k8s.v1.cni.cncf.io/networks=ssa" \
      --extended-resource "openshift.io/mellanoxnics=1" \
      --large-shm \
      --workers 2 \
      --gpu-devices-request 1 \
      --cpu-memory-request 3846 \
      --cpu-memory-limit 8996 \
      --run-as-uid 2000 \
      --run-as-gid 2000 \
      --working-dir /data/scripts/vision/train \
      --environment-variable PYTORCH_ALLOC_CONF=expandable_segments:True \
      --environment-variable NCCL_DEBUG="INFO" \
      --environment-variable NCCL_IB_DISABLE=1 \
      --existing-pvc claimname=project-001,path=/data \
      --command -- /data/scripts/vision/shell/train-retinanet-ddp.sh
}

# Interactive menu: submit, status, logs, delete, exec into pod
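The menu itself is only hinted at by the comment above. One way to structure the dispatch is a table from menu choice to CLI arguments, sketched here in Python. The `logs` and `delete` subcommands are assumptions made by analogy with the `submit` and `describe` calls in the script, so check them against `runai training pytorch --help` before relying on them. The function only builds the command line and does not execute anything.

```python
import shlex

JOB_NAME = "retinanet-cpu-ddp"  # job name used in the submission script

# Menu choice -> (label, subcommand). Subcommand names other than
# `describe` are assumptions, not confirmed by the article.
ACTIONS = {
    "1": ("status", "describe"),
    "2": ("logs", "logs"),
    "3": ("delete", "delete"),
}

def build_command(choice, job_name=JOB_NAME):
    """Return the argv list for a menu choice, or raise on an unknown choice."""
    if choice not in ACTIONS:
        raise KeyError(f"unknown menu choice: {choice!r}")
    _, subcommand = ACTIONS[choice]
    return ["runai", "training", "pytorch", subcommand, job_name]

def render(choice, job_name=JOB_NAME):
    """Return the command as a shell-safe string for display or logging."""
    return shlex.join(build_command(choice, job_name))

if __name__ == "__main__":
    for key, (label, _) in ACTIONS.items():
        print(f"{key}) {label:<7} {render(key)}")
```

Keeping the mapping in one table means a new menu entry is a one-line change, and the argv list can be handed to `subprocess.run` once the subcommands are verified.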

DDP Training Configuration

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import retinanet_resnet50_fpn_v2

# train_dataset, collate_fn, train_one_epoch and num_epochs are assumed to be
# defined elsewhere in the training script.

def setup_ddp():
    # Rank, world size and master address are read from the environment
    # variables set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank

def main():
    local_rank = setup_ddp()

    # RetinaNet with ResNet-50 FPN backbone
    model = retinanet_resnet50_fpn_v2(
        num_classes=601,  # 600 Open Images box categories + background
        weights_backbone=ResNet50_Weights.IMAGENET1K_V1,  # ImageNet-pretrained backbone
    )
    model = model.to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Optimizer
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.02,
        momentum=0.9,
        weight_decay=1e-4,
    )

    # Cosine learning-rate decay (no warmup phase is configured here)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=num_epochs,
    )

    # Distributed sampler ensures each GPU gets different data
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=4,  # per GPU
        sampler=train_sampler,
        num_workers=4,
        pin_memory=True,
        collate_fn=collate_fn,
    )

    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # Shuffle differently each epoch
        train_one_epoch(model, optimizer, train_loader, local_rank)
        scheduler.step()

        if dist.get_rank() == 0:
            # Only rank 0 saves checkpoints; model.module is the unwrapped model
            torch.save(model.module.state_dict(), f"retinanet_epoch_{epoch}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
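The script above calls `collate_fn` and `train_one_epoch` without showing them. The following is a minimal sketch of what they might look like, not the article's actual helpers. It assumes the dataset yields `(image, target)` pairs where `target` is a dict of tensors, and relies on torchvision detection models returning a dict of losses in training mode. The functions are written against duck-typed objects, so the block itself has no imports.

```python
def collate_fn(batch):
    """Group a batch into (images, targets) tuples without stacking.

    Detection images differ in size and number of boxes, so the default
    stacking collate would fail.
    """
    return tuple(zip(*batch))

def train_one_epoch(model, optimizer, loader, device):
    """Run one pass over the loader, stepping the optimizer once per batch."""
    model.train()
    for images, targets in loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # In training mode the model returns a dict of losses
        # (classification and box regression for RetinaNet).
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())

        optimizer.zero_grad()
        loss.backward()  # with a DDP-wrapped model, gradients are averaged across ranks here
        optimizer.step()
```

Production loops usually add logging, gradient clipping, and mixed precision. They are left out here to keep the DDP-relevant steps visible.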

RetinaNet Architecture Recap

RetinaNet remains competitive in 2026 for its balance of speed and accuracy:

Backbone: ResNet-50 with Feature Pyramid Network (FPN)
Head: Classification + regression subnets on each FPN level
Loss: Focal Loss (handles class imbalance in dense detection)
Anchors: Multi-scale, multi-aspect-ratio anchor boxes
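To make the Focal Loss entry concrete, here is a scalar sketch of the binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), using alpha = 0.25 and gamma = 2 as in the original RetinaNet paper. It handles a single anchor/class pair in plain Python for illustration. A real training loop would use a vectorized tensor implementation, and the example probabilities are made up.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for one prediction p of label y (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one anchor/class pair.

    p is the predicted probability of the class, y is 1 for a positive
    anchor and 0 for a negative (background) anchor.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_negative = (0.01, 0)  # background anchor, confidently classified
hard_positive = (0.10, 1)  # object anchor the model is getting wrong

for name, (p, y) in [("easy negative", easy_negative),
                     ("hard positive", hard_positive)]:
    print(name, cross_entropy(p, y), focal_loss(p, y))
```

The easy negative's loss shrinks by several orders of magnitude relative to cross-entropy, while the hard positive keeps most of its weight. That is how the loss stops the huge number of background anchors from dominating the gradient.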

Input Image (800×800)
       │
   ResNet-50 Backbone
       │
   ┌───┼───┬───┬───┐
   P3  P4  P5  P6  P7   ← FPN levels (different scales)
   │   │   │   │   │
   ├───┴───┴───┴───┤
   │  Classification │  → per-anchor class probability
   │  subnet (4×conv)│
   ├─────────────────┤
   │  Regression     │  → per-anchor box offsets
   │  subnet (4×conv)│
   └─────────────────┘
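The scale of the dense prediction problem in the diagram can be estimated by counting anchors. This sketch assumes the RetinaNet paper defaults: strides of 8 to 128 for P3 through P7 and 9 anchors per location (3 scales times 3 aspect ratios). It also assumes each feature map side is the input side divided by the stride, rounded up. Actual sizes depend on padding and on any resizing done before the backbone.

```python
import math

LEVELS = ("P3", "P4", "P5", "P6", "P7")

def anchors_per_level(image_size, strides=(8, 16, 32, 64, 128),
                      anchors_per_cell=9):
    """Count anchors on each FPN level for a square input image."""
    counts = {}
    for level, stride in zip(LEVELS, strides):
        side = math.ceil(image_size / stride)  # feature map side length
        counts[level] = side * side * anchors_per_cell
    return counts

counts = anchors_per_level(800)
print(counts)
# {'P3': 90000, 'P4': 22500, 'P5': 5625, 'P6': 1521, 'P7': 441}
print(sum(counts.values()))  # 120087
```

Roughly 120,000 anchors per image, almost all of them background, is the class imbalance that Focal Loss is designed to handle.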

Comparison: Vision vs LLM Training on Run:ai

Aspect	RetinaNet (DDP)	Mistral 4 (FSDP)
Model size	~34M params	~12B params
GPU memory	~4 GB	~80 GB per shard
Communication	Gradient sync only	Full model sharding
InfiniBand needed	No (`NCCL_IB_DISABLE=1`)	Yes (critical)
Training time	Hours	Days
Batch size per GPU	4-16 images	1 sequence
Data format	Images + boxes	Text tokens
Framework	torchvision	trl + peft + accelerate

Both workloads share the same Run:ai cluster, same GPU nodes, same PVC storage — but require fundamentally different distributed training strategies.

DDP for vision, FSDP for LLMs. Same cluster, same scheduler, different parallelism strategies. The key is matching the distributed approach to the model’s memory footprint.
