Training RetinaNet with DDP on Run:ai: Multi-Node Object Detection

A production guide to training RetinaNet object detection models using PyTorch DDP on Run:ai with Open Images Dataset V7. Covers multi-node orchestration.

Luca Berton

While LLM fine-tuning dominates the conversation, computer vision training workloads remain critical in enterprise AI. Object detection models like RetinaNet power manufacturing inspection, autonomous systems, and medical imaging. Here's how to train RetinaNet at scale using PyTorch DDP on Run:ai with OpenShift.

Architecture: DDP vs FSDP for Vision Models

For computer vision models (typically 30-100M parameters), you don't need FSDP's weight sharding, because the model fits comfortably on a single GPU. Instead, Distributed Data Parallel (DDP) replicates the full model on each GPU and synchronizes gradients:

┌────────────────────┐     ┌────────────────────┐
│  Node 0            │     │  Node 1            │
│  Full model copy   │     │  Full model copy   │
│  Batch shard A     │────▶│  Batch shard B     │
│  Compute gradients │◀────│  Compute gradients │
│                    │     │                    │
│  AllReduce grads   │     │  AllReduce grads   │
│  Update weights    │     │  Update weights    │
└────────────────────┘     └────────────────────┘

Aspect | DDP (Vision) | FSDP (LLMs)
Model size | 30-100M params | 7B-405B params
Memory per GPU | Full model (~200 MB) | Sharded (fraction)
Communication | Gradient AllReduce only | Params + Grads + Optimizer
Use case | ResNet, RetinaNet, YOLO | Llama, Mistral, GPT
Complexity | Low | High
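
To make the gradient synchronization step concrete, here is a minimal pure-Python sketch (no GPUs or PyTorch involved). It assumes a toy one-parameter model `y = w * x` with a mean-squared-error loss; the helper names `grad_mse` and `all_reduce_mean` are illustrative and not part of any library. It shows that averaging the per-replica gradients over equal-sized shards gives the same gradient as processing the whole batch on one device, which is why every replica can apply an identical weight update.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Stand-in for AllReduce: every replica ends up with the average."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


# One global batch, split into two equal shards (one per replica).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5  # identical starting weight on every replica

shard_a = (xs[:2], ys[:2])
shard_b = (xs[2:], ys[2:])

# Each replica computes a gradient on its own shard only.
local_grads = [grad_mse(w, *shard_a), grad_mse(w, *shard_b)]

# After synchronization both replicas hold the same averaged gradient.
synced_grads = all_reduce_mean(local_grads)

# Reference: the gradient a single device would compute on the full batch.
full_batch_grad = grad_mse(w, xs, ys)
```

The equality only holds exactly when the shards are the same size, which is one reason distributed samplers pad or trim the dataset so that each rank sees the same number of samples.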

Open Images Dataset V7

Open Images V7 is one of the largest annotated image datasets available:

  • 9 million images with image-level labels
  • 16 million bounding boxes across 600 categories
  • 2.8 million instance segmentations across 350 categories
  • Train/Validation/Test splits

Downloading the Dataset

# Download the official downloader script
wget https://raw.githubusercontent.com/openimages/dataset/master/downloader.py

# Create a text file with image IDs to download
# Format: $SPLIT/$IMAGE_ID
# Example:
# train/f9e0434389a1d4dd
# train/1a007563ebc18664
# test/ea8bfd4e765304db

# Download with parallel processes
python downloader.py $IMAGE_LIST_FILE \
  --download_folder=$DOWNLOAD_FOLDER \
  --num_processes=5

For enterprise environments, pre-download the dataset to a shared PVC (Persistent Volume Claim) accessible by all training nodes.
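
The image list passed to the downloader uses one `$SPLIT/$IMAGE_ID` entry per line, as shown in the comments above. A small helper along these lines could generate that file from an existing selection of IDs. This is a sketch: `format_image_list` and the `VALID_SPLITS` set are hypothetical names introduced here, and the list of accepted splits is an assumption based on the Train/Validation/Test splits mentioned earlier.

```python
# Assumed split names; adjust if your copy of the downloader accepts others.
VALID_SPLITS = {"train", "validation", "test"}


def format_image_list(entries):
    """Turn (split, image_id) pairs into "$SPLIT/$IMAGE_ID" lines."""
    lines = []
    for split, image_id in entries:
        if split not in VALID_SPLITS:
            raise ValueError(f"unknown split: {split!r}")
        lines.append(f"{split}/{image_id}")
    # One entry per line, with a trailing newline at the end of the file.
    return "\n".join(lines) + "\n"


# Example using the IDs from the comments above.
image_list = format_image_list([
    ("train", "f9e0434389a1d4dd"),
    ("train", "1a007563ebc18664"),
    ("test", "ea8bfd4e765304db"),
])
```

Writing `image_list` to a file on the shared PVC (for example with `pathlib.Path(...).write_text(image_list)`) gives a list that can be passed as `$IMAGE_LIST_FILE`.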

Run:ai Job Submission

The training job submission uses Run:ai's PyTorch distributed training support:

#!/bin/bash
set -euo pipefail

# Prevent Git Bash / MSYS2 on Windows from rewriting container paths
export MSYS_NO_PATHCONV=1
export MSYS2_ARG_CONV_EXCL="*"

IMAGE="nvcr.io/nvidia/pytorch:26.02-py3"
JOB_NAME="retinanet-cpu-ddp"

runai training pytorch submit "$JOB_NAME" \
  --image "$IMAGE" \
  --annotation "k8s.v1.cni.cncf.io/networks=ssa" \
  --extended-resource "openshift.io/mellanoxnics=1" \
  --large-shm \
  --workers 2 \
  --gpu-devices-request 1 \
  --cpu-memory-request 3846 \
  --cpu-memory-limit 8996 \
  --run-as-uid 2000 \
  --run-as-gid 2000 \
  --working-dir /data/scripts/ia-gen-bench/llm/finetune-peft \
  --environment-variable PYTORCH_ALLOC_CONF=expandable_segments:True \
  --environment-variable NCCL_DEBUG="INFO" \
  --environment-variable NCCL_IB_DISABLE=1 \
  --environment-variable NCCL_SOCKET_NTHREADS=2 \
  --environment-variable NCCL_NSOCKS_PERTHREAD=2 \
  --environment-variable NCCL_SOCKET_IFNAME="net1" \
  --environment-variable CUDA_VISIBLE_DEVICES=0 \
  --existing-pvc claimname=project-001,path=/data \
  --command -- /data/scripts/train-retinanet-ddp.sh

Key Differences from LLM Training

NCCL_IB_DISABLE=1: For DDP with smaller vision models, gradient AllReduce is lightweight enough that InfiniBand RDMA isn't needed. TCP over the secondary network interface suffices, which simplifies the setup.

No FSDP configuration: The model is small enough to fit entirely on each GPU. DDP only synchronizes gradients (roughly 100-200 MB per step in FP32 for a model of this size) rather than sharding the entire model.

Same infrastructure: Despite the simpler training approach, the job still runs on the same OpenShift cluster with Run:ai scheduling, shared storage, and network policies.
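
The claim that gradient AllReduce is lightweight can be checked with back-of-the-envelope arithmetic. The sketch below assumes FP32 gradients (4 bytes per parameter), the ~34M parameter figure used for RetinaNet in this article, and a ring all-reduce, in which each GPU sends about 2 × (N − 1) / N times the payload. NCCL may pick a different algorithm in practice, so treat the numbers as an estimate; the function names are illustrative.

```python
def gradient_bytes(num_params, bytes_per_param=4):
    """Size of one full set of gradients (FP32 by default)."""
    return num_params * bytes_per_param


def ring_allreduce_bytes_per_gpu(payload_bytes, world_size):
    """Approximate bytes each GPU sends in a ring all-reduce."""
    if world_size < 2:
        return 0.0  # a single process has nothing to exchange
    return 2 * (world_size - 1) / world_size * payload_bytes


params = 34_000_000  # ~34M parameters, as quoted for RetinaNet
payload = gradient_bytes(params)
per_gpu = ring_allreduce_bytes_per_gpu(payload, world_size=2)

print(f"gradient payload: {payload / 1e6:.0f} MB")
print(f"sent per GPU per step (2 workers): {per_gpu / 1e6:.0f} MB")
```

Under these assumptions each worker exchanges on the order of a hundred megabytes per optimizer step, which a TCP link can usually absorb; a multi-billion-parameter model would move tens of gigabytes per step with the same formula, which is where RDMA becomes important.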

The Training Script: Interactive Menu

Enterprise training workflows often include management scripts with interactive menus for job lifecycle:

#!/bin/bash
set -euo pipefail

# Color definitions for terminal output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
CYAN='\033[0;36m'
BOLD='\033[1m'
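NC='\033[0m'

# Job settings used by the functions below (same values as the submission script)
IMAGE="nvcr.io/nvidia/pytorch:26.02-py3"
JOB_NAME="retinanet-cpu-ddp"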

print_header() {
    clear
    echo -e "${BOLD}${BLUE}═══════════════════════════════════════${NC}"
    echo -e "${BOLD}${BLUE}    🎯 Model Training Management       ${NC}"
    echo -e "${BOLD}${BLUE}═══════════════════════════════════════${NC}"
    echo ""
}

print_status() {
    echo -e "${CYAN}⏳ Checking job status '${JOB_NAME}'...${NC}"
    # The job is submitted as a PyTorch workload, so query the pytorch subcommand
    status=$(runai training pytorch describe "${JOB_NAME}" 2>/dev/null \
      | grep -i "Status" || echo "Job not submitted")
    echo -e " Job : ${BOLD}${JOB_NAME}${NC}"
    echo -e " Status: ${YELLOW}${status}${NC}"
    echo ""
}

action_submit() {
    echo -e "${GREEN}▶ Submitting job...${NC}"
    echo ""

    runai training pytorch submit "${JOB_NAME}" \
      --image "$IMAGE" \
      --annotation "k8s.v1.cni.cncf.io/networks=ssa" \
      --extended-resource "openshift.io/mellanoxnics=1" \
      --large-shm \
      --workers 2 \
      --gpu-devices-request 1 \
      --cpu-memory-request 3846 \
      --cpu-memory-limit 8996 \
      --run-as-uid 2000 \
      --run-as-gid 2000 \
      --working-dir /data/scripts/vision/train \
      --environment-variable PYTORCH_ALLOC_CONF=expandable_segments:True \
      --environment-variable NCCL_DEBUG="INFO" \
      --environment-variable NCCL_IB_DISABLE=1 \
      --existing-pvc claimname=project-001,path=/data \
      --command -- /data/scripts/vision/shell/train-retinanet-ddp.sh
}

# Interactive menu: submit, status, logs, delete, exec into pod

DDP Training Configuration

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models.detection import retinanet_resnet50_fpn_v2

def setup_ddp():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank

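def main():
    # num_epochs, train_dataset, collate_fn, and train_one_epoch are assumed
    # to be defined elsewhere in the training script.
    local_rank = setup_ddp()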

    # RetinaNet with ResNet-50 FPN backbone
    model = retinanet_resnet50_fpn_v2(
        num_classes=601,  # 600 Open Images V7 categories + background
        weights_backbone="IMAGENET1K_V1",  # replaces the deprecated pretrained_backbone=True
    )
    model = model.to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Optimizer
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.02,
        momentum=0.9,
        weight_decay=1e-4,
    )

    # Cosine learning-rate decay (this snippet does not include a warmup phase)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=num_epochs,
    )

    # Distributed sampler ensures each GPU gets different data
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=4,
        sampler=train_sampler,
        num_workers=4,
        pin_memory=True,
        collate_fn=collate_fn,
    )

    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # Shuffle differently each epoch
        train_one_epoch(model, optimizer, train_loader, local_rank)
        scheduler.step()

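        if dist.get_rank() == 0:
            # Only rank 0 saves checkpoints
            torch.save(model.module.state_dict(), f"retinanet_epoch_{epoch}.pt")

    # Release NCCL resources once training is finished
    dist.destroy_process_group()


if __name__ == "__main__":
    main()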
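
The configuration above calls `collate_fn` and `train_one_epoch` without showing them. The following is one possible sketch of both, not the author's implementation. It relies on the fact that torchvision detection models return a dictionary of losses in training mode, and it matches the call signature used above. The code is written against the methods the objects expose (`.to()`, `.backward()`, `.zero_grad()`, `.step()`), so it does not import `torch` itself.

```python
def collate_fn(batch):
    """Keep variable-size images and targets as tuples instead of stacking them."""
    return tuple(zip(*batch))


def train_one_epoch(model, optimizer, data_loader, device):
    """Run one pass over data_loader and return the mean training loss."""
    model.train()
    total, steps = 0.0, 0
    for images, targets in data_loader:
        # Move every image and every target tensor to this rank's GPU.
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # In training mode the detection model returns a dict of loss terms.
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())

        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients during the backward pass
        optimizer.step()

        total += float(loss)
        steps += 1
    return total / max(steps, 1)
```

Detection batches cannot be stacked into a single tensor because images and box counts differ per sample, which is why the collate function returns tuples. A production loop would typically add logging, gradient clipping, or mixed precision on top of this.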

RetinaNet Architecture Recap

RetinaNet remains competitive in 2026 for its balance of speed and accuracy:

  • Backbone: ResNet-50 with Feature Pyramid Network (FPN)
  • Head: Classification + regression subnets on each FPN level
  • Loss: Focal Loss (handles class imbalance in dense detection)
  • Anchors: Multi-scale, multi-aspect-ratio anchor boxes

Input Image (800×800)
       │
   ResNet-50 Backbone
       │
   ┌───┼───┬───┬───┐
   P3  P4  P5  P6  P7   ← FPN levels (different scales)
   │   │   │   │   │
   ├───┴───┴───┴───┴─┐
   │  Classification │  → per-anchor class probability
   │  subnet (4×conv)│
   ├─────────────────┤
   │  Regression     │  → per-anchor box offsets
   │  subnet (4×conv)│
   └─────────────────┘
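
To illustrate how Focal Loss handles the class imbalance mentioned above, here is a scalar sketch of the binary form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), using the commonly cited defaults α = 0.25 and γ = 2. It works on a single probability for readability; real training code operates on logit tensors (torchvision ships `sigmoid_focal_loss` for that), and the function name and `eps` clamp here are illustrative choices.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one predicted probability p and a target in {0, 1}."""
    # p_t is the probability the model assigns to the correct class.
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background anchor (confidently predicted as background) ...
easy_negative = focal_loss(p=0.01, target=0)
# ... versus a hard one that the model mistakes for an object.
hard_negative = focal_loss(p=0.90, target=0)
```

Because the vast majority of anchors are easy background, the modulating factor keeps them from dominating the total loss. Setting `gamma=0` removes the modulation and leaves an α-weighted cross-entropy.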

Comparison: Vision vs LLM Training on Run:ai

Aspect | RetinaNet (DDP) | Mistral 4 (FSDP)
Model size | ~34M params | ~12B params
GPU memory | ~4 GB | ~80 GB per shard
Communication | Gradient sync only | Full model sharding
InfiniBand needed | No (NCCL_IB_DISABLE=1) | Yes (critical)
Training time | Hours | Days
Batch size per GPU | 4-16 images | 1 sequence
Data format | Images + boxes | Text tokens
Framework | torchvision | trl + peft + accelerate

Both workloads share the same Run:ai cluster, the same GPU nodes, and the same PVC storage, but they require fundamentally different distributed training strategies.


DDP for vision, FSDP for LLMs. Same cluster, same scheduler, different parallelism strategies. The key is matching the distributed approach to the model's memory footprint.
