While LLM fine-tuning dominates the conversation, computer vision training workloads remain critical in enterprise AI. Object detection models like RetinaNet power manufacturing inspection, autonomous systems, and medical imaging. Here's how to train RetinaNet at scale using PyTorch DDP on Run:ai with OpenShift.
Architecture: DDP vs FSDP for Vision Models
For computer vision models (typically 30-100M parameters), you don't need FSDP's weight sharding: the model fits comfortably on a single GPU. Instead, Distributed Data Parallel (DDP) replicates the full model on each GPU and synchronizes gradients:
┌────────────────────┐     ┌────────────────────┐
│ Node 0             │     │ Node 1             │
│ Full model copy    │     │ Full model copy    │
│ Batch shard A      │────▶│ Batch shard B      │
│ Compute gradients  │◀────│ Compute gradients  │
│                    │     │                    │
│ AllReduce grads    │     │ AllReduce grads    │
│ Update weights     │     │ Update weights     │
└────────────────────┘     └────────────────────┘

| Aspect | DDP (Vision) | FSDP (LLMs) |
|---|---|---|
| Model size | 30-100M params | 7B-405B params |
| Memory per GPU | Full model (~200MB) | Sharded (fraction) |
| Communication | Gradient AllReduce only | Params + Grads + Optimizer |
| Use case | ResNet, RetinaNet, YOLO | Llama, Mistral, GPT |
| Complexity | Low | High |
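
To make the memory rows of the table concrete, here is a rough sketch that estimates per-GPU training state under both strategies. It is an illustration under stated assumptions, not a profiler: the function name `training_state_gb` is hypothetical, every tensor is assumed to be fp32, the optimizer is assumed to keep two state tensors per parameter (as Adam does; SGD with momentum keeps one), and activation memory is ignored.

```python
def training_state_gb(num_params, num_gpus, sharded,
                      bytes_per_value=4, optimizer_states=2):
    """Rough per-GPU memory (GB) for weights, gradients, and optimizer state.

    Assumptions: all tensors use `bytes_per_value` bytes; activations,
    buffers, and framework overhead are ignored.
    """
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer tensors
    total_bytes = num_params * bytes_per_value * copies
    if sharded:
        # Full sharding splits the training state evenly across ranks
        total_bytes /= num_gpus
    return total_bytes / 1e9


if __name__ == "__main__":
    ddp = training_state_gb(34e6, num_gpus=2, sharded=False)
    fsdp = training_state_gb(7e9, num_gpus=8, sharded=True)
    replicated = training_state_gb(7e9, num_gpus=8, sharded=False)
    print(f"DDP, 34M params, replicated:  {ddp:.2f} GB per GPU")
    print(f"FSDP, 7B params, 8-way shard: {fsdp:.2f} GB per GPU")
    print(f"7B params without sharding:   {replicated:.2f} GB per GPU")
```

Under these assumptions a detection model in the tens of millions of parameters needs well under 1 GB of training state per replica, while a 7B-parameter model exceeds a single GPU unless that state is sharded. This is the reason the table pairs DDP with vision models and FSDP with LLMs.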
Open Images Dataset V7
Open Images V7 is one of the largest annotated image datasets available:
- 9 million images with image-level labels
- 16 million bounding boxes across 600 categories
- 2.8 million instance segmentations
- Train/Validation/Test splits
Downloading the Dataset
# Download the official downloader script
wget https://raw.githubusercontent.com/openimages/dataset/master/downloader.py

# Create a text file with image IDs to download
# Format: $SPLIT/$IMAGE_ID
# Example:
#   train/f9e0434389a1d4dd
#   train/1a007563ebc18664
#   test/ea8bfd4e765304db

# Download with parallel processes
python downloader.py "$IMAGE_LIST_FILE" \
  --download_folder="$DOWNLOAD_FOLDER" \
  --num_processes=5

For enterprise environments, pre-download the dataset to a shared PVC (Persistent Volume Claim) accessible by all training nodes.
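
The image list that the downloader expects (one `$SPLIT/$IMAGE_ID` entry per line) is easy to generate programmatically. The sketch below is one way to do it; the helper names `format_image_list` and `write_image_list` are hypothetical, and the 16-character lowercase hex check is an assumption inferred from the example IDs above rather than a documented rule.

```python
import re
from pathlib import Path

VALID_SPLITS = {"train", "validation", "test"}
# Assumed ID shape, inferred from the example IDs (16 lowercase hex characters)
IMAGE_ID_PATTERN = re.compile(r"^[0-9a-f]{16}$")


def format_image_list(entries):
    """Return downloader lines ("split/image_id") for (split, image_id) pairs."""
    lines = []
    for split, image_id in entries:
        if split not in VALID_SPLITS:
            raise ValueError(f"unknown split: {split!r}")
        if not IMAGE_ID_PATTERN.match(image_id):
            raise ValueError(f"unexpected image ID: {image_id!r}")
        lines.append(f"{split}/{image_id}")
    return lines


def write_image_list(entries, output_path):
    """Write one entry per line and return the number of entries written."""
    lines = format_image_list(entries)
    text = "".join(line + "\n" for line in lines)
    Path(output_path).write_text(text, encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    sample = [
        ("train", "f9e0434389a1d4dd"),
        ("train", "1a007563ebc18664"),
        ("test", "ea8bfd4e765304db"),
    ]
    count = write_image_list(sample, "image_list.txt")
    print(f"wrote {count} entries to image_list.txt")
```

The resulting file can be passed to `downloader.py` as `$IMAGE_LIST_FILE`. Validating entries up front avoids discovering a malformed line partway through a long download.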
Run:ai Job Submission
The training job submission uses Run:ai's PyTorch distributed training support:
#!/bin/bash
set -euo pipefail

# Prevent Git Bash (MSYS) on Windows from rewriting container paths
export MSYS_NO_PATHCONV=1
export MSYS2_ARG_CONV_EXCL="*"

IMAGE="nvcr.io/nvidia/pytorch:26.02-py3"
JOB_NAME="retinanet-cpu-ddp"

runai training pytorch submit "$JOB_NAME" \
  --image "$IMAGE" \
  --annotation "k8s.v1.cni.cncf.io/networks=ssa" \
  --extended-resource "openshift.io/mellanoxnics=1" \
  --large-shm \
  --workers 2 \
  --gpu-devices-request 1 \
  --cpu-memory-request 3846 \
  --cpu-memory-limit 8996 \
  --run-as-uid 2000 \
  --run-as-gid 2000 \
  --working-dir /data/scripts/ia-gen-bench/llm/finetune-peft \
  --environment-variable PYTORCH_ALLOC_CONF=expandable_segments:True \
  --environment-variable NCCL_DEBUG="INFO" \
  --environment-variable NCCL_IB_DISABLE=1 \
  --environment-variable NCCL_SOCKET_NTHREADS=2 \
  --environment-variable NCCL_NSOCKS_PERTHREAD=2 \
  --environment-variable NCCL_SOCKET_IFNAME="net1" \
  --environment-variable CUDA_VISIBLE_DEVICES=0 \
  --existing-pvc claimname=project-001,path=/data \
  --command -- /data/scripts/train-retinanet-ddp.sh

Key Differences from LLM Training
NCCL_IB_DISABLE=1: For DDP with smaller vision models, gradient AllReduce is lightweight enough that InfiniBand RDMA isn't needed. TCP over the secondary network interface suffices, simplifying the setup.
No FSDP configuration: The model is small enough to fit entirely on each GPU. DDP only synchronizes gradients (roughly 120-400 MB per step in fp32 for models of 30-100M parameters) rather than sharding the entire model.
Same infrastructure: Despite the simpler training approach, the job still runs on the same OpenShift cluster with Run:ai scheduling, shared storage, and network policies.
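
A back-of-the-envelope calculation shows why TCP can be enough for gradient synchronization at this model size. The sketch below assumes a bandwidth-optimal ring all-reduce, in which each rank sends 2(N-1)/N times the gradient payload, and it ignores latency as well as DDP's overlap of communication with the backward pass. NCCL may pick a different algorithm, the link speeds are hypothetical, and the function names are illustrative only.

```python
def ring_allreduce_bytes(payload_bytes, world_size):
    """Bytes each rank sends in a bandwidth-optimal ring all-reduce."""
    if world_size < 2:
        return 0.0  # a single rank has nothing to exchange
    return 2 * (world_size - 1) / world_size * payload_bytes


def allreduce_seconds(num_params, world_size, link_gbit_per_s, bytes_per_grad=4):
    """Lower-bound transfer time per step (no latency, no compute overlap)."""
    payload = num_params * bytes_per_grad
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return ring_allreduce_bytes(payload, world_size) / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical link speeds; substitute the real speed of the secondary network
    for gbit in (10, 25, 100):
        vision = allreduce_seconds(34e6, world_size=2, link_gbit_per_s=gbit)
        llm = allreduce_seconds(7e9, world_size=2, link_gbit_per_s=gbit)
        print(f"{gbit:>3} Gbit/s: 34M params {vision * 1e3:7.1f} ms, "
              f"7B params {llm:6.2f} s")
```

For a model in the tens of millions of parameters the estimate stays in the tens to low hundreds of milliseconds per step on ordinary Ethernet, whereas a multi-billion-parameter model would spend seconds per step on the same link.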
The Training Script: Interactive Menu
Enterprise training workflows often include management scripts with interactive menus for job lifecycle:
#!/bin/bash
set -euo pipefail

# Same job name and image as the submission script above
IMAGE="nvcr.io/nvidia/pytorch:26.02-py3"
JOB_NAME="retinanet-cpu-ddp"

# Color definitions for terminal output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
CYAN='\033[0;36m'
BOLD='\033[1m'
NC='\033[0m'

print_header() {
  clear
  echo -e "${BOLD}${BLUE}═══════════════════════════════════════${NC}"
  echo -e "${BOLD}${BLUE}  🎯 Model Training Management  ${NC}"
  echo -e "${BOLD}${BLUE}═══════════════════════════════════════${NC}"
  echo ""
}

print_status() {
  local status
  echo -e "${CYAN}⏳ Checking job status '${JOB_NAME}'...${NC}"
  # The job is submitted as a PyTorch training workload, so describe it as one
  status=$(runai training pytorch describe "${JOB_NAME}" 2>/dev/null \
    | grep -i "Status" || echo "Job not submitted")
  echo -e "  Job   : ${BOLD}${JOB_NAME}${NC}"
  echo -e "  Status: ${YELLOW}${status}${NC}"
  echo ""
}

action_submit() {
  echo -e "${GREEN}▶ Submitting job...${NC}"
  echo ""
  runai training pytorch submit "${JOB_NAME}" \
    --image "${IMAGE}" \
    --annotation "k8s.v1.cni.cncf.io/networks=ssa" \
    --extended-resource "openshift.io/mellanoxnics=1" \
    --large-shm \
    --workers 2 \
    --gpu-devices-request 1 \
    --cpu-memory-request 3846 \
    --cpu-memory-limit 8996 \
    --run-as-uid 2000 \
    --run-as-gid 2000 \
    --working-dir /data/scripts/vision/train \
    --environment-variable PYTORCH_ALLOC_CONF=expandable_segments:True \
    --environment-variable NCCL_DEBUG="INFO" \
    --environment-variable NCCL_IB_DISABLE=1 \
    --existing-pvc claimname=project-001,path=/data \
    --command -- /data/scripts/vision/shell/train-retinanet-ddp.sh
}

# Interactive menu: submit, status, logs, delete, exec into pod

DDP Training Configuration
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import retinanet_resnet50_fpn_v2

# train_dataset, collate_fn, train_one_epoch, and num_epochs are assumed to be
# defined elsewhere in the training script.


def setup_ddp():
    # Reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank


def main():
    local_rank = setup_ddp()

    # RetinaNet with ResNet-50 FPN backbone
    model = retinanet_resnet50_fpn_v2(
        weights=None,  # detection heads are trained from scratch
        weights_backbone=ResNet50_Weights.IMAGENET1K_V1,  # pretrained backbone
        num_classes=601,  # 600 Open Images V7 box categories + background
    )
    model = model.to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Optimizer
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.02,
        momentum=0.9,
        weight_decay=1e-4,
    )

    # Cosine learning-rate schedule (no warmup phase in this snippet)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=num_epochs,
    )

    # Distributed sampler ensures each GPU gets different data
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=4,
        sampler=train_sampler,
        num_workers=4,
        pin_memory=True,
        collate_fn=collate_fn,
    )

    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # Shuffle differently each epoch
        train_one_epoch(model, optimizer, train_loader, local_rank)
        scheduler.step()

        if dist.get_rank() == 0:
            # Only rank 0 saves checkpoints
            torch.save(model.module.state_dict(), f"retinanet_epoch_{epoch}.pt")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

RetinaNet Architecture Recap
RetinaNet remains competitive in 2026 for its balance of speed and accuracy:
- Backbone: ResNet-50 with Feature Pyramid Network (FPN)
- Head: Classification + regression subnets on each FPN level
- Loss: Focal Loss (handles class imbalance in dense detection)
- Anchors: Multi-scale, multi-aspect-ratio anchor boxes
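
The Focal Loss item in the list above is the piece that most distinguishes RetinaNet, so a small worked example may help. The sketch below implements the per-anchor binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. The defaults alpha=0.25 and gamma=2 follow the commonly used RetinaNet settings; the function names are illustrative, and a real training loop would use a vectorized implementation such as `torchvision.ops.sigmoid_focal_loss` instead.

```python
import math


def cross_entropy(p, target):
    """Binary cross-entropy for one prediction; p is P(positive) in (0, 1)."""
    p_t = p if target == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Focal loss for one anchor/class pair.

    p: predicted probability of the positive class, in (0, 1).
    target: 1 for a positive anchor, 0 for background.
    """
    p_t = p if target == 1 else 1.0 - p            # probability of the true label
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t)^gamma shrinks the loss of well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    cases = [
        ("easy background", 0.01, 0),
        ("hard background", 0.70, 0),
        ("hard positive", 0.10, 1),
    ]
    for name, p, target in cases:
        ce = cross_entropy(p, target)
        fl = binary_focal_loss(p, target)
        print(f"{name:16s} CE={ce:.6f}  FL={fl:.6f}")
```

Running it shows the easy background anchor contributing a vanishing focal loss compared with its cross-entropy, while the hard examples keep most of their weight. Since the overwhelming majority of anchors in dense detection are easy background, this keeps them from dominating the gradient.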
Input Image (800×800)
         │
ResNet-50 Backbone
         │
 ┌───┬───┼───┬───┐
 P3  P4  P5  P6  P7     ← FPN levels (different scales)
 │   │   │   │   │
┌┴───┴───┴───┴───┴┐
│ Classification  │     ← per-anchor class probability
│ subnet (4×conv) │
├─────────────────┤
│ Regression      │     ← per-anchor box offsets
│ subnet (4×conv) │
└─────────────────┘

Comparison: Vision vs LLM Training on Run:ai
| Aspect | RetinaNet (DDP) | Mistral 4 (FSDP) |
|---|---|---|
| Model size | ~34M params | ~12B params |
| GPU memory | ~4 GB | ~80 GB per shard |
| Communication | Gradient sync only | Full model sharding |
| InfiniBand needed | No (NCCL_IB_DISABLE=1) | Yes (critical) |
| Training time | Hours | Days |
| Batch size per GPU | 4-16 images | 1 sequence |
| Data format | Images + boxes | Text tokens |
| Framework | torchvision | trl + peft + accelerate |
Both workloads share the same Run:ai cluster, the same GPU nodes, and the same PVC storage, but require fundamentally different distributed training strategies.
DDP for vision, FSDP for LLMs. Same cluster, same scheduler, different parallelism strategies. The key is matching the distributed approach to the model's memory footprint.