When you are training deep learning models, you inevitably hit a wall where a single GPU just isn’t cutting it anymore. The industry standard for scaling compute across multiple machines is PyTorch’s Distributed Data Parallel (DDP).

But reading the documentation is one thing; actually building a distributed network from scratch is a completely different beast.

I didn’t want to just rent a cloud cluster. I wanted to understand the absolute lowest-level plumbing of how computers talk to each other to solve math. So, I decided to build a local simulation lab. No fancy Kubernetes, no Slurm workload managers—just pure Debian Virtual Machines, a local network bridge, and a lot of debugging.

Here is the experiment log of how I built a 3-node PyTorch cluster, the deadlocks I hit, and the absolute rush of finally getting them to sync.


The Hardware: A Cluster in a Backpack 💻

Before we get into the networking, here is the hardware running this entire experiment. I am hosting this “cluster” locally on my daily driver:

  • Host OS: Fedora Linux 43 (Toolbx Container)
  • Machine: Acer Predator PHN16-71
  • CPU: 13th Gen Intel Core i7-13700HX
  • Memory: 32GB RAM (Running at about 73% utilization during the tests)
  • GPU: NVIDIA GeForce RTX 4070 Max-Q (though for this experiment, we are going strictly CPU-only to test the network).

Inside this host, I spun up three Debian VMs using virt-manager:

  1. vm-master (192.168.122.230)
  2. vm-worker-1 (192.168.122.189)
  3. vm-worker-2 (192.168.122.185)

Phase 1: The Vocabulary of Distributed Training 🧠

In a DDP architecture, every machine holds an identical replica of your model. During training, the dataset is sliced up, and each node processes a different batch simultaneously. During the loss.backward() step, they pause, synchronize their gradients over the network, and step together.
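In code, that loop looks roughly like this (a minimal sketch: the tiny model and hyperparameters are illustrative, and the env-var defaults just let it run standalone as a one-process world; on the cluster they come from the environment):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# On the cluster, RANK / WORLD_SIZE / MASTER_* come from the environment;
# the defaults below just let this sketch run standalone as a 1-process world.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29503")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
dist.init_process_group(backend="gloo")

# DDP broadcasts rank 0's weights at construction time,
# so every replica starts identical.
model = DDP(nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10)))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()   # gradients are all-reduced across ranks here
    optimizer.step()  # every replica applies the same averaged update
    return loss.item()

# One step on a fake batch (each rank would see a different real batch)
loss = train_step(torch.randn(8, 784), torch.randint(0, 10, (8,)))
dist.destroy_process_group()
```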

To make this happen, the cluster needs to know who is who.

  • World Size: The total number of processes (3 in my case).
  • Rank: A unique identifier from 0 to 2.
  • Backend: Since I am simulating this on CPU VMs, I can’t use NVIDIA’s NCCL. Instead, I used the Gloo backend, which is designed for CPU-to-CPU communication over standard Ethernet.
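Wiring those three concepts together is a single call. Here is a minimal sketch; the env-var defaults are only there so it runs standalone as a one-process world, since on the real cluster they come from the launch script:

```python
import os
import torch.distributed as dist

def init_cluster():
    """Join the process group; rank/world/master come from the environment."""
    dist.init_process_group(backend="gloo", init_method="env://")
    return dist.get_rank(), dist.get_world_size()

# On the VMs these are exported by the launch script; the defaults
# let the sketch run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29504")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

rank, world = init_cluster()
print(f"Rank {rank} of {world} checked in")
dist.destroy_process_group()
```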

Here is what the architecture looks like:

graph TD
    A[Master Node: 192.168.122.230] -->|Rank 0| D(Network Bridge: enp1s0)
    B[Worker 1: 192.168.122.189] -->|Rank 1| D
    C[Worker 2: 192.168.122.185] -->|Rank 2| D
    D -->|Gloo Backend All-Reduce| E((Gradient Synchronization))

Phase 2: The “Aha!” Moment and The 0.0% CPU Deadlock 🛑

My first attempt at launching the cluster used standard environment variables. I wrote the code, spun up the three terminals, ran the Python script, and… nothing. The terminals just sat there. No errors. No crashing. Just a deafening silence.

I SSH’d into the Master node and ran top. The python process was sitting at exactly 0.0% CPU utilization with a status of “Sleeping”. It was a classic deadlock: dist.init_process_group is a blocking call, and the script halts until every node checks in. But why weren’t they checking in?

I looked at the environment variables I had manually exported on my third terminal: export RANK=0

Wait. I had accidentally assigned RANK=0 to both the Master and Worker 2. In a distributed cluster, there can only be one Rank 0. It’s the captain. By assigning two captains, I left Rank 2 unclaimed, so the rendezvous could never complete: every node sat waiting for a process that didn’t exist.

I quickly fixed the export on Worker 2: export RANK=2

I hit enter. Instantly, all three terminal windows exploded with verbose logs scrolling in perfect unison. top jumped from 0.0% to 100% CPU usage across all Virtual Machines. The barrier dropped, the nodes shook hands, and they started crunching MNIST batches together.

It was amazing. That specific feeling of troubleshooting a distributed system until it finally clicks into place is unmatched.
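In hindsight, one defensive tweak would have surfaced the bug immediately: give the rendezvous a timeout so a bad rank assignment raises an exception instead of sleeping forever. A minimal sketch (the standalone defaults make it a one-process world):

```python
import datetime
import os
import torch.distributed as dist

# On the cluster, RANK/WORLD_SIZE/MASTER_* come from the exported
# variables; the defaults let the sketch run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# With a timeout, a misnumbered RANK surfaces as an exception after
# 60 seconds instead of an eternal 0.0% CPU sleep.
dist.init_process_group(
    backend="gloo",
    timeout=datetime.timedelta(seconds=60),
)
joined = dist.is_initialized()
print("handshake complete" if joined else "handshake failed")
dist.destroy_process_group()
```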


Phase 3: Bypassing torchrun and Forcing the Handshake 🌐

Normally, people use the torchrun utility to automate this. But in a highly virtualized environment, torchrun often gets confused by loopback addresses and virtual bridges. My nodes were throwing TimeoutError and socket.gaierror left and right because they couldn’t find the Master’s port 29500.

To fix this, I had to drop the automation and explicitly force PyTorch to use my virtual bridge interface (enp1s0).

I created a custom launch.sh bash script for each VM:

#!/bin/bash
# 1. Kill old zombie processes
pkill -9 python

# 2. Set the Environment
export MASTER_ADDR=192.168.122.230
export MASTER_PORT=29500
export WORLD_SIZE=3
export RANK=$1
export LOCAL_RANK=0

# The Magic Fix: Force the network pipe
export GLOO_SOCKET_IFNAME=enp1s0  

# 3. Launch
echo "Launching as Rank $RANK..."
./.venv/bin/python train.py
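Before trusting the full training run, a ten-line smoke test that initializes the group and pushes one tensor through an all_reduce tells you whether the forced interface actually works. A sketch (run it on each VM via the same launch.sh; the defaults make it runnable standalone as a one-process world):

```python
# smoke_test.py -- verify the Gloo handshake before real training
import os
import torch
import torch.distributed as dist

# On the cluster these come from launch.sh; the defaults let the
# script run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29505")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="gloo")
rank = dist.get_rank()
world = dist.get_world_size()

# Every rank contributes its own rank number; after the all_reduce
# each node should hold the sum 0 + 1 + ... + (world - 1).
t = torch.tensor([float(rank)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
expected = world * (world - 1) / 2
print(f"Rank {rank}: all_reduce -> {t.item()} (expected {expected})")

dist.destroy_process_group()
```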

Phase 4: Data Barriers and Neural Network Profiling ⏱️

Writing the train.py script for DDP requires some careful maneuvering. If all three nodes try to download the MNIST dataset to the same shared directory at the exact same time, the files get corrupted.

To solve this, you use a Barrier:

# Only the global Rank 0 downloads the data. (Gating on LOCAL_RANK
# would pass on every VM here, since each node runs a single process
# with LOCAL_RANK=0.)
if dist.get_rank() == 0:
    datasets.MNIST(root='./data', train=True, download=True)

# Everyone else waits at this wall until Rank 0 is done
dist.barrier()

# Now all ranks can safely load the data from disk
train_dataset = datasets.MNIST(root='./data', train=True, download=False)
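The barrier handles downloading; the actual slicing of the dataset across ranks is DistributedSampler’s job. A sketch, using a small synthetic stand-in for MNIST so it is self-contained (standalone defaults make it a one-process world):

```python
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Standalone defaults; the cluster gets these from launch.sh.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
dist.init_process_group(backend="gloo")

# Synthetic stand-in for datasets.MNIST: 1000 fake images plus labels.
train_dataset = TensorDataset(torch.randn(1000, 1, 28, 28),
                              torch.randint(0, 10, (1000,)))

# DistributedSampler hands each rank a disjoint 1/world_size shard.
sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)

for epoch in range(1):
    sampler.set_epoch(epoch)  # reshuffle across ranks each epoch
    for images, labels in loader:
        pass  # real forward/backward/step goes here

shard_size = len(sampler)
print(f"Rank {dist.get_rank()} sees {shard_size} of {len(train_dataset)} samples")
dist.destroy_process_group()
```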

I also added the PyTorch Profiler around my training loop. I wanted to see the “Distributed Tax.” Because I am using Gloo over a virtual network bridge, every gradient sync requires TCP packets.

When the profiler output printed, it revealed the truth of Amdahl’s Law: for a model as small as my 2-layer CNN, the dist::all_reduce (communication) actually took longer than the model_inference (computation). It was the perfect visual proof that you only scale to multiple nodes when your math takes significantly longer than your network lag.
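For reference, the wrapping looks roughly like this: a sketch of torch.profiler around a single forward pass of a toy CNN (the layer sizes are illustrative; in the real DDP run the gloo communication ops appear in the same table as the compute kernels):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy 2-layer CNN standing in for the real model.
model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(8 * 26 * 26, 10))
batch = torch.randn(64, 1, 28, 28)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(batch)

# In the real DDP run, the all-reduce ops show up in this table next
# to the aten::* compute kernels, which is how the communication vs.
# computation split becomes visible.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```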


Phase 5: The “One-Click” Homelab Cluster 🚀

Opening three terminals to start a run gets old fast. For the final touch, I automated the entire lab.

First, I generated SSH keys on the Master and pushed them to the workers so I wouldn’t have to type passwords. Then, I wrote an orchestration script that uses rsync to push any code changes from the Master to the workers, starts the workers in the background, and launches the Master in the foreground.
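The key setup is the standard one-time dance (a sketch; the user and worker IPs match my VMs, so adjust for yours):

```shell
# One-time: generate a key on the Master (no passphrase, for automation)
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

# Push the public key to each worker so ssh/rsync stop asking for passwords
ssh-copy-id user@192.168.122.189
ssh-copy-id user@192.168.122.185

# Verify: this should print the hostname without a password prompt
ssh user@192.168.122.189 hostname
```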

#!/bin/bash
# sync_cluster.sh && cluster_run.sh combined concept:

# 0. Kill background workers when this script exits. Set the trap
#    first, so a Ctrl-C mid-run doesn't leave zombies on the workers.
trap "kill 0" EXIT

# 1. Push code delta
rsync -avz --exclude '.venv' ./ user@192.168.122.189:~/distributed_computing/
rsync -avz --exclude '.venv' ./ user@192.168.122.185:~/distributed_computing/

# 2. Start Workers in background
ssh user@192.168.122.185 "cd ~/distributed_computing && ./launch.sh 2" &
ssh user@192.168.122.189 "cd ~/distributed_computing && ./launch.sh 1" &

# 3. Start Master in foreground
./launch.sh 0

Now, I tweak my neural network on the Master, type ./cluster_run.sh, and watch my 3-node homelab spring to life completely autonomously.


The Verdict

Building a distributed cluster from scratch is frustrating, tedious, and incredibly rewarding. When you bypass the cloud abstractions and physically wire up the RANK logic yourself, the “magic” of AI infrastructure disappears, and you are left with a profound understanding of how systems actually scale.

What’s next? Now that the CPU/Gloo plumbing is bulletproof, the logical next step is to introduce actual silicon. I might attempt to bridge my laptop’s RTX 4070 with an external eGPU setup, or completely overhaul this logic for an LLM fine-tuning run.

If you have a laptop with some RAM to spare, I challenge you to spin up a few VMs and try to get them to shake hands. You will learn more about PyTorch in that weekend than in a month of reading docs.