From Deadlocks to DDP: My Journey Building a 3-Node PyTorch Cluster

When you are training deep learning models, you inevitably hit a wall where a single GPU just isn’t cutting it anymore. A standard way to scale PyTorch training across multiple machines is Distributed Data Parallel (DDP). But reading the documentation is one thing; actually building a distributed network from scratch is a completely different beast. I didn’t want to just rent a cloud cluster. I wanted to understand the lowest-level plumbing of how computers talk to each other to solve math. So, I decided to build a local simulation lab. No fancy Kubernetes, no Slurm workload manager, just plain Debian virtual machines, a local network bridge, and a lot of debugging. ...

March 28, 2026 · 6 min · 1221 words · Chinmay Dharmik

I Broke My Fedora Setup... So I Rebuilt It with QEMU (And Never Looked Back)

Lab Entry — Day 0: “Windows update ran. Fedora didn’t boot. Data recovered. Sanity: debatable.”

The Break That Started Everything

I didn’t plan to go deep into virtualisation. It happened because something broke.

What Actually Happened (The Timeline)

Here’s the sequence, because the specifics matter:

Time      Event
T+0       Windows runs an update in the background on the shared drive
T+0       Windows update rewrites the bootloader, blows past GRUB
T+1 min   Reboot. GRUB gone. Fedora doesn’t boot.
T+2 hrs   Live USB recovery attempt. Chroot. grub2-install. Partial recovery.
T+4 hrs   Most data recovered. Configs partially intact. Some work lost.
T+1 day   Full reinstall. Reconfiguration. Environment rebuilt from scratch.

The data loss was manageable. The downtime was the actual problem. A full day of rebuilding an environment that should have never broken in the first place. ...

March 12, 2026 · 20 min · 4067 words · Chinmay Dharmik