Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway

[ music | Depeche Mode – Route 66 ]

I was in Houston a few weeks ago, and I had many hundreds of thousands of multi-megabyte files I needed to shove through a 20Mbit upstream connection to Azure. Strictly speaking, the problem here wasn’t how long it would take to send the file data; that could be done in two-ish days on that pipe. However, 500GB of data takes a lot more than 55 hours when it’s spread over hundreds of thousands of files being written to a Windows server on NTFS. This transfer was looking to take more than a week to complete.
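For anyone who wants to check the 55-hour figure, it is plain arithmetic on the link speed. Here is a quick sketch in Python, assuming decimal units (500GB as 500 × 10^9 bytes, 20Mbit as 20 × 10^6 bits per second) and ignoring protocol overhead entirely; the function name is just something I made up for illustration.

```python
def raw_transfer_hours(size_bytes: float, link_bits_per_sec: float) -> float:
    """Hours needed to push size_bytes through the link, ignoring all overhead."""
    seconds = size_bytes * 8 / link_bits_per_sec  # 8 bits per byte
    return seconds / 3600


# Assumed decimal units: 500 GB of data over a 20 Mbit/s upstream link.
hours = raw_transfer_hours(500e9, 20e6)
print(f"{hours:.1f} hours")  # prints "55.6 hours"
```

That is the best case, where the pipe stays full the whole time.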

To keep track of all your files on a disk, the file system keeps “catalogs” of everything. Just like the card catalog in a library tells you where every book is, a file system has similar structures to keep track of where every part of a file is, where the empty space is, etc. So when you create a file, the file system has to keep track of every single action, recording where every part of every file is as it happens. There can be a lot of time overhead as this data is created, recorded, and verified, much more so than just dumping bytes to the disk. When you’re creating roughly 500,000 files across a slow link, every single millisecond of extra delay per file adds up to more than eight minutes across the overall transfer.
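To see how quickly the per-file cost piles up, here is a small sketch, assuming about 500,000 files. The per-file delays in the loop are hypothetical values picked for illustration, not measurements from this transfer, and the function name is my own.

```python
def overhead_minutes(file_count: int, extra_ms_per_file: float) -> float:
    """Total added time in minutes when every file costs extra_ms_per_file."""
    return file_count * extra_ms_per_file / 1000 / 60


FILE_COUNT = 500_000  # assumed round number for the file count

# Hypothetical per-file delays, in milliseconds.
for ms in (1, 10, 100):
    minutes = overhead_minutes(FILE_COUNT, ms)
    print(f"{ms:>3} ms per file -> {minutes:,.1f} minutes")
```

At a hypothetical 100 ms per file, the overhead alone comes to almost 14 hours, before a single byte of file data is counted.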

The solution? My plane trip home. We slapped an NVMe SSD into the VMware server, created a new disk on the virtual server, copied the files, and I took the NVMe drive home, where I have a symmetrical gigabit internet connection (it’s 1 gig up and down; most cable internet connections are much faster down than up). I popped the drive into one of my home VMware servers, attached it to a Windows machine, and created the VPN link in the VM to Azure. I still had the problem of 500,000 files, however.

It still would have taken a couple of days with file system overhead across a network link. But it only took a couple of hours to zip up all the files into a single giant ZIP file, which took 2 hours to upload and then 3 hours to decompress on the other end. Creating files on a local file system is almost always faster than creating them across a network link, so zipping and unzipping was far, far faster than the file-by-file network transfer would have been.
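The pack-once, unpack-once idea can be sketched in a few lines of Python with the standard `zipfile` module. This is not the tool I used for the transfer, just an illustration of the technique; the function names and the paths in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_tree(source_dir: str, archive_path: str) -> int:
    """Pack every file under source_dir into one ZIP; return the file count."""
    source = Path(source_dir)
    archive = Path(archive_path).resolve()
    count = 0
    # allowZip64 (the default) is needed once the archive grows past 4 GiB.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for path in sorted(source.rglob("*")):
            # Skip directories, and skip the archive itself if it lives
            # inside the tree being packed.
            if path.is_file() and path.resolve() != archive:
                zf.write(path, arcname=path.relative_to(source).as_posix())
                count += 1
    return count


def unzip_tree(archive_path: str, dest_dir: str) -> None:
    """Unpack the archive on the receiving side, recreating the folder tree."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)


# Hypothetical usage:
#   zip_tree(r"D:\export", r"E:\export.zip")    # on the sending machine
#   unzip_tree(r"E:\export.zip", r"D:\import")  # on the receiving machine
```

Note that this sketch only stores files, so empty directories are not recreated on the other end.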

This is why almost 40 years ago Andrew Tanenbaum came up with the brilliant quote, “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.”