Getting to the root of things

Written by Simon Bisson and Mary Branscombe

I got up the other morning to find that my network wasn't behaving at all well. Somewhere along the line I was getting significant amounts of packet loss, and pings were taking an order of magnitude longer than normal to travel up my DSL connection.

It took a while to diagnose the problem. I first tried several web-based tools to test my connectivity, but I didn't learn anything from them that I hadn't learnt from a ping or two and a traceroute or three. Latency was way up, and there was definite packet loss. So I started delving into the diagnostic tools built into my router, which let me see what's using my network connection – and just how much bandwidth is being used by each machine.
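If you'd rather do that sort of check from a script than by eye, something along these lines does the job – a minimal sketch, assuming a Unix-style ping on the path, with the target host and sample count picked purely for illustration:

# A rough script version of the checks I was doing by hand: fire a batch of
# pings at a known host and report latency and packet loss.
import re
import subprocess

TARGET = "8.8.8.8"   # any reliable host will do
COUNT = 20

def ping_once(host: str, timeout_s: int = 2) -> float | None:
    """Return the round-trip time in ms, or None if the ping was lost."""
    # Uses the standard Unix ping; on Windows the flags are -n and -w instead.
    proc = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True, text=True,
    )
    match = re.search(r"time[=<]([\d.]+) ms", proc.stdout)
    return float(match.group(1)) if match else None

results = [ping_once(TARGET) for _ in range(COUNT)]
replies = [r for r in results if r is not None]
loss = 100 * (COUNT - len(replies)) / COUNT

print(f"packet loss: {loss:.0f}%")
if replies:
    print(f"latency: min {min(replies):.1f} ms, "
          f"avg {sum(replies)/len(replies):.1f} ms, max {max(replies):.1f} ms")

It won't tell you anything a ping or two wouldn't, but run regularly it gives you a baseline to compare against when things feel slow.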

One machine in particular was using a lot of bandwidth, more than enough to be causing the problems. Now I had something to investigate.

The IP address of the network hog was the address of a test Windows Server 8 VM I'd set up a day or so before. It was also the VM I'd chosen to use to test Microsoft's new server cloud backup service the previous evening.

I fired up a remote desktop session, logged in, and discovered that the initial backup job that I'd scheduled for 3am was still running, and it was attempting to use all my 1Mbps upstream to create its first full disk replica. I cancelled the job, and checked the state of the network. Pings were back to normal, as was network speed. I'd found the reason why the network connection had become congested, but I needed to understand just why things had gone wrong. After all, I'd been using cloud backup tools like CrashPlan and Mozy on desktop and mobile devices for years with no problems.

And that initial assumption was the root of my problems. I'd assumed an enterprise cloud backup tool for a server would behave just like an SME or consumer service, which, as you can guess, was a mistake. The answer had been staring me in the face, too: while I was clicking through the various setup dialogues I'd skipped past the one that lets you throttle the service during working hours. Without setting bandwidth limits, I'd let the server start sending its initial disk image into the cloud as fast as possible. With a gigabit network in the office, and only an ADSL 2+ connection to the outside world, there was a bottleneck – and one that turned out to be all too easy to fill.
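To put some rough numbers on it – and the disk image size here is just an assumption for the sake of the arithmetic, while the 1Mbps upstream is the real figure – an initial replica of that sort takes days, not hours:

# Back-of-the-envelope sums for why the initial replica swamped the link.
image_gb = 60                                  # assumed size of the initial disk replica
upstream_mbps = 1                              # ADSL upstream, as above

bits_to_send = image_gb * 8 * 1000**3          # GB -> bits (decimal units)
seconds = bits_to_send / (upstream_mbps * 1000**2)
print(f"~{seconds / 86400:.1f} days to upload {image_gb}GB at {upstream_mbps}Mbps")
# roughly 5.6 days of a saturated upstream, unless the service is throttled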

So what were the lessons?

Firstly, I failed to check what the bandwidth requirements of the backup service would be. Secondly, I assumed it would behave just like the other cloud backup services I'd used. Thirdly, I didn't actually read a dialog box before clicking through it.

That final one is the real problem. I was doing too many things at once and just clicked through a dialog box without checking the tabs. Just because it looked like the familiar Windows Server Backup tool didn't mean it would behave like it…

Root cause investigations aren't just for the big cloud providers. They're for your networks, and your private clouds. They don't need to be formal, either: all you need to do is document what went wrong, and why. If you don't carry out some form of analysis after something has gone wrong, and document your findings, you'll make the same mistakes and use the same faulty assumptions again and again – and your colleagues won't get to share in the benefits of the lessons you've learned.

Simon Bisson
