How do you benchmark real-world work?

Most of the technical reviews of Windows Vista I've read recently focus on speeds and feeds. But does that granular approach miss the real point of owning and using a PC? Can any stopwatch-based measurement of isolated tasks performed by individual hardware and software components really measure the worth of a technology investment? I don't think so. What really matters is usability, a subject I've been thinking and writing about for nearly two decades now. But what's the best way to measure usability? The answer isn't as simple as you might think.

Adrian Kingsley-Hughes and I have been focusing lately on a tiny aspect of PC performance. He ran two sets of file management benchmarks on a test PC in his lab, and I performed similar tests on a machine in my lab. Results? Inconclusive.
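For readers curious what a stopwatch-style file management benchmark looks like in practice, here's a minimal, hypothetical sketch (not the actual test suite either of us used) that times repeated directory copies and reports the median, which is less sensitive to one-off disk or cache hiccups than a single run:

```python
import os
import shutil
import statistics
import tempfile
import time
from pathlib import Path

def make_test_tree(root: Path, files: int = 200, size: int = 4096) -> None:
    """Create a small directory tree of random-content files to copy."""
    root.mkdir(parents=True, exist_ok=True)
    for i in range(files):
        (root / f"file_{i:04d}.bin").write_bytes(os.urandom(size))

def time_copy(src: Path, runs: int = 5) -> float:
    """Time copying the tree several times; return the median seconds."""
    samples = []
    for _ in range(runs):
        with tempfile.TemporaryDirectory() as dst:
            t0 = time.perf_counter()
            shutil.copytree(src, Path(dst) / "copy")
            samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        src = Path(work) / "src"
        make_test_tree(src)
        print(f"median copy time: {time_copy(src):.3f}s")
```

Even a simple harness like this illustrates the problem: the numbers it produces are precise, repeatable, and almost entirely disconnected from how the machine feels to use.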

This is not a new question for me. Back in the early 1990s, when I was editor of the late, lamented PC Computing, we differentiated our product reviews from those of sister publication PC Magazine by focusing on usability. The highly regarded PC Magazine Labs was the quintessential "speeds and feeds" shop. We went to the extreme of spending a small fortune (I still remember the budget battles) building a state-of-the-art usability lab and hiring usability professionals to run it.

I liked our reviews better than the ones at PC Mag because we didn't have a one-size-fits-all conclusion. Instead, using the usability data, we tried to determine which product was a better fit for readers and prospective buyers with different needs. I think that approach still works today.

In the Talkback section of my earlier post, there's a lively discussion of what sort of benchmarking would work better than flawed speed tests that don't map to real-world activities. The short version, from commenter frgough, is that Adrian and I should

simply do stopwatch tests on their normal daily workflow and see how the two operating systems compare, because, at the end of the day, that's what it comes down to.

Easier said than done. Here's a short list of lessons I learned from the PC Computing usability lab that are still valuable today:

Preconceptions affect perceptions. In the case of Windows Vista, that's a double whammy. The relentless drumbeat of "Vista sucks" press coverage is pretty hard to ignore. Try to find a usability tester who hasn't read any of that coverage and doesn't already have a bias going in.

Bad experiences affect perceptions too. The negative reviews of Vista are in many cases grounded in painful reality. There's no doubt that bad drivers, bugs in Vista itself, and crappy OEM hardware configurations caused a lot of early adopters to have unpleasant experiences with Windows Vista. Those initial impressions affect perceptions in a fundamental, hard-to-shake way. Even a minor problem can be painful if you don't know the solution. If it requires indeterminate amounts of troubleshooting to figure out why something doesn't work the way it's supposed to, that can be a deal-breaker.

The older, established system has a built-in advantage. Switching to a new computing platform involves unlearning old ways and learning new procedures (just look at the advice offered to people switching from Windows to a Mac). Initial productivity will be lower on the new system.

Are you testing learnability or usability? One trap that usability professionals warn about is the danger of disproportionately crediting a product that has a great out-of-box experience but doesn't deliver over the long haul. Jeff Atwood offers an excellent summary of the issues, capped by this great quote from Joel Spolsky:

If you did a usability test of cars, you would be forced to conclude that they are simply unusable.

Faster isn't always better. Simply measuring productivity by seeing who finishes first doesn't necessarily give you the right answer either. In the hands of someone who knows a system well, even a terrible design can be highly efficient. I can be tremendously productive at a command prompt and can probably finish many tasks faster with command-line tools. But if you forced me to choose between a command-line interface and a GUI for daily work I would choose the latter every time. I don't miss MS-DOS.

Sometimes there is no right answer. I talked with a usability professional at Microsoft recently who described an all-too-common real-world dilemma. The interface designers had to decide how the up arrow should work in a particular feature. There were only two possible choices. The trouble is, usability testing proved conclusively that 50% of the test subjects thought it should work one way, and 50% thought it should work the other way. No matter which design you choose, half of your customers will think you designed an unintuitive interface.

Ultimately, for mainstream business use and everyday consumer scenarios, I think usability is the key to measuring how well a piece of hardware or software performs. The trouble is finding the right metrics to measure it.

I'm interested in your thoughts. Regardless of which computing platform you use, what aspects of usability are important to you? Leave your thoughts in the Talkback section.