Too much information stores up problems

Too much information is as bad as not enough, but this fundamental problem needs a new approach.

There is an unnamed corollary to Moore's Law which holds that, while hard disk storage capacity grows exponentially, the time taken to fill it stays constant. A year ago, I moved from a twenty-gig drive to an eighty: today, it is officially stuffed to the gunwales.

All my life, I've been battling with storage. I'm still in my thirties, but the problem has been with us for nearly fifty years, ever since IBM built the 305 RAMAC in 1956. That first hard disk system used fifty 24-inch diameter platters to serve up five megabytes: a quick check on my system shows I have 2,653 files larger than that, and I don't know what most of them are. Progress.

Information technology isn't very good at applying technology to information. It's fantastically good at acquiring data and stuffing it into cubbyholes -- but information is more than bits and bytes. Information is data in context: lose the context, and all you've got is gibberish and a full hard disk.

The problem is universal and has funded Larry Ellison's toy habit for decades. We are trying to force the entire world, with all its complexities and rough edges, into numbers that follow rules and models that match reality. Those with a taste for Borges will feel right at home with this fascinating and important problem, at once practical and philosophical, and it will only become more so as we move our society and economy into the digital age. I can't wait to see what happens next.

Meanwhile, however, my disk is full.

It is full because I have wholeheartedly subscribed to my role in the digital diaspora and now subsist on a sensorial diet of bits. My physical music collection gathers dust, my video is streamed, my personal and business communications are mediated through email and online voicemail. The stuff comes in at a megabit a second, and it never goes away. To deal with it, I have to know what it is.

There is a classic solution to managing information, first proposed by Harvard psychologist George A Miller at around the same time that IBM was building the 305 RAMAC. Called chunking, it's the principle that information can only be managed or communicated -- the two tasks are intimately intertwined -- by breaking it down into small sets of contextually related ideas. Miller looked at short-term human memory, and deduced that people can cope with around seven ideas at once. More than that and they have to throw something away in making a decision. These days, we'd call it cache management. It is a powerful and effective concept, and it has been totally ignored by software designers.

So what is a data collector to do? My approach to fixing my personal storage crisis is instinctively chunky. First step: work out the classes of file clogging my informational arteries. Temporary files, Zip files, stuff that can be downloaded again, files in formats for applications I no longer have: all can be located across the disk and disposed of without the right of appeal. How do I do this? A mixture of battering Windows' file finder to death and conjuring hand-rolled DOS batch files -- and yes, I know about Perl and refuse to let something that ugly into my psyche. I have problems enough.
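For anyone less attached to batch files, the first step above (sort the clutter into classes before deleting anything) might be sketched in Python along these lines. It is an illustration rather than the method used here: the class names, the extension lists in FILE_CLASSES and the functions classify, survey and summarise are all invented for the example, and it only reports what it finds. Nothing is deleted.

```python
import os
from collections import defaultdict

# Hypothetical classes and extensions, invented for this sketch.
# Edit them to match whatever is clogging your own disk.
FILE_CLASSES = {
    "temporary": {".tmp", ".temp", ".bak"},
    "archive": {".zip"},
}


def classify(path):
    """Return the class name for a path, or None if it fits no class."""
    ext = os.path.splitext(path)[1].lower()
    for name, extensions in FILE_CLASSES.items():
        if ext in extensions:
            return name
    return None


def survey(root):
    """Walk the tree under root and group (path, size) pairs by class."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            name = classify(full)
            if name is None:
                continue
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # unreadable or vanished file: skip it
            found[name].append((full, size))
    return found


def summarise(found):
    """Return one summary line per class: file count and total bytes."""
    lines = []
    for name in sorted(found):
        entries = found[name]
        total = sum(size for _path, size in entries)
        lines.append(f"{name}: {len(entries)} files, {total} bytes")
    return lines
```

Calling summarise(survey("C:\\")) would return a line per class, which is enough to decide what deserves the chop before anything irreversible happens.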

Then it's down to sorting out and de-duplicating my media files. The simplest of tasks -- find and report on identical music tracks -- is beyond anything I've tried. There are a lot of MP3 managers out there but none I've found will create the categories I want: duplicated files, the directories in which they occur, files in those directories which aren't duplicated. Simple classifications that let me, the human nominally in charge, control my information in a way I understand. No chance. Everything presents the data in ways the computer understands -- by artist field, by size, by bit-rate -- because that's much easier for the programmers. As for the user interfaces: there is an argument that IT's primary role in society is as occupational therapy for people who don't understand other people very well, and after days in shareware file management hell I consider it proven beyond doubt.
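The three categories asked for above (duplicated files, the directories in which they occur, and the files in those directories which aren't duplicated) can at least be sketched in a few dozen lines of Python. This is a minimal illustration under a big assumption: "identical" here means byte-for-byte identical, so the same track ripped twice at different bit-rates or tagged differently will not match. The function names and the default ".mp3" extension are made up for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root, extensions=(".mp3",)):
    """Return groups of byte-identical files found under root."""
    # Cheap first pass: only files of equal size can be identical.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(extensions):
                full = os.path.join(dirpath, filename)
                by_size[os.path.getsize(full)].append(full)
    # Second pass: hash only the candidates that share a size.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_hash[file_digest(path)].append(path)
    return sorted(sorted(paths) for paths in by_hash.values() if len(paths) > 1)


def duplicate_report(root, extensions=(".mp3",)):
    """Return (duplicate groups, their directories, non-duplicates per directory)."""
    groups = find_duplicates(root, extensions)
    duplicated = {path for group in groups for path in group}
    directories = sorted({os.path.dirname(path) for path in duplicated})
    unique = {}
    for directory in directories:
        candidates = (os.path.join(directory, name) for name in os.listdir(directory))
        unique[directory] = sorted(
            path for path in candidates
            if os.path.isfile(path)
            and path.lower().endswith(extensions)
            and path not in duplicated
        )
    return groups, directories, unique
```

The third return value is the one that matters when deciding which copy of an album to keep: a directory whose list of non-duplicated files is empty holds nothing that doesn't exist elsewhere.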

As so often, open-source software has the potential to make progress far faster here than traditional methods, which can charitably be considered to have failed. Closed teams of software engineers reliant on corporate resources and priorities cannot easily involve outside experts in human psychology: open software development, if it can tempt such animals into the game, has no such restrictions. There is no reason why online centres of excellence in understanding humans can't plug into the process.

To fix the storage management problem -- and many others besides -- open source should break free of the model of small teams of artisans doing one project. Instead, specialisations should evolve, capable of working on many projects sequentially and independently. It would look a bit like Henry Ford's moving assembly line. Cognitive experts can define a problem in sensible ways, programmers can apply their craft to the underlying logic and user interface gurus can make sure the results don't have the personality of a Neanderthal with a hangover. Once you've worked on your bit of the project, hand it on and do something else.

Managing this process, and looking after the ego and money requirements of all concerned, can be left to someone who wants to become immensely famous and hailed as the Ford of the 21st century. The job's yours if you want it.

Until then, how am I going to solve my personal storage problem? There is a 250GB hard disk sitting to one side of my PC: I'll fit it, copy everything across and reset the timer. Brawn wins over brains once again. Not much of a legacy for fifty years of pain.