Personal digital archiving -- we need to fix the scale problem

We're living more of our lives in the digital realm. Makes sense to keep an archive of all that text, videographic, and photographic data. But there are some problems around scale...
Written by Matt Baxter-Reynolds, Contributor
Back in 1942, this is how we used to manage archives.

As we go through our lives, we tend to collect things. The easiest thing to collect -- the thing that we have onboard hardware to support from day one -- is memories.

Over time, this collection becomes an archive with a physical manifestation. We collect knick-knacks and various bits and pieces in shoeboxes and other containers, kept for some moment when they contain some mundane information we need to access, or for times when we want to reflect on our lives.

As we change as a society to one that conducts more and more of our lives in the digital realm, the nature of that archive changes. It becomes trivial to collect every single email, tweet, status, document, photo, and video in one enormous, ever-growing personal digital archive.

By the time I die -- assuming that's not a particularly imminent event -- there's a good chance I will have collected multiple exabytes of data. And there's nothing special about me; we'll likely all end up collecting that volume of information.

But there's a problem.

If we look at textual data -- let's say email -- we're rather good at creating technologies that allow us to add to and maintain that archive. My email archive has some quarter-million messages in it, and it's only 5GB.

When we look at photographic and videographic data, the volume that we can create is vast. A small amount of video from one child's birthday party would be larger than an email archive collected over many years.

We're going to want to add more and more data with greater richness at greater volume into our archives, but the way that we build such systems today is at the wrong scale.


The first scale problem to consider is that domestic, PC-based storage with ropey or non-existent backup is very cheap. If we assume 30 minutes of HD video can be compressed down to 5GB, we can buy a 2TB drive for a PC for $100 and hold 200 hours of video.
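The 200-hour figure is easy to check with a few lines of Python. This is only a sketch: it assumes decimal units (2TB taken as 2,000GB), and the function name is made up for illustration.

```python
def hours_of_video(drive_gb: float, gb_per_half_hour: float = 5.0) -> float:
    """Hours of video that fit on a drive, given the size of 30 minutes of footage."""
    half_hours = drive_gb / gb_per_half_hour
    return half_hours / 2


# A 2TB drive, taken as 2,000GB (decimal units, an assumption).
print(hours_of_video(2000))  # 200.0
```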

You can think of this problem in terms of water flowing down a series of tubes. A standalone digital video camera emits data at a certain flow rate, and this is easily caught in cheap containers, represented here by low-cost, domestic-class PC storage.

We can store multiple terabytes locally, cheaply, but we don't want to do that. What if the house burns down, or we lose our only copy? As a point of principle, we need to keep our archives in the cloud and not be responsible for storing them locally. Although some of you may balk at that idea, consider this point -- your bank never loses your bank accounts. Microsoft, Google, Amazon, Apple, etc will never lose your personal archive for the same technical and philosophical reasons.

But when we actually get the data into the cloud, because cloud storage has to be enterprise-class (i.e. properly backed up, redundant, etc.), the size of container that you can buy with the same money is much smaller. This drives the cost up hugely.

Flickr recently announced that they were going to give everyone one terabyte of storage for free. Fine, but go up to two terabytes and it'll cost you $499.99 per year. This happens because Yahoo/Flickr hit the problem with enterprise cloud storage, even at the scale that it operates at.

Amazon's Glacier service offers much cheaper long-term archiving, based on the idea of using near-line, not online storage. Amazon will charge you about $122 per terabyte per year. But remember that you can buy a 2TB drive and keep it in a quasi-server in your house for $100. The cost of storing an individual archive needs to be something like $20 per year.
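To put those prices side by side, here is a rough comparison against the $20-per-year target. It assumes a 2TB archive (the article does not fix a size) and uses only the prices quoted above. The $100 local drive is left out because it is a one-off purchase rather than an annual fee.

```python
ARCHIVE_TB = 2.0          # assumed archive size, chosen to match the 2TB drive
TARGET_PER_YEAR = 20.0    # the target annual price suggested in the text

# Annual cost in US dollars for an archive of ARCHIVE_TB, from the quoted prices.
annual_cost = {
    "Flickr (2TB tier)": 499.99,
    "Amazon Glacier": 122.0 * ARCHIVE_TB,
}


def multiple_of_target(cost: float, target: float = TARGET_PER_YEAR) -> float:
    """How many times more expensive a price is than the target."""
    return cost / target


for name, cost in annual_cost.items():
    print(f"{name}: ${cost:.2f}/year, {multiple_of_target(cost):.1f}x the target")
```

On these assumptions, even the cheaper of the two cloud options comes out at roughly twelve times the target price.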

It's not immediately clear to me how to get around this side of the problem. Usually when I write a piece like this I have some idea of how to "square the circle". This time it's more of an "I think we'd all like to do this, but we don't seem to have any of the pieces we need". The costs involved in storage are just too high -- there's an opportunity here for someone to do something radical.


The second scale problem is bandwidth. The bandwidth of a truck rolling down the highway is unimaginably vast compared to data transmission over fibre optics. Take a truck, pack it full of disks full of data and you can get those bits from New York to San Francisco in a tiny fraction of the time you could do it with a fibre optic cable.

So when we come to put that new data in the cloud, we have a second problem. Our thick jet of data from the camera now has to pass down a capillary-sized tube into the cloud.

So now you have a camera squirting data at a high rate, dribbling that data through a tiny capillary, and then having multiple expensive containers at the other end to ultimately collect the data together in the cloud. It simply doesn't work.
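A back-of-the-envelope sketch shows how lopsided the jet and the capillary are. It reuses the earlier figure of 5GB per 30 minutes of video. The 1 Mbit/s home uplink is purely a hypothetical number chosen for illustration, so plug in your own.

```python
def data_rate_mbit_s(gigabytes: float, seconds: float) -> float:
    """Average data rate in megabits per second (1GB taken as 8,000 megabits)."""
    return gigabytes * 8000 / seconds


def upload_hours(gigabytes: float, uplink_mbit_s: float) -> float:
    """Hours needed to push a given volume of data through an uplink."""
    return gigabytes * 8000 / uplink_mbit_s / 3600


# The camera: 5GB produced over 30 minutes of recording.
camera_rate = data_rate_mbit_s(5, 30 * 60)

# Hypothetical home uplink speed; an assumption, not a figure from the article.
ASSUMED_UPLINK_MBIT_S = 1.0

print(f"camera output: {camera_rate:.1f} Mbit/s")
print(f"uploading 30 minutes of video: {upload_hours(5, ASSUMED_UPLINK_MBIT_S):.1f} hours")
```

With that assumed uplink, the camera produces data about 22 times faster than the connection can carry it away, so half an hour of footage takes around 11 hours to upload.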

In terms of the "getting the stuff up into the cloud" part, again I'm not sure we'll ever fix that. But the shape of a technical solution for that is more obvious.

We know that we have iPads and tablets as "post-PC" devices. What we need is a "post-domestic-server" device. Those people who like to tinker and build servers for keeping their archive locally are going to be happy enough, but there's a huge swathe of people out there who will want something as simple as an iPad for managing that archive.

The problem is that getting stuff from a standalone video camera to the cloud requires too much nursemaiding. If there were a device in the house that could just grab data from the camera, hold it temporarily and locally, and then drip feed it up to the cloud, that would probably be OK for the vast majority of people. Then they could use web-based tools to review and edit the data in the archive. We know we're good at that part.
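The "hold it locally, then drip feed it" behaviour could be modelled roughly like this. It is a toy sketch, not a design for a real device: the Spool class and its methods are hypothetical, nothing is actually uploaded, and the "budget" simply stands in for however much the uplink can carry in one session.

```python
from collections import deque


class Spool:
    """Holds clips locally and releases them towards the cloud a little at a time."""

    def __init__(self) -> None:
        # Each entry is [clip name, bytes still to send], oldest clip first.
        self._queue = deque()

    def add(self, name: str, size_bytes: int) -> None:
        """Accept a clip from the camera and queue it for upload."""
        self._queue.append([name, size_bytes])

    def drip(self, budget_bytes: int) -> list[str]:
        """Send up to budget_bytes; return the names of clips fully uploaded."""
        finished = []
        while budget_bytes > 0 and self._queue:
            clip = self._queue[0]
            sent = min(budget_bytes, clip[1])
            clip[1] -= sent
            budget_bytes -= sent
            if clip[1] == 0:
                finished.append(clip[0])
                self._queue.popleft()
        return finished

    def pending_bytes(self) -> int:
        """Bytes still waiting on the local device."""
        return sum(remaining for _, remaining in self._queue)


spool = Spool()
spool.add("party.mp4", 5_000)
spool.add("beach.mp4", 3_000)
print(spool.drip(6_000), spool.pending_bytes())  # ['party.mp4'] 2000
print(spool.drip(6_000), spool.pending_bytes())  # ['beach.mp4'] 0
```

The point of the sketch is that the camera can hand over its data at full speed, while the slow trickle to the cloud happens later without anyone having to supervise it.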


Personal digital archives, I believe, are an idea that's about to come of age. Google Glass is a strange product, but for me at least this is a tool that is about lifelogging. Memoto is just one of what is sure to be many products like that coming down the line.

There's a huge opportunity here for someone to start putting dedicated, post-PC-era products out there in the market to help individuals and families create and manage their personal archives. The only wrinkle is that it's going to require some very radical thinking given how we build cloud-based systems today.


If you like the idea of digital personal archives, I was inspired to write this piece whilst reading Alastair Reynolds's excellent "House of Suns". Two elements of this book -- the life philosophy of Line members, and the technology involved in maintaining a "trove" -- seem to be a natural continuation of present-day ideas around lifelogging and personal digital archives.

Image credit: Wikimedia
