Personal digital archiving -- we need to fix the scale problem

Summary: We're living more of our lives in the digital realm. It makes sense to keep an archive of all that text, videographic, and photographic data. But there are some problems around scale...

Back in 1942, this is how we used to manage archives.

As we go through our lives, we tend to collect things. The easiest thing to collect -- the thing that we have onboard hardware to support from day one -- is memories.

Over time, this collection becomes an archive with physical manifestation. We collect knick-knacks and various bits and pieces in shoeboxes and other containers, kept for some moment when they contain some mundane information we need to access, or for times when we want to reflect on our lives.

As we change as a society to one that conducts more and more of our lives in the digital realm, the nature of that archive changes. It becomes trivial to collect every single email, tweet, status, document, photo, and video in one enormous, ever-growing personal digital archive.

By the time I die -- assuming that's not a particularly imminent event -- there's a good chance I will have collected multiple exabytes of data. And there's nothing special about me; we'll likely all end up collecting that volume of information.

But there's a problem.

If we look at textual data -- let's say email -- we're rather good at creating technologies that allow us to add to and maintain that archive. My email archive has some quarter-million messages in it, and it's only 5GB.

When we look at photographic and videographic data, the volume that we can create is vast. A small amount of video from one child's birthday party would be larger than an email archive collected over many years.

We're going to want to add more and more data with greater richness at greater volume into our archives, but the way that we build such systems today is at the wrong scale.


The first scale problem to consider is that domestic, PC-based storage with ropey or non-existent backup is very cheap. If we assume 30 minutes of HD video can be compressed down to 5GB, we can buy a 2TB drive for a PC for $100 and hold 200 hours of video.
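The 200-hour figure can be checked with a few lines of arithmetic. This is only a sketch of that calculation: the prices and sizes are the ones quoted above, while the use of decimal units (1TB = 1000GB) and the function name are assumptions made for illustration.

```python
# Figures quoted in the paragraph above.
GB_PER_HALF_HOUR = 5      # 30 minutes of compressed HD video
DRIVE_TB = 2
DRIVE_PRICE_USD = 100


def hours_of_video(drive_tb, gb_per_half_hour):
    """Hours of video that fit on a drive, assuming 1TB = 1000GB."""
    drive_gb = drive_tb * 1000
    half_hours = drive_gb / gb_per_half_hour
    return half_hours / 2


hours = hours_of_video(DRIVE_TB, GB_PER_HALF_HOUR)
print(f"{hours:.0f} hours of video on a {DRIVE_TB}TB drive")
print(f"${DRIVE_PRICE_USD / hours:.2f} of disk per hour of video")
```

On these numbers a domestic drive works out at about 50 cents of disk per hour of video, which is the baseline the cloud prices later in the piece are compared against.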

You can think of this problem in terms of water flowing down a series of tubes. A standalone digital video camera emits data at a certain water flow, and this is easily caught in cheap containers, represented here by low-cost, domestic-class PC storage.

We can store multiple terabytes locally, cheaply, but we don't want to do that. What if the house burns down, or we lose our only copy? As a point of principle, we need to keep our archives in the cloud and not be responsible for storing them locally. Although some of you may balk at that idea, consider this point -- your bank never loses your bank accounts. Microsoft, Google, Amazon, Apple, etc will never lose your personal archive for the same technical and philosophical reasons.

But when we actually get the data into the cloud, because cloud storage has to be enterprise-class (i.e. properly backed up, redundant, etc.), the size of container that you can buy with the same money is much smaller. This drives the cost up hugely.

Flickr recently announced that they were going to give everyone one terabyte of storage for free. Fine, but go up to two terabytes and it'll cost you $499.99 per year. This happens because Yahoo/Flickr hit the problem with enterprise cloud storage, even at the scale that it operates at.

Amazon's Glacier service offers much cheaper long-term archiving, based on the idea of using near-line, not online storage. Amazon will charge you about $122 per terabyte per year. But remember that you can buy a 2TB drive and keep it in a quasi-server in your house for $100. The cost of storing an individual archive needs to be something like $20 per year.
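To put those prices side by side, here is a rough sketch that annualises each option for a 2TB archive. The prices are the ones quoted above; the five-year drive lifetime is purely an assumption for illustration, and the local figure ignores backup, power, and the "quasi-server" itself.

```python
# Prices quoted in the article (USD); all figures are for a 2TB archive.
FLICKR_2TB_PER_YEAR = 499.99
GLACIER_PER_TB_PER_YEAR = 122
LOCAL_DRIVE_PRICE = 100          # one-off price of a 2TB drive
TARGET_PER_YEAR = 20

# Assumption, not from the article: the local drive is replaced every 5 years.
DRIVE_LIFETIME_YEARS = 5

costs = {
    "Flickr 2TB tier": FLICKR_2TB_PER_YEAR,
    "Amazon Glacier": GLACIER_PER_TB_PER_YEAR * 2,
    "Local 2TB drive": LOCAL_DRIVE_PRICE / DRIVE_LIFETIME_YEARS,
}

for name, per_year in costs.items():
    ratio = per_year / TARGET_PER_YEAR
    print(f"{name}: ${per_year:.2f} per year ({ratio:.1f}x the $20 target)")
```

Under that lifetime assumption, only the unprotected local drive lands near the $20 mark; the two cloud options come out at roughly 25 and 12 times the target.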

It's not immediately clear to me how to get around this side of the problem. Usually when I write a piece like this I have some idea of how to "square the circle". This time it's more of a "I think we'd all like to do this, but we don't seem to have any of the pieces we need". The costs involved in storage are just too high -- there's an opportunity here for someone to do something radical.


The second scale problem is bandwidth. The bandwidth of a truck rolling down the highway is unimaginably vast compared to data transmission over fibre optics. Take a truck, pack it full of disks full of data and you can get those bits from New York to San Francisco in a tiny fraction of the time you could do it with a fibre optic cable.

So when we come to put that new data in the cloud, we have a second problem. Our thick jet of data from the camera now has to pass down a capillary-sized tube into the cloud.

So now you have a camera squirting data at a high rate, dribbling that data through a tiny capillary, and then having multiple expensive containers at the other end to ultimately collect the data together in the cloud. It simply doesn't work.
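A back-of-the-envelope sketch shows how narrow that capillary is. The 1 Mbit/s upstream rate here is a hypothetical figure chosen for illustration, not one from this piece; the 5GB and 2TB sizes are the ones used earlier, in decimal units.

```python
def upload_seconds(size_gb, uplink_mbps):
    """Seconds needed to push size_gb (decimal GB) through an uplink of uplink_mbps."""
    bits = size_gb * 1e9 * 8
    return bits / (uplink_mbps * 1e6)


UPLINK_MBPS = 1  # hypothetical domestic upstream rate, not from the article

party_video = upload_seconds(5, UPLINK_MBPS)      # 30 minutes of HD video
full_drive = upload_seconds(2000, UPLINK_MBPS)    # a full 2TB drive

print(f"5GB video: {party_video / 3600:.1f} hours")
print(f"2TB drive: {full_drive / 86400:.0f} days")
```

At that assumed rate, half an hour of video takes around eleven hours to upload, and a full 2TB drive takes about half a year of continuous uploading. Changing `UPLINK_MBPS` scales both figures proportionally.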

In terms of the "getting the stuff up into the cloud" part, again I'm not sure we'll ever fix that. But the shape of a technical solution for that is more obvious.

We know that we have iPads and tablets as "post-PC" devices. What we need is a "post-domestic-server" device. Those people who like to tinker and build servers for keeping their archive locally are going to be happy enough, but there's a huge swathe of people out there who will want something as simple as an iPad for managing that archive.

The problem is that getting stuff from a standalone video camera to the cloud requires too much nursemaiding. If there were a device in the house that could just grab data from the camera, hold it temporarily and locally, and then drip feed it up to the cloud, that would probably be OK for the vast majority of people. Then they could use web-based tools to review and edit the data in the archive. We know we're good at that part.
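The behaviour of such a device can be sketched as a simple staging buffer: footage arrives from the camera in bursts, and a fixed amount trickles up to the cloud each day. Everything in this sketch (the function name, the daily upload budget, the sample week) is hypothetical and only illustrates the "hold it locally, then drip feed" idea, not any real product.

```python
def simulate_backlog(captures_gb, daily_upload_gb):
    """Return the locally held backlog (GB) at the end of each day."""
    backlog = 0.0
    history = []
    for captured in captures_gb:
        backlog += captured                        # grab new footage from the camera
        backlog -= min(backlog, daily_upload_gb)   # drip feed what the uplink allows
        history.append(backlog)
    return history


# Hypothetical case: one 5GB video on day one, then nothing, with 2GB uploaded per day.
print(simulate_backlog([5, 0, 0, 0], 2))  # [3.0, 1.0, 0.0, 0.0]
```

The point of the design is that the local buffer only needs to be big enough to ride out the bursts; as long as the average capture rate stays below the daily upload budget, the backlog drains on its own without anyone nursemaiding it.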


Personal digital archiving, I believe, is an idea that's about to come of age. Google Glass is a strange product, but for me at least this is a tool that is about lifelogging. Memoto is just one of what is sure to be many products like that coming down the line.

There's a huge opportunity here for someone to start putting dedicated, post-PC-era products out there in the market to help individuals and families create and manage their personal archives. The only wrinkle is that it's going to require some very radical thinking given how we build cloud-based systems today.


If you like the idea of digital personal archives, I was inspired to write this piece whilst reading Alastair Reynolds's excellent "House of Suns". Two elements of this book -- the life philosophy of Line members, and the technology involved in maintaining a "trove" -- seem to be a natural continuation of present-day ideas around lifelogging, and personal digital archives. 

Image credit: Wikimedia

Topics: Storage, Emerging Tech

  • Capacity and Bandwidth

    I can safely store a huge amount of data, all under my control and not subject to the whim of a cloud corporation, in my local bank vault (and I do just that). Given the dearth of upload bandwidth, storing, let's say, 2TB on a hard drive and driving it to the bank in the Subaru is easily an order of magnitude faster than sending it up the wire to the cloud. It is probably faster than that but I'll accept ten times faster. It is certainly cheap enough once the drives are purchased. Given the present state of the art and uncertainty about cloud providers' legal and moral responsibilities, I'll keep my data under my control on the six drives that cycle in and out of the vault.
    • Exactly

      I have four 1.5 to 2 TB drives for redundant backup at home and two small form factor drives that cycle in and out of the (free) safe deposit box at the bank. As for the author's idea that some big corporation will never lose your data (or give it to someone without your consent or knowledge) that is naive beyond belief. No system is perfect, but I trust my bank more than I trust the cloud.
    • Rain?

      The one advantage of the cloud (hopefully) is their backup of my "personal data".
      Upload it every night.
      But I also keep a copy on my local file server, and I access all the files from local storage.
      • Backup

        I mean that the data stored "in the cloud" is hopefully backed up by the service provider.
    • safe deposit boxes

      Safe deposit boxes are much cheaper than many might suspect them to be, and even the small ones can store several hard drives.
  • I have the answer

    I got it from Robin Harris on ZDNET.
    A few others have mentioned it from time to time.
    Most sheep on ZDNET keep droning on about the usual expensive incumbents like DROPBOX or DROBO or AMAZON or saying how wonderful SkyDrive is ... but haven't thought the problem through.

    The reason symform is vastly preferable to all other solutions is that it has the best ARCHITECTURE for the cloud.
    Indeed it is a proper cloud solution.
    Instead of grabbing all your data, violating your privacy and charging you megabucks ... symform implements a lightweight, cheap architecture.
    You won't get the same from the usual greedy incumbents because that doesn't retain technology efficiencies and maintain corporate revenue streams.

    Once you have the idea of RAID (designed in 1988) and cloud (say 2008?) ... you put the two together.
    Instead of moving all the data from the edge of the network - where it can be stored cheaply if somewhat insecurely - to a humungous datacentre and charging the earth ... you maintain an INDEX of the data securely in the cloud and implement RAID on the (cheap) edge.

    I know it's the right answer for a few reasons:
    - I think
    - I am not a sheep
    - the big incumbents, wanting a cloud for themselves, but not willing to pay business prices for enterprise kit ... use commodity components e.g. GOOG
    - MSFT know it's the right answer ...
    - ... so the people who told MSFT it was the right answer had to leave MSFT to found symform because MSFT couldn't make any money out of the right answer! (MSFT instead released crap like Windows Home Server and decent things like SkyDrive WITH CAPACITY LIMITATIONS)
    - I could buy HP Microservers in the UK for about $150 on special offer.

    symform is not ideal.
    The company isn't big enough (yet).
    I'd like a UNIX variant.
    I'd like ZFS!!
    I'd like a better UI ... and so would the typical consumer for your idea of a personal digital archive.

    Bandwidth isn't a problem ... or if it is the cloud isn't going to work either.

    The idea that 'the cloud' is some sort of fancy configuration of datacentres owned by a major corporation is wrong ...
    ... just like the idea that the Internet is owned by a company is wrong.
    Our home PCs (make that 'devices') are part of the cloud.
    We don't want monolithic, proprietary expensive designs owned by corporations; we want lightweight, flexible, cheap solutions ... in our hands and more importantly within our control!
    • More succinctly

      1. The scale problem is solved by keeping data at the edge of the network.
      2. The cost problem is solved by keeping the data at the edge of the network on commodity hardware.
      3. The security problem is solved by keeping an INDEX in the cloud.
      4. The privacy problem is solved by keeping the data encrypted when distributing via the cloud.
      5. The control problem is solved by keeping control!
      6. There is no bandwidth problem.
      7. The problem of avoiding GOOG, AMZN, MSFT, AAPL ... is solved by avoiding GOOG, AMZN, MSFT, AAPL.

      OK, we need more work on the UI ... that's not a difficult IT problem: the difficulty is that the incumbents can't make any money out of it! So we need to push them.
    • Hey johnfen!

      You could have made your point without calling people who you don't agree with "sheep" and putting them down.
      As soon as I read the first paragraph, I stopped reading.

      May I recommend "How to Win Friends and Influence People"?
      It applies to discussion boards, even when you're anonymous.
      • Re: without calling people who you don't agree with "sheep" and putting the

        Who did he call "sheep"? He didn't mention anybody.
  • You have overlooked something

    Yes, it is a problem, but yes, there are solutions.

    For example, CrashPlan offers a very reasonably priced UNLIMITED data family plan (10 computers).

    And of course, there's always the drive in the bank vault solution (though 2 or even 3 would be best).
  • Anything is better than nothing

    I retired from IT and fix friends' PCs as a hobby. I always tell them to have some kind of backup in place. Most do not and are so upset when their system dies -- especially the hard drive -- and I have to tell them their data is gone. The discussions in this post are very thought provoking, but most people don't have even a concept of backing up their data until it is too late.
  • Duplicate external drives

    I have two 1 TB external drives on which I store my larger, long-term backups. One is always offsite (similar to the gentleman above, but I store it at my mother-in-law's). It's not perfect and not always up to date but it is cheap.

    I do look forward to much cheaper, much larger cloud storage solutions but I won't ever completely trust them so I'll always have an external drive stored in the (detached) garage or somewhere, just in case.
  • And the next generation will....

    And the next generation will want for the days, months and years of poring over your data, your life, as though theirs is meaningless. In other words, so much stuff which overwhelms and will never be gone over is best kept aside, with a special place for say best of images and/or text. Most people do not have a terabyte's worth of life which would be all that interesting. Next generation may choose to hit the delete button ;) I see collections of a decade's worth of emails -- how many read those old emails?
  • Have you ever read your terms of service?

    "your bank never loses your bank accounts. Microsoft, Google, Amazon, Apple, etc will never lose your personal archive for the same technical and philosophical reasons"

    That is a very immature and uninformed opinion. Your terms of service for any consumer grade online storage will specifically state that you have no legal recourse if they lose your data.
  • Personal Digital Archive - Lifemap

    Great post and definitely not discussed enough in mainstream or tech media.
    We're building Lifemap, currently in beta, for many reasons you describe, and more.

    A Lifemap is a personal digital archive to permanently organize family memories into life stories and a lasting legacy.
    We designed Lifemap from the ground up to tackle some major issues we see affecting families in the not-so-distant future:
    - we're taking more photos than ever, everyone has a camera in their pocket. More videos, more journal/ diaries with tweets, status updates, etc. Not all are created equal and rarely curated and useful after creation.
    - we're storing and sharing across the ever-evolving and fragmented web and devices & many companies use your memories as loss leaders to sell you something and hold your memories behind their walled garden. Memories are time dependent and tech agnostic and should be platform agnostic and controlled by the user.
    - we have at most 2 or 3 generations that have photographic memories within our families, mostly in analogue and increasingly digital, something I believe our generation needs to protect and organize as we start to inherit them from grandparents and parents (our entire family and childhood memories). We have an eBeneficiary system that allows you to appoint a spouse or child to inherit your (organized) Lifemap when the time comes.
    - back-up is not archiving. Social is not archiving. To us an archive needs to be accessible, useful and liberate your memories, and having an entire lifetime of albums or folders for every recurring event each year + one-offs will become unwieldy (Halloween, Thanksgiving, Christmas, kids' birthdays, holidays, vacations, etc.) so we have designed our app for long-term intuitive organization specific to your life.
    An archive should be an archive and nothing more and the terms of use need to be written in favor of the users and promise continuity of the product by charging money and being accountable as a longterm custodian of your most valuable and emotional assets.


    We're releasing a completely redesigned and fully functional iPhone and iPad app next week as well as our premium plans. You can sign up to check out our beta and receive 2GB of free space to play with.

    Sorry for the length, but we're very passionate about providing a private sanctuary for families to sleep easy knowing that their memories are protected, accessible and organized for their whole life.

  • Because every memory matters

    I know I'm probably gonna get a lot of high-brow techie heat for this, but I want to put it out there anyway. This is all a lot of complicated talk about something so simple: PEOPLE WANT TO KEEP THEIR MEMORIES ALIVE.

    I think the issue is not with the available storage space in the Cloud or in our hardware, but more on the way people try to keep their memories. It's often an all-or-nothing storage approach. You back up ALL your data, you archive ALL your photos, you upload ALL your videos... I mean really, if we ALL store ALL of what we have, we're really gonna encounter ALL those issues that ALL of you have mentioned.

    I guess people could just try to get in the habit of logging/recording/keeping their memories piece by piece, and type by type. Some memories only need a photo or two to keep alive, others need a 2-minute video, some can be immortalized even by a simple sentence. What we need is a platform that lets us do that -- record our memories in not just one way but ALL ways.
