Space: the big data frontier

The Large Synoptic Survey Telescope (LSST) project will produce ten petabytes of data every year, advancing both our knowledge of the universe and the study of data science. The National Science Board will decide whether to approve LSST funding this month.
Written by Mari Silbey, Contributor
Rendering of the LSST facility at night by Todd Mason

The Large Synoptic Survey Telescope (LSST) project faces a major milestone this month. The National Science Board will hand down its decision on whether to fund the next phase of LSST construction as part of the 2014 fiscal budget. At issue are hundreds of millions of dollars, and the fate of a three-gigapixel camera designed to take nightly pictures of the sky. With the success of several critical project reviews in the last eight months, there is every indication that LSST will get the green light. But how much the project will tell us about our solar system, the dark energy problem and more will depend on how well we can process the information the telescope and its camera send back to us - an estimated ten petabytes of data per year.

Astronomy is a natural fit for big data science. The infinite frontier, the fact that data doesn't have to be protected because of privacy or financial concerns, and the wide scope of heterogeneous information that studies of the sky provide all make astronomy the perfect sandbox for big data discovery. They also expose the difficulties of managing big data and deriving useful conclusions from heaps of structured and unstructured information.

Astrophysicist and Chair of Information and Statistics for LSST Kirk Borne has surveyed the sky for his entire career, and he says we don't have a big data problem. Data storage isn't a problem. The volume of data isn't a problem. Our problem is pulling meaningful insights out of the data avalanche. It's an issue across every field where big data makes itself available. From the financial markets to homeland security and medical research, massive amounts of data are only valuable if we can form conclusions from our findings and effect beneficial change.

In the case of the LSST project, scientists have tasked themselves with four areas of study: an inventory of our solar system, mapping the Milky Way, understanding the transient universe, and discovering the impacts and possible causes of the dark energy phenomenon. Using this assignment list as a guide, scientists hope to uncover new truths about our dynamic universe and turn detailed photos of the night sky into signposts for future space exploration.

How big is LSST's big data?

Rendering of the LSST telescope from a side view

Astronomers will use the new synoptic telescope to view and photograph a quarter of visible space every night starting sometime early in the next decade. The nightly survey is designed so that scientists will get a record of every change that occurs between passes of the telescope's camera, including changes in the brightness and position of a wide variety of artifacts in space.

Says Borne:

Astronomers not only study the sky as individual objects one at a time, but also study the sky as a whole, which we call sky surveys... The idea with LSST that makes it unique from past surveys is that it's also a time-domain survey. We don't just take images of every patch [of sky] and be done with it; we take repeat images of every patch.

The synoptic telescope will take two pictures in one location every sixty seconds. It will use fifteen seconds to grab each image, and another five seconds each to read the images. There's another twenty seconds built into the timeline for processing change data, and, after one minute, the telescope will be in position to start the process all over again at its next location. The sequence will continue for twelve hours, creating a nightly round of cosmic cinematography.
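
The timing budget above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the figures quoted in this article; the function names are illustrative, and the assumption that the telescope runs for a full twelve hours with no downtime is a simplification.

```python
# Timing figures taken from the article's description of one telescope visit.
EXPOSURES_PER_VISIT = 2
EXPOSURE_SECONDS = 15
READOUT_SECONDS = 5
CHANGE_PROCESSING_SECONDS = 20
NIGHT_HOURS = 12


def visit_seconds() -> int:
    """Seconds spent at one sky location before the telescope moves on."""
    per_exposure = EXPOSURE_SECONDS + READOUT_SECONDS
    return EXPOSURES_PER_VISIT * per_exposure + CHANGE_PROCESSING_SECONDS


def visits_per_night() -> int:
    """Locations visited in one night, assuming no downtime (a simplification)."""
    return NIGHT_HOURS * 3600 // visit_seconds()


def exposures_per_night() -> int:
    """Individual images taken in one night under the same assumption."""
    return visits_per_night() * EXPOSURES_PER_VISIT


if __name__ == "__main__":
    print("seconds per visit:", visit_seconds())
    print("visits per night:", visits_per_night())
    print("exposures per night:", exposures_per_night())
```

Under these assumptions, the two exposures, two readouts and the processing window add up to the sixty-second cycle described above.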

Simulated LSST image of space

The data that the nightly survey will collect can be measured in several ways. Each photo will contain six gigabytes of information, which a data processing center will then read and process to create millions of near-real-time change alerts every night. After the first pass of the data is concluded, computers will then use the following twelve hours of daylight to do a deeper analysis of the information collected. Finally, the project will deliver data releases each year with a more comprehensive collection of information, including the highest resolution photos available, for broader scientific study. After ten years, LSST is expected to produce a total data volume after processing of several hundred petabytes.
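
To get a feel for the scale, the per-photo figure can be combined with the observing cadence reported in this article (two photos a minute for twelve hours). The sketch below is a rough, back-of-the-envelope estimate only: it assumes decimal units, no downtime, and a hypothetical number of observing nights per year. It counts raw images alone, so it is not expected to reproduce the article's yearly or ten-year totals, which the article does not break down.

```python
GB_PER_PHOTO = 6        # from the article
PHOTOS_PER_MINUTE = 2   # from the article
NIGHT_HOURS = 12        # from the article
GB_PER_TB = 1000        # assumption: decimal units
TB_PER_PB = 1000        # assumption: decimal units


def raw_tb_per_night() -> float:
    """Raw image volume for one night, in terabytes, assuming no downtime."""
    photos = PHOTOS_PER_MINUTE * 60 * NIGHT_HOURS
    return photos * GB_PER_PHOTO / GB_PER_TB


def raw_pb_over(nights: int) -> float:
    """Raw image volume, in petabytes, over a given number of observing nights."""
    return raw_tb_per_night() * nights / TB_PER_PB


if __name__ == "__main__":
    hypothetical_nights_per_year = 300  # illustrative value, not from the article
    print("raw TB per night:", raw_tb_per_night())
    print("raw PB per year:", raw_pb_over(hypothetical_nights_per_year))
```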

Because of what he saw on the horizon for astronomy, Borne began sounding the big data clarion call more than a decade ago. He recognized not just how much data new astronomy studies would produce, but also how much analysis would be required. The Sloan Digital Sky Survey, a study of the sky conducted between 2000 and 2008, has produced roughly twenty thousand academic papers in the years since, and the LSST project will do the equivalent of a full Sloan survey every three nights. There aren't enough graduate students and PhDs in the world to review all of that information.

From Borne:

Imagine if I gave [a] student 600,000 CDs of data, and I said come back tomorrow and I'll have 600,000 more for you. And come back the next day, and the next day, and the next day for the next ten years. Every one of those days I'll have 600,000 more for you.

The world of astronomy wasn't ready to hear Borne's warning when he first started sounding off about big data. However, computer scientists proved to be a more receptive audience. Many were already doing their own investigative work on big data and on how to process large volumes of information at rapid speed. For them, the LSST project represented a unique opportunity. Not only would the new telescope create a wealth of data, it would create a wealth of data that was freely available for study, without privacy or security restrictions. From that, computer scientists knew they could improve their own techniques for algorithmic analysis, and offer strategies in return for use in numerous academic fields facing impending data overload.

Challenges in big data analysis

Rendering of the LSST facility with the night sky by Todd Mason

The process for any data analysis involves reviewing raw input, comparing data points within a set, and correlating information with other data sets produced separately. With the LSST study, astronomers will, for example, use the telescope's nightly images to identify distant supernovae and gather more evidence of dark energy, and how it's causing the universe to accelerate its expansion. Collecting the information isn't enough, however. Astronomers will not only have to identify changes and patterns, but also put them in the context of other available data.

Unfortunately, there's a big logistical problem in managing so much information. The academic community doesn't exist in one place, which means LSST data will have to be shuttled around the world for analysis. The new synoptic telescope is being built on Cerro Pachon, a mountain in Chile, but the images it collects will be transported nightly to North America and up to the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. That will require a huge amount of bandwidth, and bandwidth is increasingly a scarce resource in our data-hungry world.

Moore's law, as it is commonly summarized, holds that computing power doubles every one and a half to two years. However, bandwidth isn't on course to increase at nearly the same pace. And that, says Borne, is a problem:

The number one big data challenge in my mind is the bandwidth problem; it's just having a pipe big enough to move data fast enough to whatever end-user application there is.

Borne's worry about the bandwidth dilemma is echoed by many other data scientists today. At a big data workshop just outside of Washington D.C. in June, Ian Foster, a senior scientist at the Argonne National Laboratory, highlighted the importance of data transport in virtually every field of scientific research, from metagenomics to climate science. While sufficient bandwidth for moving big data around may be available in a few select geographic regions, it certainly doesn't exist everywhere researchers reside. Partly in response to that problem, scientists have formed the Globus Alliance, a group of organizations working together to develop software services for distributed data and computing management. That alliance launched a service called Globus Online in 2010, which scientists now use to move large data sets securely from location to location. In less than two years, Globus Online has gained more than five thousand registered users, who have moved more than five petabytes of data.

At the same big data conference, Lucy Nowell, a computer scientist at the United States Department of Energy, hammered home the cost that bandwidth is imposing on the evolution of computing as a whole. She described power consumption as "eating us alive" from a budget perspective, with data transport dominating power costs. If we can't find solutions, the sheer expense of moving data around will significantly inhibit our ability to learn from the masses of information we collect.

Solutions in data science

There are no simple answers for the problems that data science is uncovering. However, two primary strategies have emerged for dealing with the bandwidth issue. First, both public and private institutions are working on improving network infrastructure to speed data delivery and improve research collaborations. The National Science Foundation recently committed twenty million dollars to the U.S. Ignite program, which is dedicated to proving the benefits of next-generation broadband networks. And the Gigabit Neighborhood Gateway program is following a public/private partnership model to try to develop a new base of communities across the United States with gigabit broadband networks.

Aside from improving infrastructure, the second approach to dealing with bandwidth scarcity is improving algorithmic analysis. As Neal Ziring, the technical director for the Information Assurance Directorate at the National Security Agency, put it at last month's big data event, individual computing installations will never be big enough to do everything we want, which means data will always be distributed. We need better algorithms for dealing with distributed data.

Algorithms help minimize the bandwidth issue of distributed data by reducing the amount of information that needs to be transported for study. For instance, instead of transferring an entire set of raw data, scientists can employ relatively simple algorithms to reduce data to a more manageable size. Algorithms can separate signal from noise, eliminate duplicate data, index information, and catalog where change occurs. Each of these data subsets is inherently smaller and therefore easier to transport than the raw data from which it emerges.
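
As a toy illustration of cataloging where change occurs, the sketch below compares two small brightness grids and keeps only the pixels that changed by more than a threshold. This is a minimal sketch, not LSST's actual pipeline: the function name, the grid representation and the threshold value are all assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[int]]
Change = Tuple[int, int, int]  # (row, column, brightness difference)


def catalog_changes(previous: Image, current: Image, threshold: int) -> List[Change]:
    """Return only the pixels whose brightness changed by more than `threshold`."""
    changes: List[Change] = []
    for r, (old_row, new_row) in enumerate(zip(previous, current)):
        for c, (old, new) in enumerate(zip(old_row, new_row)):
            diff = new - old
            if abs(diff) > threshold:
                changes.append((r, c, diff))
    return changes


if __name__ == "__main__":
    # Hypothetical 3x3 brightness grids for the same patch of sky.
    last_night = [[10, 10, 11], [10, 12, 10], [9, 10, 10]]
    tonight = [[10, 11, 11], [10, 60, 10], [9, 10, 10]]
    # Small fluctuations are treated as noise; only the large jump is reported.
    print(catalog_changes(last_night, tonight, threshold=5))  # [(1, 1, 48)]
```

Here nine pixel values shrink to a single change record, which is the kind of reduction that makes a data set cheaper to move.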

Improvements in algorithmic analysis are taking place everywhere, with teams of scientists dedicated to dealing with the bandwidth challenge, and to solving other big data conundrums like information classification and outlier detection. Borne is enthusiastic about the cooperation, and - even beyond issues of bandwidth management - he believes everyone can benefit from the joint efforts taking place. Outlier detection algorithms, for example, can be as valuable for detecting financial fraud or precursors to disease as they are for pinpointing objects and other phenomena in outer space.
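
To make the idea of outlier detection concrete, here is a minimal sketch of one common textbook approach, a modified z-score based on the median absolute deviation (MAD). It is an illustration only, not a method attributed to Borne or the LSST team; the function name, the 3.5 cutoff and the sample readings are assumptions chosen for the example.

```python
from statistics import median
from typing import List, Sequence


def find_outliers(values: Sequence[float], cutoff: float = 3.5) -> List[int]:
    """Return indices of values that sit far from the median, scaled by the MAD."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        # No measurable spread, so this simple sketch reports nothing.
        return []
    # The 0.6745 factor makes the score comparable to a z-score for normal data.
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > cutoff]


if __name__ == "__main__":
    # Hypothetical brightness readings with one sudden spike.
    readings = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0, 10.0]
    print(find_outliers(readings))  # [5]
```

The same function would work unchanged on a list of transaction amounts or lab measurements, which is the cross-field reuse described above.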

With the LSST project in development, there will be significant new opportunities to learn from and experiment with programming and analysis strategies, from defining information hierarchies to improving data visualization. That means the synoptic telescope will provide not only broad astronomy insights, but the potential for major data science discoveries as well. It offers a frontier for research across multiple disciplines, and an open look at what we can achieve using big data and supercomputing power.

The future of LSST

Members of the LSST team celebrating the successful casting of the telescope's 27.5-foot-diameter mirror blank - photo by Howard Lester

The synoptic telescope isn't expected to be operational until 2021, but private donors have ensured that work is progressing on the telescope even before public funding has been formally secured. Construction has started on the telescope's mirror, and if the National Science Board approves funding this month, the hope is that engineers will be able to start on the camera in 2014.

Says Borne:

I started working on this project seven years ago, and we thought we were close. Now seven years later, we're close for real.

Borne hopes that everything goes according to schedule so that he hasn't retired by the time the synoptic telescope is operational. The images from LSST will provide many lifetimes of study. While Borne will only be able to analyze a fraction of a fraction of the data, even that amount will be a reward for his decades of dedication to astronomy. LSST promises entirely new dimensions of discovery, both in astronomy and in data science. This is no small step for man. It could very well be a large leap for mankind.

All images courtesy of the LSST Corporation

This post was originally published on Smartplanet.com
