Researchers at the Massachusetts Institute of Technology are building a prototype interactive database for big data applications that could deliver information from huge datasets almost immediately.
Data that needs to be analysed quickly is currently stored in a computer’s main memory, or dynamic random access memory (DRAM), but the datasets now being produced are too large for that.
The researchers are developing a system that uses multiple nodes across an Ethernet network.
“If we’re fast enough, if we add the right number of nodes to give us enough bandwidth, we can analyze high-volume scientific data at around 30 frames per second, allowing us to answer user queries at very low latencies, making the system seem real-time,” says Sang-Woo Jun, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.
“That would give us an interactive database.”
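The bandwidth reasoning in Jun's remark can be illustrated with some simple arithmetic. The sketch below estimates how many nodes would be needed so that their combined bandwidth covers a target frame rate. The function name, the idea that each frame touches a fixed number of bytes, and every figure used are assumptions made for illustration; none of them come from the BlueDBM work.

```python
def nodes_needed(bytes_per_frame, frames_per_second, node_bandwidth):
    """Smallest node count whose combined bandwidth meets the target rate.

    All arguments are hypothetical inputs: bytes read per frame, the
    target frame rate, and the bandwidth of one node in bytes per second.
    """
    required = bytes_per_frame * frames_per_second  # bytes per second overall
    # Integer ceiling division, so a partial node rounds up to a whole one.
    return -(-required // node_bandwidth)


# Hypothetical figures: 20 GB read per frame, 30 frames per second,
# and 1 GB/s of bandwidth per node.
print(nodes_needed(20 * 10**9, 30, 10**9))
```

With these made-up figures, 600 nodes would be required. The point is only that the node count scales linearly with both the data touched per frame and the frame rate.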
Currently, information tends to be stored on multiple hard disks on machines across an Ethernet network. However, this architecture increases the time it takes to access the information, Jun says.
“And if the data does not fit in DRAM, you have to go to secondary storage — hard disks, possibly connected over a network — which is very slow indeed.”
Jun, fellow CSAIL graduate student Ming Liu, and Professor Arvind, the Charles W. and Jennifer C. Johnson Professor of Electrical Engineering and Computer Science, have developed BlueDBM, or Blue Database Machine, a system based on a network of flash storage devices. It is to be presented in February at the International Symposium on Field-Programmable Gate Arrays in Monterey, California.
In BlueDBM each flash device is connected to a field-programmable gate array (FPGA) chip to create a node. The FPGAs not only control the flash device, but also perform processing operations on the data itself, Jun says.
“This means we can do some processing close to where the data is [being stored], so we don’t always have to move all of the data to the machine to work on it,” he says.
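The benefit of processing data close to where it is stored can be shown with a toy model. In the sketch below, each node either ships every record to the host for filtering, or filters locally and returns only the matches. The `Node` class, its methods, and the sample records are hypothetical stand-ins written in plain Python; they do not represent BlueDBM's actual interface or its FPGA logic.

```python
class Node:
    """Hypothetical storage node holding a slice of the dataset."""

    def __init__(self, records):
        self.records = list(records)

    def read_all(self):
        # Host-side processing: every record crosses the network.
        return list(self.records)

    def query(self, predicate):
        # Near-data processing: filter locally, return only the matches.
        return [r for r in self.records if predicate(r)]


def wanted(record):
    # Hypothetical query: keep records divisible by 500.
    return record % 500 == 0


nodes = [Node(range(0, 1000)), Node(range(1000, 2000))]

# Approach 1: move everything to the host, then filter there.
host_side = [r for n in nodes for r in n.read_all() if wanted(r)]
moved_host = sum(len(n.read_all()) for n in nodes)

# Approach 2: filter at each node and move only the results.
near_data = [r for n in nodes for r in n.query(wanted)]
moved_near = len(near_data)

print(host_side == near_data, moved_host, moved_near)
```

Both approaches return the same four records, but the first moves 2,000 records to the host while the second moves only four.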
FPGA chips can be linked together using a high-performance serial network, which has a very low latency, or time delay, meaning information from any of the nodes can be accessed within a few nanoseconds.
Using multiple nodes allows the team to get the same bandwidth and performance from their storage network as far more expensive machines, he adds.
The team has been working with data from a simulation of the universe generated by researchers at the University of Washington. The simulation contains data on all the particles in the universe, across different points in time.
“Scientists need to query this rather enormous dataset to track which particles are interacting with which other particles, but running those kind of queries is time-consuming,” Jun says. “We hope to provide a real-time interface that scientists can use to look at the information more easily.”