Austin, Texas-based startup Pilosa has announced the launch of the community edition of its "distributed bitmap index" aimed at dramatically improving querying speeds on datasets greater than 1TB without purchasing additional hardware.
Pilosa CEO Higinio Maycotte told ZDNet that read speeds are much slower than write speeds because advances in database capabilities over the last five years have largely centred on the ability to store data.
However, Maycotte noted that databases have two components: Storage and retrieval.
Pilosa has "liberated" the index -- which is used to run queries on datasets -- from the storage, creating a new type of bitmap index that runs in-memory rather than on disk.
Troy Lanier, VP of product, explained that because it's a bitmap index, Pilosa records relationships about the data rather than the data itself, and so is significantly smaller.
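To illustrate the general idea of a bitmap index, here is a minimal Python sketch. It is not Pilosa's implementation: the `BitmapIndex` class, its method names, and the sample ride records are all hypothetical, and plain Python integers stand in for bitmaps. Each (attribute, value) pair gets one bitmap, and bit i is set when record i has that value.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index (hypothetical, for illustration only).

    One bitmap per (attribute, value) pair; bit i is 1 when record i
    has that value. Python ints are used as arbitrary-length bitmaps.
    """

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def set_bit(self, attribute, value, record_id):
        # Record the relationship "record_id has attribute == value".
        self.bitmaps[(attribute, value)] |= 1 << record_id

    def bitmap(self, attribute, value):
        # .get avoids creating an empty entry for unseen values.
        return self.bitmaps.get((attribute, value), 0)

    @staticmethod
    def ids(bitmap):
        # Convert a bitmap back into the list of record ids it marks.
        out, i = [], 0
        while bitmap:
            if bitmap & 1:
                out.append(i)
            bitmap >>= 1
            i += 1
        return out


# Made-up sample records; the index stores only which record has which value.
rides = [
    {"passengers": 1, "payment": "card"},
    {"passengers": 2, "payment": "cash"},
    {"passengers": 1, "payment": "cash"},
    {"passengers": 1, "payment": "card"},
]

index = BitmapIndex()
for record_id, ride in enumerate(rides):
    for attribute, value in ride.items():
        index.set_bit(attribute, value, record_id)

# "Single passenger AND paid by card" is one bitwise AND of two bitmaps.
solo_card = index.bitmap("passengers", 1) & index.bitmap("payment", "card")
matching = BitmapIndex.ids(solo_card)
```

In this sketch a query that combines two conditions never touches the ride records themselves; it only combines two integers, which is why an index of this kind can be much smaller and faster to scan than the underlying data.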
"Computers work in 1s and 0s, so [Pilosa] really boils it down to the base unit that computers work in," Lanier said.
Maycotte emphasised that Pilosa is not looking to "displace" anything; rather, it sits on top of data stores.
"We turn data into a bitmap index that sits on top of whatever storage system it originally went into without replacing it. By doing so, instead of querying the old system, you query Pilosa, and Pilosa responds in exponentially less time than an underlying database would," he added.
Pilosa uses a separate API endpoint for importing, so users can write code that reads data out of Cassandra, or pulls a large dataset of IP addresses off the wire, and pushes the data into the Pilosa endpoint.
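As a rough sketch of what such client code might look like, the Python below turns a stream of network events into integer (row, column) pairs and groups them into batches. Everything here is an assumption made for illustration: the event layout, the choice of destination port as row and source IP as column, the batch size, and the `send_batch` function, which is a hypothetical stand-in for a call to an import endpoint and makes no network request.

```python
import ipaddress
from itertools import islice


def to_pairs(events):
    """Yield (row_id, column_id) pairs from (source_ip, dest_port) events.

    Assumed mapping: row = destination port, column = source IP as an integer.
    """
    for src_ip, dst_port in events:
        yield dst_port, int(ipaddress.ip_address(src_ip))


def batches(pairs, size):
    """Group an iterable of pairs into lists of at most `size` items."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


sent = []


def send_batch(chunk):
    """Hypothetical placeholder for pushing one batch to an import endpoint."""
    sent.append(chunk)


# Made-up events standing in for data read from a store or off the wire.
events = [("10.0.0.1", 443), ("10.0.0.2", 22), ("192.168.1.5", 443)]

for chunk in batches(to_pairs(events), size=2):
    send_batch(chunk)
```

Batching keeps the number of requests low when the source is large; the real request format and endpoint would come from Pilosa's own documentation.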
The company said it supports all data sources and is developing an SDK to help the open-source community write connectors to their chosen data sources.
While a bitmap index has historically been used on low-cardinality columns -- that is, columns with few unique values -- Pilosa claims it has found a way to optimise the old and largely abandoned technology so that it can be used on high-cardinality data, at scale.
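A back-of-the-envelope calculation shows why cardinality matters for a naive bitmap index. The Python sketch below is not a description of how Pilosa solves the problem, which the company has not detailed here; the function names, the record count, and the 32-bit position width are assumptions. It compares uncompressed bitmaps, which need one bit per record for every distinct value, with simply listing the positions of the set bits.

```python
def dense_bits(num_records, cardinality):
    """Bits for uncompressed bitmaps: one num_records-wide bitmap per distinct value."""
    return num_records * cardinality


def sparse_bits(num_records, id_width=32):
    """Bits if each bitmap stores only the positions of its set bits.

    Each record sets exactly one bit in a column, so the column holds
    num_records positions in total, whatever its cardinality.
    """
    return num_records * id_width


num_records = 1_000_000                  # assumed dataset size
low = dense_bits(num_records, 2)         # e.g. a two-value column
high = dense_bits(num_records, 100_000)  # e.g. a column with 100,000 distinct values
sparse = sparse_bits(num_records)
```

Under these assumptions the two-value column needs 2 million bits (250KB), the 100,000-value column needs 100 billion bits (12.5GB) uncompressed, and position lists need 32 million bits (4MB) regardless of cardinality. Dense bitmaps win for the low-cardinality column and lose badly for the high-cardinality one, which is the trade-off any high-cardinality bitmap index has to address.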
The company recently analysed 1.3 billion cab rides in New York City -- each record had 100 different attributes such as time, distance, and route -- to pinpoint the fastest pick-up destinations during rush hour. Results came back in less than two seconds, according to Pilosa.
"We think it's one of the fastest database access tools on Earth," Maycotte said.
According to Pilosa, its distributed bitmap index can make a terabyte of data respond to queries as if it were just 10 megabytes. The startup claims no test has exceeded 1.8 seconds, with most queries returning in less than a second.
Pilosa noted that traditional software such as Elasticsearch and Neo4j "are great, but they fail at scale", typically breaking down at about 500GB.
While many data scientists or analysts who work with big data try to address this by condensing their data or using expensive hardware, Pilosa said "you can only go so far to optimise storage".
"There is some really exotic hardware out there that's becoming popular like GPUs and FPGAs (Field-Programmable Gate Arrays). We were able to approximate the speeds of GPUs using commodity cloud hardware and I think the benchmark is somewhere around 2 billion edges per second using just typical servers from Amazon," Maycotte said.
Pilosa also has an API that allows users to incorporate streaming data while executing queries across existing data.
The company believes what it has created will allow data scientists to put their skills to better use.
"It's unfortunate that data scientists today spend so much of their time cleaning up data. They work off samples because it's hard to deploy the infrastructure they need to work on entire datasets," Maycotte explained.
Pilosa is looking to target three verticals: Smart cities, bioinformatics, and information security.
Maycotte said he is particularly excited about the last of these.
"Technology is not keeping up with the rate of attacks and harmful agents out there. We think technologies like Pilosa will allow machine learning algorithms and artificial intelligence to operate on entire transaction flows, on the continuous arrival of data, on massive historical datasets so that they can be as accurate and fast as possible," Maycotte said.
"Today, [scientists are] not able to leverage all of the historical data because it's just too big."
While Pilosa plans to focus on its "community edition", available on GitHub, for the next 18 to 24 months, enterprise and cloud editions are in the pipeline.
The enterprise edition will have additional security and administrative controls, as well as custom integrations to proprietary data sources. Pricing has not been determined for the enterprise edition, but will likely revolve around volume, the company said.
The cloud edition, which Maycotte said he's particularly excited about, will mean that users will not have to own, deploy, or manage any infrastructure.
"Imagine a world where these massive databases take a HDFS [Hadoop Distributed File System] data lake, for example, that is storing massive amounts of data and it is just really slow to get that data out. A developer, an engineer, or a data scientist can just go and add a few lines of code, log into a console, and watch the Pilosa index build in the cloud edition," he said. "Within a few hours, they can have instantaneous access to the data. Their machine learning models, their applications, will have ready access to the underlying data."
Spun out of Umbel, a data management platform for sports and entertainment companies, Pilosa's technology is three years in the making, though the company was founded in mid-January this year.
No capital has been raised to date, though Pilosa is open to the prospect of doing so later down the track.