Why Apache Solr search is on the rise and why it's going solo

The idea of using Apache Solr as a standalone primary data store may not be new, but it is building momentum, according to Will Hayes, CEO of Lucidworks, the open-source search engine's commercial sponsor.
Written by Toby Wolpe, Contributor
Will Hayes: Using Solr more like a NoSQL store. Image: Lucidworks

Apache Solr has been thriving for years as a traditional enterprise search engine operating on top of relational databases and frameworks such as Hadoop. But it's Solr's secondary role as a standalone NoSQL store that's set to expand rapidly in 2015.

Until recently, only about a fifth of Solr deployments seen by Lucidworks, the open-source technology's commercial sponsor, involved the search engine being used in this emerging role, according to the company's CEO, Will Hayes.

"Most Solr deployments still are in complementing or extending other data stores. If you had a gun to my head, I'd probably say that still 65 to 70 percent of [Solr] search is deployed in a very traditional fashion," he said.

"But a year ago we would have seen maybe 20 percent of the pipeline consist of these next-generation, Solr-as-the-data-store [deployments]. Today about 50 percent of my pipeline going into 2015 is now search as a first-class data store for your non-traditional use cases.

"This is no longer intranet and knowledge base. It's really providing a data service across the enterprise. This wave, which really started in the past year and a half to two years, is growing rapidly."

Hayes said that in these cases Solr effectively serves as a data-access layer, handling key-value lookups while keeping the data fully indexed and searchable.
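
To make that dual role concrete, here is a minimal SolrJ sketch, assuming a hypothetical standalone 'users' core whose documents carry id, name and bio fields (SolrJ 5.x-era API; the URL and schema are assumptions, not details from Lucidworks). The same client answers an exact key lookup and a free-text search:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class KeyValueAndSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical standalone core used as the primary store.
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/users");

            // Key-value style lookup: fetch one record by its unique key.
            SolrQuery byKey = new SolrQuery("id:user42");
            byKey.setRows(1);
            QueryResponse kv = solr.query(byKey);
            for (SolrDocument doc : kv.getResults()) {
                System.out.println("lookup: " + doc.getFieldValue("name"));
            }

            // The same data is fully indexed, so free-text search works too.
            QueryResponse search = solr.query(new SolrQuery("bio:\"machine learning\""));
            System.out.println("matches: " + search.getResults().getNumFound());

            solr.close();
        }
    }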

He cited the example of one of the largest consumer clouds, which he declined to name, that uses Solr to store user information and preferences and synchronise services between devices, in the same way a traditional NoSQL store would be used for key-value lookups.

"All [an individual's] different preferences and settings - they're finding the need to make all of that searchable first and foremost. Obviously, then it needs to scale and the ability of the search engine to maintain stability across hundreds of millions of requests a day - in some cases these are exceeding a billion - is very attractive," Hayes said.

"Search becomes a first-class requirement in a lot of these deployments, which is very difficult to implement when you're dealing with more traditional NoSQL stores. It's pushing organisations to use the search engine more like a NoSQL store for that robustness and flexibility."

"So, in a particular service, if I were to log in, say, from my mobile device, update something in my calendar and then want to persist that and sync that back to, say, a laptop or another computing environment, the search engine can maintain that state between these devices and be used as that data store to drive it."

A key event in Solr's shift from traditional search engine to standalone data store was the October 2012 launch of SolrCloud, which brought flexible distributed search and indexing.

"Before SolrCloud, one of the issues with search was it did not scale across commoditised hardware the same way that these MongoDBs and these other technologies did," Hayes said.

"That was definitely an issue for organisations that were looking to distribute different environments and geographies, as well as just to take advantage of virtualisation and those kinds of things."

Another aspect of Solr proving attractive is that the governance model it enforces ensures the accuracy of records.

"With a lot of those platforms, it's a sort of a fuzzy match that says, 'Hey, you know I've got a distribution here and one record is saying this and another one says that, I'm simply going to go with the one with the latest time stamp'. That works in many cases but there's not a lot of governance there. There's no reliability," Hayes said.

However, SolrCloud's use of Apache ZooKeeper to handle distribution puts a governance model behind the data retrieval process to guarantee delivery.

"So when we talk about Solr, it's all your data, all the time at scale. It's not just a guess that we think is likely the right answer. 'We're going to go ahead and push this one forward'. We guarantee the quality of those results. In financial services and other areas where guarantees are important, that makes Solr attractive," Hayes said.

According to Lucidworks, the company contributes between 70 and 75 percent of any given release of Apache Solr and employs about a third of the active committers to the project.

What typifies the environments where Solr is being used as the primary data store is the high number of reads, according to Hayes.

"When you're thinking about a consumer service or you're going to be serving up hundreds of millions if not billions of requests a day, that becomes a very difficult problem in terms of scale," he said.

"This is where we've seen more and more people who are kicking the tyres with the search engine as being the primary data source. The difference with the NoSQL data sources in traditional roles is they're optimised around lots and lots of writes. You're constantly updating. This is why Facebook invented Cassandra, Because you're having to store all your status updates and these different types of transactions."

Consequently, it is financial services environments - where, for example, trading information is served up for use in high-frequency dealing or by hedge funds - that are using the search engine as the primary data store.

"They no longer want to persist it to other places and then index it. They want the search engine serving up those requests because they're dealing in such high volumes," Hayes said.

The Solr community has made major strides in strengthening the technology for these types of roles, and deep paging, available since early 2014, is one of the fruits of that effort.

"When you're doing more programmatic retrieval of data, you might want something off page 900,050. You want that to come back as quickly as something on the first page," Hayes said.

"That's why we came up with this approach of deep paging. Search wasn't optimised for this because that wasn't really a traditional use case. That was a big investment we made to harden Solr for that sort of deployment."

Being able to run analytics off search results in large, highly distributed environments has been another major area of activity. The goal was to enable distributed faceting - running analytics across multiple nodes and then merging the partial counts into a single accurate result.

"The beauty of doing it that way was we were able to take advantage of the governance model again, which makes us a lot more accurate than even the SQL data stores when it comes to running these types of aggregation across distributed environments," Hayes said.

Last September, Lucidworks launched its proprietary Fusion management tool, which sits on top of existing Solr apps and aims to help enterprises develop search software by adding more advanced features.

Along with machine learning for the auto-discovery of data trends and advances in log analysis, an important area for future Solr development lies in making it easier for subject matter experts to get involved in the data.

"In areas such as financial services, an analyst wants to come in and classify a series of events as being potentially fraudulent. They want to persist that discovery back into the engine, so that the engine can start to flag additional transactions that are coming in that are fraudulent," Hayes said.

"Our job is to build those interfaces that really enable those users to access that advanced functionality. So we have a lot of focus on workflows where we can bring anything from an e-commerce to a fraud use case and put a subject-matter expert in front of the data.

"Because that's who really can drive your data initiative forward: people who understand the meaning and the context. Our job in development is really to lower the bar to achieve these really advanced unique use cases."
