The embedded search investment

Search is more than a technical check mark for your product--it's an investment. You need to pay attention to the design of the embedded search platform itself, says Attivio's Andrew McKay.

"Embedded search" (also referred to as “OEM search”) is the integration or embedding of search technology into another software application that is itself a commercial product or service. In many ways, the search provider is an extension of the OEM’s R&D department, creating not just a commercial relationship, but a true partnership.

Embedded search: A history
One of the earliest vendors was AltaVista, and you have to go back more than ten years, to when it started, to really appreciate its legacy as an embedded technology. AltaVista was known mainly for its public web portal, but it was also a favorite search platform for embedding in other applications. Many software companies still use AltaVista for their search technology today and, despite its outdated functionality, have yet to find a viable alternative.

The history of embedded search has not been pretty. In the mid-1990s, Verity earned most of the market share by aggressively investing in its partner program, but this investment was always a struggle as a result of difficult customer demands and technology issues (e.g., scale). AltaVista joined the fray with a smaller, more agile alternative, but it had its own problems. Its rapid succession of owners (Digital, Compaq, CMGI, and Overture) and eventual split in two, with its source code going to Yahoo! and its toolkit and distribution rights going to Fast Search & Transfer (FAST), eventually killed it outright.

In 2005 Verity followed suit and fell on its sword by selling to its No. 1 rival, Autonomy. By then Verity had already lost much of its embedded business to FAST. Autonomy seemed to have little regard for the market, so FAST eventually took over. Now FAST appears to be exiting the OEM business, a decision made apparent by its sale to Microsoft.

A number of smaller search vendors recently jumped into the market, most notably Lucene, the open source search standard. No one has yet "taken over," however. Not surprisingly, AltaVista is still seen by many as a viable solution. The problem it has now is that its functionality is outdated and it has no product support: Yahoo! has clearly decided to shut down this line of business.

If you are one of the innovative software vendors who understood from the beginning that search is an important part of your product, you have probably been abandoned by several search vendors at this point. The response from the search community has not helped either. Safe harbor policies that entice you to move to yet another search technology trivialize your past investments in product strategy, development costs, and operational tasks such as training and support.

Search is an investment
It is an interesting phenomenon with search technology that if you simply sit on your current search implementation as is, your customers will perceive a gradual drop in the quality of the results. There are two reasons: your index continually grows in size, so there is more content to search, and users’ expectations of relevance increase as they become more familiar with their content and your system. This means you need a progressive search strategy that embraces constant innovation and improvement in areas such as query fuzziness, relevancy, entity extraction, clustering, and alerting.

For the vendor that embeds search the problem is more acute, because it is not a single organization that must contend with managing its content; it is every one of your customers. A single deployment of a search engine has the luxury of a dedicated search expert who can constantly tune the system for improvement. You, the application vendor, would need an expert for every one of your customers for this model to work, which is not a practical solution.

The answer lies both with the design of the embedded search platform itself and the support from the organization that develops it. Surprisingly, most of today’s enterprise search vendors struggle with this issue. Perhaps this is because they began their evolution as something else: web search, news search, ecommerce, etc. It is not easy to morph one’s architecture or operations to meet the requirements of the embedded customer. Lucene has a different problem: it lacks needed functionality. Adopters begin with the expectation that Lucene provides 80 percent of their solution when in fact it is more like 20 percent. To develop what you need, you end up becoming an enterprise search software company on top of your current business. Add to this the lack of consistent, timely support, testing, upgrades, and a legal corporate entity to go to if the product is not performing as it should, and you have a compelling argument for a commercial alternative.

Here are the functional requirements one should consider for a successful (and lasting) embedded search platform:

Flexibility--The enduring attraction of AltaVista is its flexibility. The real strength of an embedded search platform lies in its ability to tune behavior to the specifics of the application, which may range from the need for 100 percent precision and consistency (as in legal discovery) to web search-like environments where relevancy is king. This can only be had if the embedded search technology supports full programmatic manipulation, where the developer has total control of the product and can integrate it intimately with the application.

Platform Footprint--The embedded search platform must be frugal with resources. For instance, it should not occupy the majority of an application’s footprint on disk. Look for a vendor who can fit their core in something less than 20MB of disk without compromising features.

Content Density--The amount of content that is generated and stored within the enterprise is not decreasing. In fact it is growing at a faster rate each year. Your customers--not you--buy the hardware to support the search index, so packing as much content on a single server as possible (content density) and then maintaining linear growth is critical to your success. A good yardstick for content density would be 100 million standard-size documents (e.g., 5KB) per standard-size production server (e.g., 16GB memory, 4 CPUs) at 1000 documents indexed per second. Also, expect the index to be 20 to 40 percent of the size of the original source (and less if there is non-indexed content in the source). Look for vendors who are willing to publish their numbers. It usually means they can back them up.
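
To make the yardstick concrete, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation using the figures quoted above; the function name and the choice to treat 1KB as 1,000 bytes are assumptions made for illustration, not vendor numbers. Under those assumptions, one server holds roughly 500GB of source content, its index should land between about 100GB and 200GB, and a full index at the quoted rate takes a little under 28 hours.

```python
# Back-of-the-envelope sizing from the yardstick figures in the text.
# Assumption: 1KB = 1,000 bytes, chosen only to keep the numbers round.

DOCS_PER_SERVER = 100_000_000     # yardstick: documents per production server
AVG_DOC_BYTES = 5_000             # yardstick: 5KB "standard-size" document
INDEX_RATE_DOCS_PER_SEC = 1_000   # yardstick: documents indexed per second
INDEX_RATIO_RANGE = (0.20, 0.40)  # index size as a fraction of source size


def sizing(docs=DOCS_PER_SERVER, doc_bytes=AVG_DOC_BYTES,
           rate=INDEX_RATE_DOCS_PER_SEC, ratios=INDEX_RATIO_RANGE):
    """Return source size, expected index size range, and full-index time."""
    source_gb = docs * doc_bytes / 1e9
    low_ratio, high_ratio = ratios
    return {
        "source_gb": source_gb,
        "index_gb_low": source_gb * low_ratio,
        "index_gb_high": source_gb * high_ratio,
        "full_index_hours": docs / rate / 3600,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value:.1f}")
```

Plugging a vendor's published numbers into the same calculation is a quick way to compare them against the yardstick.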

Incremental Scalability--Linear scalability is a requirement, and it is a simple concept: if you need twice the capacity of the one-machine limit, you only buy one more machine. The implication is that the linearity extrapolates far enough out to support the most demanding requirements. A more subtle but important consideration is whether your customers must buy the hardware up front to accommodate the “theoretical full size” of the index. It would be better if they could avoid buying the hardware until they actually need it, and then add it to the system without having to re-index. This is incremental scalability.
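
The difference between the two ideas can be illustrated with a toy model. The sketch below is not how any particular search product works; `servers_needed` and `IncrementalCluster` are hypothetical names, and the "add a new server when the last one is full" policy is just one simple way to grow without touching documents that are already indexed.

```python
import math


def servers_needed(total_docs, docs_per_server=100_000_000):
    """Linear scalability: capacity grows in whole-server steps."""
    return max(1, math.ceil(total_docs / docs_per_server))


class IncrementalCluster:
    """Toy cluster that adds a server only when the existing ones are full.

    Servers keep the documents they have already indexed, so growth
    never triggers a re-index in this model.
    """

    def __init__(self, docs_per_server):
        self.docs_per_server = docs_per_server
        self.servers = [[]]  # each server is modeled as a list of doc ids

    def add(self, doc_id):
        if len(self.servers[-1]) >= self.docs_per_server:
            self.servers.append([])  # buy one more machine, and only now
        self.servers[-1].append(doc_id)

    def search(self, predicate):
        # A query fans out to every server and the hits are merged.
        return [d for server in self.servers for d in server if predicate(d)]


if __name__ == "__main__":
    cluster = IncrementalCluster(docs_per_server=3)
    for doc_id in range(7):
        cluster.add(doc_id)
    print(len(cluster.servers), "servers in use")
    print("first server still holds", cluster.servers[0])
```

In this model the customer starts with one machine and the first server's contents never move as the cluster grows, which is the property the requirement asks for.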

Ease of Use--It seems to be a prevailing assumption that ease of use and functional sophistication are mutually exclusive. This does not need to be the case, but both must be designed into the product from the beginning. Ease of use is more than simply providing graphic tools, although they are a necessary part of the product. For instance, while it may be true that most search vendors today support an API layer, a single API for all ingestion, query manipulation, and results processing is easier to use than two or three separate ones.

Ease of Deployment--You want a system that is up and running consistently in seconds, not hours, and can shut down just as easily and quickly. You want a system that can run anywhere in any environment (having it entirely written in Java is your safest bet). And you want a system where you don’t have to re-index. It is amazing how often this requirement gets overlooked when you consider the cost of downtime in the embedded environment: re-indexing large volumes can take weeks. This is why it is also important to support recovery, redundancy, and active (rather than passive) fail-over.

Comprehensive Search--Achieving the previous requirements means little if it compromises the search functionality itself. There are a number of vendors who provide simple keyword search capabilities who are by comparison small, efficient, and quickly deployed. But what do you get for it? Comprehensive search capabilities include more than keyword management. Here is functionality you should insist on:

• Fuzzy querying: spell check, stemming, stop words, phrase detection, proximity operators
• Probabilistic relevancy: TF-IDF field weighting, date, price, proximity, anti-phrasing
• Multiple relevancy profiles: user, geographic, content
• Automatic, dynamic facet (navigator) generation
• Auto-categorization, entity extraction
• Alerting and binding to actions
• Real-time processing
• Full security model
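
As a rough illustration of what TF-IDF relevancy with field weighting means in practice, here is a minimal sketch. It is not the scoring formula of any specific engine: the field names, the weights in `FIELD_WEIGHTS`, the whitespace tokenizer, and the smoothed `log(1 + N/df)` form of IDF are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

# Illustrative weights: a hit in the title counts three times a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def rank(query, docs, field_weights=FIELD_WEIGHTS):
    """Return (doc_id, score) pairs for matching documents, best first."""
    terms = set(tokenize(query))
    # Per-document term counts for each weighted field.
    counts = {
        doc_id: {f: Counter(tokenize(doc.get(f, ""))) for f in field_weights}
        for doc_id, doc in docs.items()
    }
    scores = {}
    for term in terms:
        # Document frequency: how many documents mention the term at all.
        df = sum(1 for c in counts.values()
                 if any(fc[term] for fc in c.values()))
        if df == 0:
            continue
        idf = math.log(1 + len(docs) / df)  # rarer terms weigh more
        for doc_id, c in counts.items():
            # Field-weighted term frequency.
            tf = sum(w * c[f][term] for f, w in field_weights.items())
            if tf:
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": {"title": "embedded search", "body": "notes on licensing"},
        "b": {"title": "release notes", "body": "search search improvements"},
        "c": {"title": "holiday schedule", "body": "office closed"},
    }
    for doc_id, score in rank("search", docs):
        print(doc_id, round(score, 3))
```

With these sample documents, the single title match in "a" outranks the two body matches in "b" because of the field weight. Multiple relevancy profiles, in this framing, would amount to swapping in a different set of weights per user, geography, or content type.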

Should you switch your current embedded search technology? The answer depends of course on whether or not your current technology is doing the job well. Do not get lulled into the complacency of, “If it ain’t broke, don’t fix it.” This does not work for a progressive search strategy, and you should have one. It is likely your competitors do. Take a look at the latest technologies, and if you do indeed switch, plan the process as if your product strategy depends on it.

Don’t fall prey to sales promotion gimmicks that promise you a transition program to their solutions. Search is more than a technical check mark for your product; it is strategic, and you want a search vendor that can support a long-term relationship with you. A good place to start is by selecting one who has designed their search technology from the ground up as an embedded search platform.

Andrew McKay is senior vice president of products for Attivio.