
Beyond findability: The search for active intelligence

It took a decade of indexing advances such as skip lists and index compression to make indexing practical, and another decade of computing advances to give us billions of searchable documents, says Attivio's Jonathan Young.

Commentary--It seems as though there is a watershed event in the search industry every ten years or so. Although Lexis-Nexis first commercialized search in the 1970s, it took a decade of indexing advances such as skip lists and index compression to make indexing practical, and another decade of computing advances to give us billions of searchable documents on the Internet.

Ten years ago, Google totally changed the face of what was then the emerging concept of Web search using better ranking algorithms based on website popularity. This precipitated the bifurcation of the search market into two segments: Web search and enterprise search. As the Web search space came to be dominated by Google, the old guard (e.g. Verity, Autonomy, AltaVista, and FAST) turned to enterprise search.

Google brought two changes to the industry. First, it raised the standard and importance of ease of use for the end user. By establishing the search box standard for the Web, it set the bar for enterprise search as well. Enterprise search in general does not have a strong track record: stories of multi-year deployments, cost overruns, and dissatisfied users are rife in the community.

Second, Google brought search to the forefront of information access as a strategy in the enterprise. The question, “Why can’t I find information in my company as easily as I can find it on Google?” became all too common, echoing from the cubicles of knowledge workers. The average knowledge worker uses Google search more than their own internal knowledge applications. Perhaps not surprisingly, there are signs that the enterprise search market is entering another watershed:

• Significant dissatisfaction with current solutions: according to AIIM, 85 percent of respondents in a recent survey said findability is significantly critical to their organization’s goals and success, yet 85 percent also said less than 50 percent of their enterprise information is searchable online.

• Increased innovation: especially in data transformation, analysis, and visualization.

• Unmet potential: current estimates state that more than 80 percent of corporate information assets reside in unstructured content sources, including contracts, white papers, research documents, emails, PDFs, and beyond, yet 75 percent of IT resources are dedicated to structured data.

• Increased mergers and acquisitions: business intelligence and data warehousing (BI/DW) vendors are purchasing search and analytics technologies (SAP/Pilot) while search vendors are buying analytical applications (Autonomy/Zantaz). Some of the largest acquisitions in the software industry have occurred in the last 15 months (SAP/Business Objects, IBM/Cognos, Microsoft/FAST).

• Industry convergence: analysts are heralding the convergence of enterprise search with business intelligence as the ultimate answer to unified information access.

• Legacy architectures: attempts to bend search engines to accept structured data and to force databases to understand unstructured content have both proven to be failed approaches. Most legacy enterprise search engines (such as the AltaVista toolkit) were built from the same code base as the web search engines, and until recently, the fundamental technology used in search engines had not changed for decades.

The basic approach is still based on matching words typed into the search box against words in documents in the index. Incremental improvements to document retrieval and ranking have slowly been added, including:

• TF/IDF ranking (better documents contain more of the rare words in a query)
• Boolean filtering (Boston +Celtic +Music -Celtics)
• Stop word elimination (common words don't help search)
• Query expansion (using synonyms or related terms)
• Custom results ranking (e.g. by freshness, availability, or profit margin)

These techniques have generally produced small, marginal improvements in information retrieval performance in the research lab, but whether they really improve the search experience for the end user is open to question.
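To make the first three techniques in the list above concrete, here is a minimal Python sketch that combines stop word elimination, Boolean filtering, and TF/IDF ranking. The stop word list, the sample documents, and the function names are assumptions made for illustration. The query syntax follows the Boston +Celtic +Music -Celtics example, and the score is plain term frequency times log inverse document frequency, which is one of many TF/IDF variants rather than the formula of any particular engine.

```python
import math
from collections import Counter

# Illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

# Hypothetical sample documents.
DOCS = {
    "d1": "Celtic music festival in Boston",
    "d2": "Boston Celtics win the game",
    "d3": "Celtic music and dance traditions",
    "d4": "Boston harbor tour",
}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def parse_query(query):
    """Split a query into optional, required (+term), and excluded (-term) terms."""
    optional, required, excluded = [], set(), set()
    for raw in query.lower().split():
        if raw.startswith("+"):
            required.add(raw[1:])
        elif raw.startswith("-"):
            excluded.add(raw[1:])
        elif raw not in STOP_WORDS:
            optional.append(raw)
    return optional, required, excluded

def search(query, docs):
    """Return (score, doc_id) pairs, best first, for documents passing the Boolean filter."""
    optional, required, excluded = parse_query(query)
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    results = []
    for doc_id, toks in tokens.items():
        present = set(toks)
        # Boolean filter: every +term must be present, no -term may be present.
        if not (required <= present) or (excluded & present):
            continue
        tf = Counter(toks)
        # Rare terms (low document frequency) contribute more to the score.
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in sorted(required) + optional if df[t])
        results.append((score, doc_id))
    return sorted(results, reverse=True)

ranked = [doc_id for _, doc_id in search("Boston +Celtic +Music -Celtics", DOCS)]
print(ranked)  # ['d1', 'd3']
```

In this toy corpus the basketball story is removed by the filter, and the Boston festival outranks the general music article only because it also contains the optional term.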

Changes in the search landscape
Competition in the enterprise search space has led to many novel, often wasteful, and sometimes confusing features. Most vendors now agree that scalability is a key requirement, but traditional search solutions require that the entire index be rebuilt each time additional resources are added to the configuration. This approach not only requires an infallible crystal ball, it ties up precious system resources. For true linear scalability, you must be able to add hardware as needed without impacting the running system, resulting in significantly lower total cost of ownership (TCO).

Also contributing to the historically high TCO of search is the cost of achieving good relevancy. It is commonplace to expect poor relevancy for an out-of-the-box installation of a legacy search engine. The cost of tuning the system is generally quite high, requiring significant adjustment of numerous configuration parameters. Obtaining a good relevance profile still remains largely a black art. Only recently has a new generation of search technologies enabled good out-of-the-box relevancy.

What about features that are touted to improve findability, such as “conceptual search” or “semantic search”? One version of semantic search involves using a large matrix (representing the frequency of terms in documents) to find latent semantic dimensions in the content, and then approximating documents by points in this alternative “semantic dimensional space”. The computational demands of this approach can be high, both when building the index and when matching documents to queries, and the results from searches can be unpredictable and even unrepeatable.
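The matrix approach described above is commonly known as latent semantic analysis. The following pure-Python sketch shows the idea on a tiny, hypothetical corpus. It builds a term-by-document count matrix and then finds the top latent dimensions by power iteration with deflation on the document-by-document Gram matrix. That is a simple stand-in for the singular value decomposition a production system would obtain from a numerical library, and all names and sample sentences here are assumptions.

```python
import math
import random

# Hypothetical corpus: three documents about cars, two about gardens.
CORPUS = [
    "car engine repair",
    "car engine oil",
    "automobile engine repair",
    "flower garden rose",
    "garden rose petals",
]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenpairs(sym, k, iters=300, seed=0):
    """Top-k eigenpairs of a symmetric positive semidefinite matrix."""
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = normalize([rng.random() + 0.1 for _ in range(n)])
        for _ in range(iters):
            v = normalize(matvec(m, v))
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        # Deflate: subtract the component just found so the next pass finds another.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

def lsa_doc_vectors(docs, k=2):
    """Map each document to a point in a k-dimensional latent space."""
    vocab = sorted({w for d in docs for w in d.split()})
    # Term-document matrix: one row per term, one column per document.
    a = [[d.split().count(t) for d in docs] for t in vocab]
    n = len(docs)
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    # Coordinate j of document i is singular value j times eigenvector j at i.
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs] for i in range(n)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = lsa_doc_vectors(CORPUS, k=2)
print(cosine(vectors[1], vectors[2]))  # car/automobile documents: close to 1
print(cosine(vectors[0], vectors[3]))  # car vs. garden documents: close to 0
```

The second and third documents share only one of their three words, yet they land at almost the same point in the latent space because both co-occur with the vocabulary of the first document. Even at this scale the cost is visible: the Gram matrix grows with the square of the number of documents, which is consistent with the computational concern raised above.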

Other approaches extract “concepts” (key words and phrases) from documents using a variety of techniques, including lists of words and phrases, regular expressions, word bigrams (or trigrams), and advanced named entity extraction techniques created by Natural Language Processing (NLP) researchers. If the quality is high enough, extracted concepts can be used to enable corpus statistics, query enhancements, faceted browsing, and other forms of exploratory search. Unfortunately, the historical approach to faceted browsing has required that you pre-define your facets before indexing starts. If you want to change them, you will likely have to re-index the content, which can take weeks.
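As a rough illustration of three of the extraction techniques named above (a word list, a regular expression, and word bigrams), the Python sketch below pulls concepts out of a few sentences and counts them as facets. The company list, the money pattern, the sample sentences, and the function names are all hypothetical. Computing the counts over whatever documents are passed in, rather than against a schema fixed before indexing, is an assumption about how facets could be derived on demand.

```python
import re
from collections import Counter

# Hypothetical word list and pattern, chosen only for this example.
COMPANY_LIST = {"attivio", "google", "autonomy"}
MONEY_PATTERN = re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")

# Hypothetical sample documents.
SAMPLE_DOCS = [
    "Google acquired a startup for $50 million",
    "Autonomy and Google compete in enterprise search",
    "Enterprise search is hard",
]

def extract_concepts(text):
    """Return the concepts found in one document, grouped by facet name."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        "company": sorted(set(words) & COMPANY_LIST),                  # word list
        "money": MONEY_PATTERN.findall(text),                          # regular expression
        "bigram": [" ".join(pair) for pair in zip(words, words[1:])],  # word bigrams
    }

def facet_counts(docs, facet):
    """Count, per concept, how many of the given documents mention it."""
    counts = Counter()
    for text in docs:
        counts.update(set(extract_concepts(text)[facet]))
    return counts

print(facet_counts(SAMPLE_DOCS, "company").most_common())  # [('google', 2), ('autonomy', 1)]
print(facet_counts(SAMPLE_DOCS, "bigram")["enterprise search"])  # 2
```

Frequent bigrams such as "enterprise search" are candidates for phrase concepts, while the word list and the pattern yield typed facets that a browsing interface could display with their counts.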

Open source to the rescue?
What does the open source community bring to the party? The news is decidedly mixed. Several university labs make their research code available on liberal terms, but the systems do not have polished user interfaces. SourceForge has over 500 open-source “search engine” projects; most are incomplete. In simple terms, what Lucene and other open source solutions do, they do quite well, but they do little of what is actually needed to roll out a professional application. Lucene states clearly on its home page that it is merely a “library” and not a complete solution. In spite of some successes, building a search engine based on open-source components is still a task for experts.

One interesting recent development is openpipeline.org, which offers “open source software for crawling, parsing, analyzing and routing documents ... for enterprise search and document processing”. While the effort is to be commended, the results are less than stellar: most of the text extraction modules are proprietary, and the pipelines are statically configured. While documents can be processed in parallel, there is no support for conditional execution of components or for cycles in the pipeline, which are essential for processing complex documents such as emails and archives in zip format. In formal terms, the architecture is not Turing complete.

This is an issue for most of the commercially available engines as well. They follow a linear pipeline that does not support branching, conditional logic, or parallel processing. Processing zip files and emails with attachments is problematic for such a system. A far better approach is a looping workflow that indexes the container first and then the contained items, while maintaining the child/parent relationship.
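A looping workflow of this kind can be sketched in a few lines of Python using the standard zipfile module. This is only an illustration under stated assumptions: the record layout, the function names, and the use of nested zip archives as the sole container type are invented for the example, and a real system would also handle email attachments and extract text from each leaf.

```python
import io
import zipfile
from collections import deque

def make_zip(files):
    """Build an in-memory zip archive from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

def index_with_children(name, data):
    """Index a document, then loop back over anything it contains.

    Returns a list of records; each record keeps the id of its parent
    so the child/parent relationship survives indexing.
    """
    index = []
    queue = deque([(name, data, None)])  # (name, bytes, parent id)
    while queue:
        item_name, item_data, parent_id = queue.popleft()
        doc_id = len(index)
        is_container = zipfile.is_zipfile(io.BytesIO(item_data))
        # The container itself is indexed before any of its members.
        index.append({
            "id": doc_id,
            "parent": parent_id,
            "name": item_name,
            "kind": "container" if is_container else "leaf",
        })
        if is_container:  # conditional branch
            with zipfile.ZipFile(io.BytesIO(item_data)) as zf:
                for member in zf.namelist():
                    # Cycle: each member re-enters the same workflow.
                    queue.append((member, zf.read(member), doc_id))
    return index

# Hypothetical message archive with a nested attachment archive.
inner = make_zip({"photo.txt": b"beach"})
outer = make_zip({"notes.txt": b"hello", "attachments.zip": inner})
for record in index_with_children("mail.zip", outer):
    print(record)
```

The queue is what turns the pipeline into a loop: a strictly linear pipeline would stop after the outer archive, whereas here every extracted member is fed back through the same steps with a pointer to its parent.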

A search engine that only produces results when the user types a query into a search box is very limited in today’s modern knowledge enterprise. Newer systems allow additional forms of input, including drilling down into datasets based on filters created by clicking a faceted UI, “geo-search” near the user’s current location, and even personalized search based on the user’s role, access rights, and prior search history. Some enterprise search systems have begun to offer APIs which enable end users (or their IT departments) to build standalone alerting applications with some effort. Why don’t the workflows support this natively?

Unified information access
It is increasingly important to search all of the data available in the enterprise. This includes documents in multiple formats (email, Word, PDF, etc.), as well as structured catalog and transaction data from product and accounting databases. The problem has been that even the most disciplined enterprise has at least two silos for retrieving information: search is used to find answers in content and BI/DW finds answers in data. Only recently have strides been made in combining search and BI/DW capabilities to deliver answers based on all corporate information assets. Delivering true unified information access, however, requires an approach that does not compromise the richness of either.

The JOIN operator in the SQL language is the linchpin of relational database retrieval. It combines rows from two or more database tables that match on a shared key. For example, a request for “our 100 best-selling products in the last quarter” would join the table of products with the table of invoices to determine the products that sold the most. The JOIN is possible because the invoice table contains a product ID number that links to the invoice’s product in the product table.
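The best-seller request can be written out as follows, using Python's built-in sqlite3 module. The table layout, column names, sample rows, and dates are hypothetical and were chosen only to make the JOIN visible. The article does not specify a schema.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, product_id INTEGER,
                           quantity INTEGER, invoice_date TEXT);
    INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
    INSERT INTO invoices VALUES
        (10, 1, 5, '2008-04-15'), (11, 2, 9, '2008-05-02'),
        (12, 1, 7, '2008-06-20'), (13, 3, 4, '2008-01-10');
""")

def best_sellers(conn, start, end, limit=100):
    """Return (name, units) for the top-selling products with start <= date < end."""
    return conn.execute("""
        SELECT p.name, SUM(i.quantity) AS units
        FROM products AS p
        JOIN invoices AS i ON i.product_id = p.product_id  -- the linking key
        WHERE i.invoice_date >= ? AND i.invoice_date < ?
        GROUP BY p.product_id
        ORDER BY units DESC
        LIMIT ?
    """, (start, end, limit)).fetchall()

print(best_sellers(conn, "2008-04-01", "2008-07-01"))  # [('Widget', 12), ('Gadget', 9)]
```

The third product is absent from the result because its only invoice falls outside the quarter. The ON clause is the link described above: the product ID stored on each invoice.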

Now, imagine extending the JOIN to unstructured content like documents and email. To illustrate, let’s change our example to, “blog and press information about our 100 best-selling products in the last quarter”. A database engine would reshape the web logs and RSS news feeds to fit in the database and then perform the JOIN. The challenge would be to determine which logs and feeds are relevant to include in the first place. A search engine, on the other hand, would select the relevant logs and feeds, but determining the products would be hard. At the very least, the final search query would be quite long, “OR’ing” together every product name.

In general, search engines incorporate structured data incorrectly. Ingestion is a static operation preceding all queries; there is no ad hoc capability. Returning to the database for more data typically requires reconfiguring the index and re-indexing the content. A better solution would be to extract all the data from every table in the database at index time and perform the JOINs at query time.
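One way to picture a query-time JOIN is the Python sketch below, which returns to the "blog and press information about our best-selling products" example. Everything in it is an assumption made for illustration: the record layouts, the sample rows, the function names, and the very simple matching of product names inside post text. It does not describe how any particular product works. The point is only that table rows are stored as they are at index time, and the aggregation and the JOIN on the product ID happen when the question is asked.

```python
from collections import Counter, defaultdict

# Hypothetical structured rows and unstructured posts.
PRODUCTS = [
    {"product_id": 1, "name": "widget"},
    {"product_id": 2, "name": "gadget"},
    {"product_id": 3, "name": "gizmo"},
]
INVOICES = [
    {"product_id": 1, "quantity": 5, "quarter": "Q2"},
    {"product_id": 2, "quantity": 9, "quarter": "Q2"},
    {"product_id": 1, "quantity": 7, "quarter": "Q2"},
    {"product_id": 3, "quantity": 4, "quarter": "Q1"},
]
POSTS = [
    {"post_id": "b1", "text": "Review: the new Widget is great"},
    {"post_id": "b2", "text": "Gadget recall announced"},
    {"post_id": "b3", "text": "Widget vs. Gadget: a comparison"},
    {"post_id": "b4", "text": "Gizmo rumors"},
]

def build_index(products, invoices, posts):
    """Index time: keep every table row and tag each post with the products it mentions."""
    name_to_id = {p["name"]: p["product_id"] for p in products}
    posts_by_product = defaultdict(list)
    for post in posts:
        words = {w.strip(".,:;!?") for w in post["text"].lower().split()}
        for word in words & set(name_to_id):
            posts_by_product[name_to_id[word]].append(post["post_id"])
    return {"products": products, "invoices": invoices,
            "posts_by_product": posts_by_product}

def posts_about_best_sellers(index, quarter, limit=100):
    """Query time: aggregate the invoices, then JOIN to the posts on product_id."""
    units = Counter()
    for row in index["invoices"]:
        if row["quarter"] == quarter:
            units[row["product_id"]] += row["quantity"]
    names = {p["product_id"]: p["name"] for p in index["products"]}
    return [(names[pid], sorted(index["posts_by_product"].get(pid, [])))
            for pid, _ in units.most_common(limit)]

index = build_index(PRODUCTS, INVOICES, POSTS)
print(posts_about_best_sellers(index, "Q2"))
# [('widget', ['b1', 'b3']), ('gadget', ['b2', 'b3'])]
```

Asking the same question for a different quarter needs no re-indexing, and no long query "OR'ing" together every product name, because the product list is computed from the invoice rows at the moment of the query.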

The pace of information access innovation is increasing, and the old guard of enterprise search vendors is beginning to show its age. It is a truism in software that it is always easier to add new features than it is to fix the old ones.

Conclusion: Light at the end of the tunnel
As databases and search engines converge, legacy search engines will be found to be lacking support for database operations such as transactions and ETL (extract, transform, and load) processing. In a modern data repository, the data needs to be transformed via dynamically reconfigurable workflows. One static pipeline does not suffice, nor does re-building the index every time the schema changes.

The good news is that the predicted convergence of the database and search worlds is leading to some significant improvements in the search experience. As we move beyond the search box (the “user interface of last resort”), enterprise search solutions are beginning to support many different search modalities, including exploratory search, information discovery, and information synthesis. Navigation solutions are multiplying. Faceted search is already commonplace at major ecommerce sites.

Perhaps the most concrete realization of the new search experience is in the user interface. Partly due to the influence of the BI/DW world, we are beginning to see “dashboard-like” interfaces, including pie charts, scatter plots, and histograms. Tabbed interfaces allow navigation between datasets, while visualization techniques such as tag clouds, area charts, and heat maps accelerate the data exploration process.

The next generation of enterprise search engines consists of a componentized architecture implementing industry-standard protocols and built upon a flexible workflow engine that supports loops and branches. Clean architectural decisions drive down the total cost of ownership, permit dynamic reconfiguration across multiple servers, and enable flexible document processing with dynamic alerts on data which does not meet your standards.

Biography
Jonathan Young is a senior research engineer at Attivio. While working at Dragon Systems, he built the speech recognition engine and speech user interface which is now known as Dragon NaturallySpeaking, and then built the Dragon AudioIndexing engine for multimedia information retrieval. Dr. Young is the inventor on six patents, and his current interests include statistical algorithms for intrusion detection, speech recognition, and natural language processing.