Quocirca's Straight Talking: Searching for search technology

'Seek and ye shall find' - if only...

As enterprises amass more and more data, the task of searching through it has become increasingly unwieldy. Quocirca's Jon Collins examines the current options for easing info management headaches.

Seek and ye shall find, goes the adage - but this is clearly not a concept that has yet been taken on board by the mainstream of computing. While the principle of traipsing algorithmically through a file system or database is well-established, such mechanisms are often primitive, unwieldy and slow. Consider the 'find file' mechanism in Microsoft Windows, or the search facility in any email tool.

Equally established is the principle of generating and maintaining an index of information to enable searches to proceed more efficiently. However, such capabilities have generally proved too expensive for widespread use, remaining instead in the domain of specialist applications or top-end content management systems.
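To make the contrast concrete, here is a minimal sketch of the two approaches: scanning every document on every query, versus building a word-to-document index once and looking terms up in it. The document names and contents are invented for illustration, and real indexing products do far more (stemming, ranking, incremental updates) than this toy does.

```python
from collections import defaultdict

# Hypothetical documents, made up purely for illustration.
DOCS = {
    "memo.txt": "quarterly sales figures for the northern region",
    "notes.txt": "contact name and phone number for the sales team",
    "plan.txt": "project plan with diagram and reference list",
}


def linear_search(docs, term):
    """Scan the full text of every document on every query."""
    term = term.lower()
    return sorted(
        name for name, text in docs.items() if term in text.lower().split()
    )


def build_index(docs):
    """Build an inverted index once: each word maps to the documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def indexed_search(index, term):
    """Answer a query with a single dictionary lookup instead of a scan."""
    return sorted(index.get(term.lower(), set()))


INDEX = build_index(DOCS)
print(linear_search(DOCS, "sales"))
print(indexed_search(INDEX, "sales"))
```

Both functions return the same documents. The difference is where the work happens: the scan pays the full cost on every query, while the index pays it once up front and then has to be kept up to date, which is the overhead that has historically made indexing expensive.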

To be fair, Microsoft Office bundles the 'find fast' indexing tool but it is recognised to be resource-hungry and inefficient. IBM's Lotus Domino also has a fully indexed search tool, which is fine for Domino users but of little use for anyone else.

Ultimately it is the users who suffer, wasting time as they hunt for that necessary detail: a past fact, a contact name, a phone number, a diagram or reference. As the volumes of information we all have to deal with continue to increase, the time spent hunting grows, eating away at our efficiency and adding to the frustration of working with computers. In many cases, the alternative to finding a necessary piece of information is to go without.

The growing number of personal productivity tools provide a modicum of searchability. For email, there is the most excellent Nelson Email Organizer (NEO) from Caelo Software. This sifts through multiple Outlook email folders and allows them to be searched by content, by name, date and even by attachment type, providing 'views' (pre-defined search queries) to enable searches to be repeated as and when. NEO has its limitations - it does not offer searching within attachments, for example. Yet it does offer a welcome lifeline for anyone drowning in email.

The increasingly ubiquitous Google has ported its own search technology to offer a free desktop tool, offering an alternative (yet more basic) way to search both emails and email attachments, as well as other files stored locally.

For the best of both worlds the two mechanisms can be run in parallel, though the burden on CPU starts to become noticeable on a less than current machine. Microsoft also offers what it terms 'Windows Desktop Search' and has recently acquired an email search technology which it is no doubt currently integrating into Outlook.

The speed at which such tools can become indispensable is indicative of how difficult it can be to find information in the first place. For individuals they may be life-changing but from a corporate perspective they can never be considered as more than a good start. The scope of a search can always be widened, and provided it can happen efficiently, it will always be preferable to search across multiple file shares in addition to the local desktop.

Ultimately, it should be possible to search across the entirety of a corporation's electronic records, be they stored as emails, databases, web pages, Word files, PDFs or whatever esoteric format the company may be using. The fact that this is so clearly a pie in the sky ideal for many companies does not make it any less desirable.

To achieve such a vision is technologically possible today. Not least, Google offers an enterprise version of its search technologies which enables companies to search efficiently across multiple types of repository. There are a handful of other specialist search companies, for example the Norwegian firm FAST, whose technology comes highly recommended by a number of governments and other large organisations. Such enterprise-wide capabilities do not come cheap, however. Other software companies working on solving one part of the problem include Zantaz, whose strength is in digging useful information out of email archives.

So we've established that the capability to search exists in a variety of forms, and at a range of price points. It seems that only the human trait of making do is preventing us from adopting more efficient, effective ways of finding information. While it would be simple to berate the reader for not running at least Google Desktop Search (assuming the computer can cope with the small yet noticeable overhead), perhaps we should also be wondering why such facilities were not built in from the start. Why didn't earlier versions of Microsoft Windows, Unix or other operating systems offer indexing as part of the package?

The straightforward answer is twofold. First, nobody thought of it. Second, in the past computers were not powerful enough to justify spending valuable processor cycles on peripheral activities such as indexing. In today's world where we have desktops and laptops hugely over-specified for the majority of tasks, we are now in a position to add index-based search to our personal tool chests. It's the same in the server room: many jobs will still require absolutely every available ounce of CPU but the trend towards consolidation is symptomatic of the fact that modern commodity servers are more than adequate for most jobs.

The time for search is now, and it should come as no surprise that the major software vendors are indeed adding more advanced search capabilities to their own products. The next version of Microsoft Windows, Vista (formerly Longhorn), will offer index-based search out of the box, with the addition of tagging mechanisms which essentially allow direct customisation of the index. As with Domino, IBM's Workplace collaboration platform will also offer an integrated, index-based search facility. The choice is there: end user organisations can either wait for search capabilities to arrive built into these next generation products, or they can adopt a third party product to fill the gap.

There is plenty of good to be gained from better search facilities but there are many issues to be overcome as well. It would be trite to suggest that any product could offer seamless interfaces into whatever legacy repositories a company might throw at it. Some content management packages cannot even communicate between instances of themselves, never mind offering dynamic access to a third-party product. To implement search should be seen as a goal, rather than a one-shot operation, and assiduous application of the Pareto principle (identifying the 20 per cent of data stores that have 80 per cent of the business value) is necessary to avoid wasting time and money.
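As a rough illustration of applying the Pareto principle here, the sketch below ranks data stores by an estimated business value and keeps adding the most valuable ones until a target share (80 per cent by default) is covered. The store names and scores are entirely hypothetical; in practice, arriving at such estimates is the hard part.

```python
def pareto_shortlist(store_values, target=0.8):
    """Return the highest-value stores that together reach `target` of total value."""
    total = sum(store_values.values())
    ranked = sorted(store_values.items(), key=lambda kv: kv[1], reverse=True)
    shortlist, covered = [], 0
    for name, value in ranked:
        if covered >= target * total:
            break  # enough value is already covered; stop adding stores
        shortlist.append(name)
        covered += value
    return shortlist


# Hypothetical business-value scores for ten repositories (arbitrary units).
STORES = {
    "email archive": 50, "shared drive": 32, "CRM database": 4,
    "intranet": 3, "wiki": 3, "legacy DMS": 2, "HR system": 2,
    "finance reports": 2, "scanned faxes": 1, "old project CDs": 1,
}

print(pareto_shortlist(STORES))
```

With these made-up figures, two of the ten stores account for more than 80 per cent of the estimated value, so they would be the first candidates for indexing.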

As well as the technical hurdles, a wealth of potential pitfalls can be caused by making information more accessible, not least of which is security. Much information today is protected from prying eyes only because it is difficult to find. Search mechanisms often prove themselves too clever for their own good, discovering information that should have remained hidden. Sometimes it is not necessary to be able to read a document to cause a security breach. For example, the mere existence of a spreadsheet on the human resources file share containing the term 'planned redundancies' would be enough information for many.
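One common mitigation is to trim results by permission before anything is shown, so that a user never learns a document exists unless they are entitled to open it. The following is a minimal sketch under that assumption; the Document class, group names and file paths are hypothetical and do not reflect how any particular product implements access control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    path: str
    text: str
    allowed_groups: frozenset  # groups permitted to see this document at all


def search(documents, term, user_groups):
    """Return paths of matching documents the user is entitled to see."""
    term = term.lower()
    hits = []
    for doc in documents:
        # Check permissions first, so hidden documents never reach the results,
        # not even as a title or a path.
        if not (doc.allowed_groups & user_groups):
            continue
        if term in doc.text.lower():
            hits.append(doc.path)
    return hits


# Hypothetical corpus for illustration only.
CORPUS = [
    Document("hr/planning.xls", "Planned redundancies by department",
             frozenset({"hr"})),
    Document("public/newsletter.doc", "Staff newsletter and social events",
             frozenset({"staff", "hr"})),
]

print(search(CORPUS, "planned redundancies", {"staff"}))
print(search(CORPUS, "planned redundancies", {"hr"}))
```

The ordinary staff member gets no hit at all for the sensitive term, while a member of the HR group sees the spreadsheet. The design choice is that the permission check comes before the match, not after it.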

Search may require changes to how businesses operate to ensure these capabilities remain within corporate guidelines and security policies. As data volumes and types continue to grow, however, search will inevitably turn from a 'nice to have' to a 'need to have'.

New technologies give us new ways of working with information. RFID, for example, enables links to be created between electronic data and physical items, and streamed content gives easy access to large video files. But both bring with them significantly more data and thus fresh difficulties for finding information.

Meanwhile, the growing and necessary burden of corporate governance, either to meet standards or legislation, brings an added impetus to the proceedings. Even if organisations are not ready for any big bang, they can nonetheless look to understand what level of search tools might be appropriate for their current business requirements.

By putting in place an understanding now, companies and other bodies can start to prepare for the inevitable point where there is just too much information to be indexed and sifted through manually.