Unnatural language processing

Indexing a large chunk of data is a bit like joining Weight Watchers: it's a useful first step, but it doesn't immediately solve the problem of how you're going to deal with all that blubber.

Getting the indexing to be more intelligent — that is, working out what's actually being said, rather than just what sequences of characters are in place — is nearly as challenging as resisting just one more Tim Tam.

I was reminded of this last week when enterprise analysis software specialist SAS announced that it was buying out Teragram, a company which specialises in just that kind of process.

More specifically, Teragram uses "large annotated dictionaries containing several hundred million words in more than 30 languages" to help categorise documents according to criteria set by users. SAS will use the technology to enhance its Text Miner software and other products (though it will maintain Teragram as a separate "SAS company").

Now, I did an honours degree in linguistics rather too long ago, so I have a bit of an interest in how language processing works. Teragram is essentially using the brute force approach: lots and lots of data to handle lots and lots of potential scenarios.
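To make the brute force idea concrete, here is a minimal sketch in Python of dictionary-driven categorisation. It is my own illustration, not Teragram's actual method: the `ANNOTATED_DICTIONARY` entries, the category labels and the `categorise` function are all invented for the example, and a real system would hold vastly more entries than these six.

```python
import re
from collections import Counter

# Hypothetical annotated dictionary: word -> category label.
# These entries are made up for illustration only.
ANNOTATED_DICTIONARY = {
    "invoice": "finance",
    "payment": "finance",
    "refund": "finance",
    "server": "it",
    "password": "it",
    "outage": "it",
}


def categorise(text, dictionary=ANNOTATED_DICTIONARY):
    """Return the category with the most matching words, or None if nothing matches."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Each word found in the dictionary casts one vote for its category.
    votes = Counter(dictionary[w] for w in words if w in dictionary)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Nothing here understands language. It only counts matches, which is why coverage depends entirely on how big the dictionary is.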

In an era where processing power is ludicrously cheap, that's not a terrible approach. But it's nowhere near as elegant as an algorithm based on a more nuanced understanding of how language actually works. Our understanding of language is still far too fragmentary to make such an approach entirely feasible, but it remains a worthy goal — and it would produce smaller, faster software in the long run.

Such coding concerns aside, there's another, more insidious problem. Processing text is hard enough when it's written in a relatively coherent fashion. But as anyone who hangs around on message boards, Wikipedia talk pages or classrooms can tell you, in the SMS-speak age, assuming that to be the case is dangerous.

For an increasing number of people of all ages, capital letters are a foreign language, punctuation is a waste of space, accuracy in spelling is optional and sentences are like you know words what go 2gether they dont have to mak sense much lol.

While business communications should arguably still be more formal (and accurate), I wouldn't want to stake money (or a Tim Tam) on it.

If the trend continues, text mining intelligence will also need a degree of "un-intelligence" — trying to extract meaning from something that really didn't have any meaning in the first place. That might well require a lot more examples, which is good for storage manufacturers if nothing else.
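One way to picture why more examples would be needed: every noisy spelling has to be listed somewhere before the software can map it back to a real word. The sketch below, in Python, is an assumption on my part rather than a description of any real product; the `NOISY_VARIANTS` table and the `normalise` function are hypothetical.

```python
import re

# Hypothetical table of noisy spellings and their standard forms.
# Every new variant seen in the wild needs another entry here.
NOISY_VARIANTS = {
    "2gether": "together",
    "dont": "don't",
    "mak": "make",
    "u": "you",
}


def normalise(text, variants=NOISY_VARIANTS):
    """Lowercase the text and replace known noisy spellings with standard ones."""
    # Tokens may contain digits so that forms like "2gether" survive intact.
    # Punctuation other than apostrophes is dropped in this simple version.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(variants.get(t, t) for t in tokens)
```

A table like this only grows, which is rather the point about storage.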