Over 150 billion email messages are currently being sent every day from about 3.5 billion email accounts worldwide. So, it’s easy to understand why we miss important messages as we struggle to keep up with the surge. For some, the situation has become so bad that email is no longer a reliable way to get in touch with them since they can’t quickly sort out the important stuff.
A new research project coming out of Israel looks to solve part of the problem by using big data to boil email messages down to their most important information and summarize them so they can be digested much more quickly, especially on mobile devices.
The project is led by Mark Last, an assistant professor at Ben Gurion University in Be’er-Sheva, Israel, and it’s focused on using algorithms to summarize blocks of text into their most important elements. From the standpoint of email, this could have two main benefits:
1. Create one-sentence email summaries to be used in preview panes so that users can quickly flip through a list of messages and see the main idea of each email without having to open it.
2. Summarize long emails into 100-200 words that highlight the key points.
Last and his team of researchers at BGU are accomplishing this by using the tools of big data. In fact, Last has been working on big data and using it to solve problems since 1996, when he was a PhD student at Tel Aviv University. That was long before it was ever called “big data.” Back then it was just data mining with unstructured data -- one of the key elements of big data -- and Last’s PhD sponsor barely understood the web mining and text mining research that he was doing.
Now, Last, who was born in Russia and came to Israel as a kid in 1977, is putting that experience to good use in what has become one of the hottest fields in IT. In 2008, he became a professor of Information Systems Engineering at Ben Gurion University, and one of his big projects has been using text mining to find terrorist sites on the web.
There are tens of thousands of terrorist organization sites on the Internet, but they often disguise themselves as news, information, or community sites. Last and his team have used algorithms called “characterization models” to scan the web and pinpoint terrorist sites by identifying words that they use repeatedly, such as “enemy” and “martyr,” and phrases that they try to dance around, such as saying “human bomb” rather than "suicide bomber."
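To make the idea of flagging pages by repeated vocabulary concrete, here is a minimal Python sketch. It is not the team's characterization model, whose details are not given here; it simply measures how often a list of indicator terms appears in a text and flags the text if that rate crosses a cutoff. The function names, the term list, and the 0.02 threshold are all assumptions made for illustration.

```python
import re


def term_rate(text, terms):
    """Return indicator-term hits per word of text (0.0 for empty text)."""
    lowered = text.lower()
    total_words = len(re.findall(r"\w+", lowered))
    if total_words == 0:
        return 0.0
    hits = 0
    for term in terms:
        # Whole-word match; also works for multi-word phrases.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        hits += len(re.findall(pattern, lowered))
    return hits / total_words


def flag_for_review(text, terms, threshold=0.02):
    """Hypothetical rule: flag a page when the term rate reaches the threshold."""
    return term_rate(text, terms) >= threshold
```

A real system would presumably learn its terms and weights from labeled examples rather than rely on a hand-written list, and a flag like this would only mark a page for human review.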
Clearly, this kind of data mining is different from the text summarization used in the email research described above.
"In data mining, in text mining, we have different methods, different tools," said Last.
Nevertheless, the two lines of work support each other, and both are aspects of big data. The text summarization work started as an initiative to help summarize large numbers of news articles, short books, and documents on the web. This was especially useful to intelligence agencies, which used the technology to comb through thousands of news reports and web documents as quickly as possible. They could look at the 100-200 word summaries of the pages and documents and then decide which ones deserved a further look versus which ones they could avoid wasting time on.
Out of that work, Last and his team got the idea of applying summarization to email, where it could help you quickly sort through messages and find the ones that need your attention versus the ones you can safely ignore.
The work started in English since so much of the web is primarily in English, and there are already very good natural language processing tools in English. However, the work has now progressed to Hebrew, Arabic, and other languages as well. Ultimately, they have developed a new method of text summarization that is language-independent, and that’s a big part of the magic.
The algorithm scans the sentences in a document and first calculates metrics for each one, such as the number of words and the relation of words in the sentence. The second stage weights the sentences to find the most important ones. The algorithm also looks for summaries created by humans (in the case of documents and news articles) and then looks for similar words, phrases, and themes to guide the summarization.
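The two stages described above (compute per-sentence metrics, then weight the sentences and keep the best ones) can be illustrated with a small Python sketch. This is a generic extractive summarizer, not the BGU team's patented method: the three metrics (average word frequency, relative length, position), the fixed weights, and all function names are assumptions chosen for illustration. It also skips the step of learning from human-written summaries. It relies only on punctuation and word boundaries rather than a language-specific parser, which is one simple way to stay largely language-independent.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"\w+", sentence.lower())


def score_sentences(sentences, weights=(0.6, 0.2, 0.2)):
    """Stage 1 and 2: compute metrics per sentence, combine them with weights."""
    w_freq, w_len, w_pos = weights  # assumed, hand-picked weights
    tokens = [tokenize(s) for s in sentences]
    freq = Counter(t for toks in tokens for t in toks)
    max_len = max((len(toks) for toks in tokens), default=1) or 1
    max_freq = max(freq.values(), default=1)
    scores = []
    for i, toks in enumerate(tokens):
        if not toks:
            scores.append(0.0)
            continue
        # Metric 1: how common the sentence's words are in the whole text.
        avg_freq = sum(freq[t] for t in toks) / (len(toks) * max_freq)
        # Metric 2: length relative to the longest sentence.
        rel_len = len(toks) / max_len
        # Metric 3: earlier sentences get a higher position score.
        position = 1.0 - i / len(sentences)
        scores.append(w_freq * avg_freq + w_len * rel_len + w_pos * position)
    return scores


def summarize(text, max_words=200):
    """Pick the highest-scoring sentences that fit within the word budget."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(tokenize(sentences[i]))
        if used + n <= max_words:
            chosen.append(i)
            used += n
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarize(email_body, max_words=200)` would give the longer digest, while a small budget such as `max_words=25` approximates the one-line preview. In a fuller system the weights would be tuned against human-written summaries instead of being fixed by hand.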
While the technology is improving by leaps and bounds, that doesn't necessarily mean it's coming to any commercial email services in the immediate future.
"In general, we’re not here to develop products," said Last. "We're here to develop methodologies."
Last said that his team has applied for a U.S. patent on their text summarization methods. Ben Gurion University is very resourceful about licensing and selling its patents to commercial companies (as discussed in my article on Israel's quest to build the next Silicon Valley), so there's the potential that this will eventually be productized. There's also the possibility that a company like Google is already working on something similar.
However, the first email service to benefit from text summarization could be Yahoo Mail, since Yahoo bought text summarization pioneer Summly in March. Summly was the brainchild of British teen entrepreneur Nick D'Aloisio. Last said his team exchanged many emails with D'Aloisio on the similarities of their work (before Yahoo bought D'Aloisio's company), and he was very complimentary about the work that D'Aloisio was doing and the attention that it received. Last viewed it as validation of the power of text summarization and its rising importance.
I like to think of text summarization as one of the fruits of big data that will have a direct impact on consumers and professionals alike.