I was offered the opportunity to communicate with Dr. Edward Fox from Virginia Tech about his disaster preparedness project and his success using LucidWorks.
My name is Dr. Edward A. Fox. I serve in three related roles that tie to our work on disaster preparedness. I am a Professor in the Department of Computer Science (where I teach courses and engage in related research and service), Faculty Adviser to Virginia Tech's Vice President for Information Technology (which includes advising about campus IT and liaising with University Libraries), and Director of the Digital Library Research Laboratory (DLRL). The DLRL was established as a Virginia Tech research laboratory, supported by Information Technology and the Department of Computer Science.
Over the years, our work related to digital libraries has covered theory, software and system development, experimentation, and human issues. Our mission statement explains that we work on integrating the best of information retrieval, multimedia, hypermedia, and visualization with the best and most humanistic aspects of living libraries. Virginia Tech is a top research university, part of the state system in Virginia, and one of our nation's land-grant universities. It has about 30,000 students.
The Crisis, Tragedy, and Recovery Network (CTRnet), an NSF-supported project, is a digital library network that researches a broad range of services relating to different kinds of tragic events. Through this digital library, we collect and archive different types of CTR-related information such as websites, photos, videos, blogs, and tweets.
Over the last five years, we have collected and archived data about scores of different crisis events that have occurred around the globe. This effort has benefited from our collaboration with the Internet Archive. The total size of the data collected is over 10 TB, so we needed a software solution like LucidWorks to process the data and provide services through our digital library.
We have used open-source software packages like Lucene, Solr, and Weka. Over the years, we also have employed a number of software systems that we developed ourselves, like SMART, CODER, MARIAN, Envision, ETANA, CITIDEL, Ensemble, and SuperIDR. Having worked since 1978 with information retrieval and multimedia technologies, as well as artificial intelligence and machine learning software, we considered related algorithms and toolkits. We also have worked with repository, content management, and digital library systems connected with the Open Archives Initiative, including DSpace, Fedora, and Drupal.
The LucidWorks Big Data framework fits well with our goal of processing/indexing large collections of web archive (WARC) files. Since we have around 10 TB of WARC files and since LucidWorks can ingest and process WARC files directly (through Hadoop and its file processing, as well as related workflows), our needs were directly met.
In addition, because a number of open-source software tools are packaged together into an integrated system that also works well for corporate clients, it is faster for us to use this platform in classes and in research projects. Further, we were able to work out a collaborative connection with LucidWorks, in return for our sharing and disseminating related educational modules suitable for those interested in LucidWorks.
First, we have benefited from a smoothly operating suite of software.
Second, we have benefited from support provided by LucidWorks personnel as we have worked to make the software operate on three different hardware systems.
Third, we received assistance through documentation and support when using the software in a graduate-level course on Information Storage and Retrieval.
Fourth, once we had configured the software for System G, a 'green' supercomputer at Virginia Tech, we were able to quickly process data for events like the Boston Marathon bombing and to develop a series of supporting user interface prototypes. Now that we have processed part of our data using the software, it appears certain that we can meet our goal of processing the full 10 TB of data and then making the processed data available for public and research use.
Finally, thanks in part to an agreement that LucidWorks will continue to collaborate, it appears likely that we will receive support from NSF for a follow-on project to extend and broaden the CTRnet work.
LucidWorks is a good partner to work with. It is important to understand one's application, to have technical expertise in the area, and to study each of the tools connected with LucidWorks. That said, the LucidWorks staff are quite helpful on technical and related matters, and they provide high-quality support.
It took us some time to configure the software to fit our design, but once that was done, it was easy to process the incoming data automatically and to make the results accessible through the user interface.
We are confident that others who face the challenge of processing large amounts of similar data can tailor the product to their needs and enjoy the results.
Finally, we recommend that those interested consider using the suite of educational modules we have developed.