Seminar by Srikanta Bedathur
It's about time - Searching and Mining of Large-Scale Archives of Text Data
Srikanta Bedathur
Max Planck Institute for Informatics, Saarbrücken
Date: Monday, August 31, 2009
Time: 2:00 PM
Venue: CS102.
Abstract:
The World Wide Web is fast becoming the de facto platform for disseminating and accessing content. In addition to the huge amount of scientific, cultural and social information that is already available, archival collections are increasingly becoming available as another kind of valuable content. These collections include Web archives such as those from the Internet Archive, digital archives of newspapers, versioned encyclopedic sources like Wikipedia, blogs that reflect the zeitgeist of society, and many others. These archives make it possible to trace the evolution of content and are a very valuable resource. Searching and analyzing them requires close attention to temporal information, present both explicitly and implicitly, and to the effects of content change over time.
In this talk, I describe our ongoing work on building novel text search and mining features that pay special attention to the time axis of these archives. The first part focuses on search problems, where I introduce the notion of time-travel keyword queries and their evaluation along two temporal dimensions: (i) the transaction times of the documents, based on their publication times, and (ii) the reference times of the content in the documents. In the second part of the talk, I present the issue of terminology evolution in archives and methods to mitigate the associated search quality problems.
Finally, if time permits, I will briefly outline the Everlast project for building a scalable end-to-end infrastructure for Web archive gathering, indexing and searching.