Seminar by Atish Das Sarma
Efficient Random Walk Computation on Large Graphs
Atish Das Sarma
Georgia Tech., Atlanta
Date: Friday, January 8, 2010
Time: 4:30 PM
Venue: CS101.
Abstract:
The advent of the internet has greatly increased our ability to produce as well as the need to process massive data sets. Examples of massive graphs include the world wide, social networks, distributed networks, and query-document click graphs. Traditional techniques for computing various graph properties at this scale prove prohibitively expensive, and there has been an increased interest in developing new techniques. We explore effcient computation of random walks in such large graphs. Random walks are a fundamental tool that have been widely used across all of computer science - theory, web algorithms, distributed networks, data mining, and even statistical physics. Some of the applications of random walks, in various settings, include estimating connectivity, sampling, volume estimation, search in networks, similarity detection, graph partitioning, and many more. In this work, we explore the computation of random walks in two different settings - graph streams, and distributed graph networks. Graph steaming is a natural framework for processing web-sized graphs; the input is stored on disk and streaming/sequential access is allowed while maintaining small working memory. The goal is to minimize the number of sequential passes required for the computation. Distributed graph networks arise in distributed computing theory and P2P networks, where each node in the network has only local knowledge about the graph topology. Computation is decentralized and proceeds in rounds where in each round, a node can communicate a restricted number of messages with its neighbors. The goal is to minimize the number of rounds of communication. In the first work, we present the first nontrivial streaming algorithm for performing several random walks of length $l$ in $\tilde{O}(\sqrt{l})$ passes, improving upon the naive $O(l)$ pass algorithm, when working with $\tilde{O}(n)$ working memory ($n$ is the number of nodes in the graph). As a consequence, we show how to approximate PageRank vectors (a pioneering idea in the initial Google search engine) in $\tilde{O}(\sqrt{M})$ passes improving upon the widely used $O(M)$ pass technique, where $M$ is the mixing time of the network. In the second work, we present a sublinear round decentralized algorithm for sampling from random walks of length $l$. The algorithm runs in $\tilde{O}(\sqrt{lD})$ rounds where $D$ is the diameter of the undirected network. This work improves upon the naive O(l) round algorithm that has been used in distributed networks for decades in several applications such as constructing and maintaining expanders, sampling, and load monitoring. We further show a lower bound of $\Omega(\sqrt{l/\log l})$ rounds illustrating that our dependence on $l$ is near-optimal. The first part is joint work with S. Gollapudi and R. Panigrahy, and the second part is joint work with D. Nanongkai, G. Pandurangan, and P. Tetali.