Free keywords: -
Abstract:
As Web 2.0 and enterprise-cloud applications have proliferated, data mining
algorithms increasingly need to be (re)designed to handle web-scale
datasets. For this reason, low-rank matrix factorization has received
considerable attention in recent years, since it is fundamental to a variety of mining
tasks, such as topic detection and collaborative filtering, that are
increasingly being applied to massive datasets. We provide a novel algorithm
to approximately factor large matrices with millions of rows, millions of
columns, and billions of nonzero elements. Our approach rests on stochastic
gradient descent (SGD), an iterative stochastic optimization algorithm; the
idea is to exploit the special structure of the matrix factorization problem
to develop a new "stratified" SGD variant that can be fully distributed
and run on web-scale datasets using, e.g., MapReduce. The resulting
distributed SGD factorization algorithm, called DSGD, provides good speed-up
and handles a wide variety of matrix factorizations. We establish
convergence properties of DSGD using results from stochastic approximation
theory and regenerative process theory, and also describe the practical
techniques used to optimize performance in our DSGD
implementation. Experiments suggest that DSGD converges significantly faster
and has better scalability properties than alternative algorithms.
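
To make the stratification idea concrete, the following minimal Python sketch illustrates one way a stratified SGD epoch for matrix factorization can be organized. It is an illustration under assumptions, not the paper's implementation: the sparse input is assumed to be a dict of nonzeros, and the function name dsgd_sketch and parameters (rank, lr, reg, and the grid size d) are hypothetical placeholders. The key property is that the d blocks forming a stratum touch disjoint row and column stripes, so in DSGD they could be processed by separate workers (e.g., MapReduce tasks) without conflicting updates; the sketch simply loops over them sequentially.

    import numpy as np

    def dsgd_sketch(V, m, n, rank=10, epochs=10, lr=0.01, reg=0.05, d=4, seed=0):
        # V: dict mapping (i, j) -> observed value of an m x n sparse matrix.
        # Returns factors W (m x rank) and H (rank x n) with V ~ W @ H.
        # All parameter names and defaults here are illustrative assumptions.
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=0.1, size=(m, rank))
        H = rng.normal(scale=0.1, size=(rank, n))

        # Partition the nonzeros into a d x d grid of blocks by row/column stripe.
        blocks = {(bi, bj): [] for bi in range(d) for bj in range(d)}
        for (i, j), v in V.items():
            blocks[(i * d // m, j * d // n)].append((i, j, v))

        for _ in range(epochs):
            for s in range(d):      # one sub-epoch per stratum
                # The blocks (b, (b + s) % d), b = 0..d-1, share no rows or
                # columns, so their SGD updates never touch the same entries
                # of W or H and could run in parallel.
                for b in range(d):
                    for i, j, v in blocks[(b, (b + s) % d)]:
                        err = v - W[i] @ H[:, j]
                        w_old = W[i].copy()
                        # Regularized squared-loss SGD step on both factors.
                        W[i] += lr * (err * H[:, j] - reg * W[i])
                        H[:, j] += lr * (err * w_old - reg * H[:, j])
        return W, H

For example, dsgd_sketch({(0, 0): 5.0, (1, 2): 3.0}, m=2, n=3, d=1) factors a tiny 2 x 3 matrix sequentially; in the distributed setting, each block of a stratum would instead be shipped to its own worker together with the matching stripes of W and H.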