How does Spotify use big data?

I read an answer on Quora about how Spotify uses big data.

The following comments were posted by Erik Bernhardsson, Tech Lead at Spotify:

We use it for a ton of things. Spotify is just in the process of upgrading to a 700 node Hadoop cluster and we probably run 2000+ jobs every day. We use it for a lot of things, including toplists, recommendations, ad forecasting, business analytics, and lots of other things. Most of the Hadoop jobs are in Python or Hive, but we also run some stuff in pure Java, Pig and Scala (scalding).

We open sourced our workflow manager Luigi a while back. Here’s a presentation by me: Luigi Presentation at OSCON 2013

Apart from Hadoop, we use Cassandra extensively. We’re also running a test cluster with Storm and Kafka and we might start using it in production later this year.

Probabilistic latent semantic analysis is one method that works pretty well in the implicit context. We use it and related methods for our recommender system at Spotify.

PLSA

http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

$\log \prod_{(u,i) \in R} p(u,i)^{N_{ui}} = \sum_{(u,i) \in R} N_{ui} \log p(u,i) = \sum_{(u,i) \in R} N_{ui} \log \sum_z p(u|z)p(i,z)$
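
As a quick sanity check on the formula above, here is a minimal sketch that evaluates the PLSA log-likelihood on toy data. The function name `plsa_log_likelihood`, the data layout, and all probability values are made up for illustration; nothing here is taken from the slides.

```python
import math

def plsa_log_likelihood(counts, p_u_given_z, p_i_and_z):
    """Return sum over observed (u, i) of N_ui * log sum_z p(u|z) p(i,z).

    counts:       dict {(u, i): N_ui} of observed play counts
    p_u_given_z:  p_u_given_z[z][u] = p(u|z)
    p_i_and_z:    p_i_and_z[z][i]   = p(i, z)
    """
    num_factors = len(p_u_given_z)
    total = 0.0
    for (u, i), n_ui in counts.items():
        p_ui = sum(p_u_given_z[z][u] * p_i_and_z[z][i] for z in range(num_factors))
        total += n_ui * math.log(p_ui)
    return total

# Toy data (hypothetical): 2 users, 2 items, 2 latent factors.
counts = {(0, 0): 3, (0, 1): 1, (1, 1): 2}
p_u_given_z = [[0.9, 0.1], [0.2, 0.8]]  # each row sums to 1 over users
p_i_and_z = [[0.4, 0.1], [0.1, 0.4]]    # all entries sum to 1 over (i, z)

ll = plsa_log_likelihood(counts, p_u_given_z, p_i_and_z)
print(ll)
```

Only the observed pairs in `counts` are visited, which is what makes this objective cheap to evaluate.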

Alternative definition for $p(u,i)$

$p(u,i) = \exp(a_u^T b_i)/Z, \qquad Z = \sum_{u,i} \exp(a_u^T b_i)$

Thus, the log likelihood is:

$$ L = \left(\sum_{(u,i) \in R} N_{ui} a_u^T b_i\right) - T \log Z $$

where $T = \sum_{(u,i) \in R} N_{ui}$ is the total number of observations. Computing Z requires summing over all possible user-item combinations, not only the observed ones. That leaves the question of how to estimate Z efficiently. The author did not mention how this is handled. Perhaps a sampling algorithm could be used to estimate this term.
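
One way the sampling idea could look is a plain Monte Carlo estimate: draw user-item pairs uniformly at random, average $\exp(a_u^T b_i)$, and scale by the number of pairs. This is only a sketch of that guess, not the method used at Spotify; the helper names, vector sizes, and sample count are all assumptions.

```python
import math
import random

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def exact_Z(a, b):
    """Sum exp(a_u . b_i) over every user-item pair (cost |U| * |I|)."""
    return sum(math.exp(dot(a_u, b_i)) for a_u in a for b_i in b)

def sampled_Z(a, b, num_samples, rng):
    """Estimate Z from uniformly sampled (u, i) pairs.

    The mean of exp(a_u . b_i) over uniform pairs is Z / (|U| * |I|),
    so scaling the sample mean by |U| * |I| estimates Z.
    """
    total = 0.0
    for _ in range(num_samples):
        a_u = rng.choice(a)
        b_i = rng.choice(b)
        total += math.exp(dot(a_u, b_i))
    return len(a) * len(b) * total / num_samples

# Hypothetical small problem: 50 users, 80 items, 4-dimensional vectors.
rng = random.Random(0)
a = [[rng.gauss(0, 0.3) for _ in range(4)] for _ in range(50)]
b = [[rng.gauss(0, 0.3) for _ in range(4)] for _ in range(80)]

z_exact = exact_Z(a, b)
z_est = sampled_Z(a, b, 2000, rng)
rel_err = abs(z_est - z_exact) / z_exact
print(z_exact, z_est, rel_err)
```

The estimate of Z itself is unbiased, but $\log Z$ computed from it is not, and uniform sampling can have high variance when a few pairs dominate the sum, so this should be read as a starting point only.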

With the estimated Z, the gradient for updating $a_u$ is

$\frac{\partial L}{\partial a_u} = \left(\sum_{i} N_{ui} b_i\right) - \frac{T}{Z} \left(\sum_i b_i \exp(a_u^T b_i)\right)$
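
To double-check the gradient, the sketch below compares it against a central finite difference. It takes the log-likelihood to be $L = \sum_{(u,i) \in R} N_{ui} a_u^T b_i - T \log Z$ with $T = \sum_{(u,i) \in R} N_{ui}$, and computes Z exactly since the toy problem is tiny. All names and numbers are illustrative assumptions.

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def partition(a, b):
    return sum(math.exp(dot(a_u, b_i)) for a_u in a for b_i in b)

def log_likelihood(counts, a, b):
    T = sum(counts.values())
    observed = sum(n * dot(a[u], b[i]) for (u, i), n in counts.items())
    return observed - T * math.log(partition(a, b))

def grad_a_u(counts, a, b, u):
    """Gradient of L w.r.t. a_u: sum_i N_ui b_i - (T/Z) sum_i b_i exp(a_u . b_i)."""
    T = sum(counts.values())
    Z = partition(a, b)
    dim = len(a[u])
    grad = [0.0] * dim
    for (uu, i), n in counts.items():
        if uu == u:
            for k in range(dim):
                grad[k] += n * b[i][k]
    for b_i in b:
        w = T / Z * math.exp(dot(a[u], b_i))
        for k in range(dim):
            grad[k] -= w * b_i[k]
    return grad

# Hypothetical toy problem: 2 users, 3 items, 2-dimensional vectors.
a = [[0.1, -0.2], [0.3, 0.4]]
b = [[0.5, 0.1], [-0.3, 0.2], [0.0, -0.4]]
counts = {(0, 0): 3, (0, 2): 1, (1, 1): 2}

analytic = grad_a_u(counts, a, b, 0)

# Central finite difference on each coordinate of a_0.
eps = 1e-5
numeric = []
for k in range(len(a[0])):
    a_plus = [row[:] for row in a]
    a_minus = [row[:] for row in a]
    a_plus[0][k] += eps
    a_minus[0][k] -= eps
    diff = log_likelihood(counts, a_plus, b) - log_likelihood(counts, a_minus, b)
    numeric.append(diff / (2 * eps))

print(analytic, numeric)
```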

