Music-streaming service Spotify says shifting to Hortonworks' distribution of Hadoop will help it advance the analytics behind personalising user services and providing data to record companies.
Spotify's Hadoop infrastructure, which stood at about 30 nodes five years ago, is now described as Europe's largest commercial cluster, consisting of 690 nodes storing data from more than 24 million active users and six million subscribers.
The Stockholm- and London-headquartered company plans to start the migration from Cloudera to the Hortonworks Data Platform (HDP) distribution of the open-source Apache Hadoop distributed computing platform at the end of October.
Hortonworks will provide the in-house team with service and support, along with six-monthly assessments of the Spotify Hadoop cluster.
Figures published in July by data-integration company Syncsort showed Cloudera leading the field with 41 percent of the framework's use in Europe, followed by core Apache code on 30 percent, Hortonworks on 18 percent and MapR on nine percent.
According to Spotify team lead for data infrastructure Wouter de Bie, the true open-source approach adopted by Hortonworks is a key factor in Spotify's decision to go with HDP.
"The work they have done to improve the Apache Hive data warehouse system also aligns well with our needs, as we use Hive extensively for ad-hoc queries and for the analysis of large datasets," de Bie said in a statement.
HDP will be running on the Debian operating system. Hortonworks said its work with Spotify will enable Hortonworks to offer HDP to customers running either the Debian or Ubuntu operating systems in the future.
Spotify, which was developed in 2006 and launched publicly in 2008, analyses users' listening habits to personalise its services and deliver reports about music downloads to record companies, including EMI, Sony, Universal and Warner Music Group.
Spotify's initial efforts with Hadoop were confined to a few nodes hosted in Amazon's Elastic MapReduce (EMR) cloud, but when the cluster grew to about 40 nodes the company encountered problems running the distributed computing platform because of the heterogeneous nature of its environment.
"At that time we needed to scale fast, but we didn't know where we would end up. After 18 months we noticed that data had exploded at Spotify and that the costs were pretty high at Amazon," a Spotify spokesman said.
"Also, we didn't have the right tooling in place so that developers could easily develop on top of EMR. Since all our other services are in-house as well, we decided to start with a 60-node cluster in-house. That has expanded to 690 nodes over the past 18 months."