You're reading from Mastering Spark for Data Science

Product type: Book
Published in: Mar 2017
Publisher: Packt
ISBN-13: 9781785882142
Edition: 1st Edition
Authors (4):
Andrew Morgan

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has worked designing systems for some of its most prestigious players and their global clients, often on large, complex, and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist and the inventor of the TrendCalculus algorithm, which was developed as part of his ongoing research project investigating long-range predictions based on machine learning of the patterns found in drifting cultural, geopolitical, and economic trends. He sits on the Hadoop Summit EU data science selection committee and has spoken at many conferences on a variety of data topics. He also enjoys participating in the Data Science and Big Data communities where he lives in London.

Antoine Amend

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data at the early stages of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of more than 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

Matthew Hallett

Matthew Hallett is a Software Engineer and Computer Scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures, in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

David George

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity. Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.


Chapter 8. Building a Recommendation System

If one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere; their popularity is down to their versatility, usefulness, and broad applicability. Whether they are used to recommend products based on a user's shopping behavior or to suggest new movies based on viewing preferences, recommenders are now a fact of life. It is even possible that this book was magically suggested based on what marketing companies know about you, such as your social network preferences, your job status, or your browsing history.

In this chapter, we will demonstrate how to recommend music content using raw audio signals. For that purpose, we will cover the following topics:

  • Using Spark to process audio files stored on HDFS (a minimal sketch follows this list)

  • Learning about the Fourier transform for audio signal transformation

  • Using Cassandra as a caching layer between online and offline...
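
The first topic above can be sketched briefly. The snippet below is a minimal, illustrative example of reading whole audio files from HDFS as binary records with Spark; the hdfs:///data/songs path and the .wav naming convention are assumptions made for the example, not the chapter's actual layout.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AudioIngest").getOrCreate()
val sc = spark.sparkContext

// binaryFiles yields one (path, PortableDataStream) record per file, which
// suits .wav files that comfortably fit in memory one at a time
val audioFiles = sc.binaryFiles("hdfs:///data/songs/*.wav")

// Keep a song identifier alongside the raw bytes of each file
val audioBytes = audioFiles.map { case (path, stream) =>
  val songId = path.split("/").last.stripSuffix(".wav") // hypothetical naming
  (songId, stream.toArray())
}

println(s"Loaded ${audioBytes.count()} audio files from HDFS")
```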

Different approaches


The end goal of a recommendation system is to suggest new items based on a user's historical usage and preferences. The basic idea is to use a ranking for any product that a customer has been interested in previously. This ranking can be explicit (asking a user to rank a movie from 1 to 5) or implicit (how many times a user visited a page). Whether it is a product to buy, a song to listen to, or an article to read, data scientists usually address this issue from two different angles: collaborative filtering and content-based filtering.

Collaborative filtering

Using this approach, we leverage big data by collecting more information about the behavior of people. Although an individual is by definition unique, their shopping behavior is usually not, and some similarities can always be found with others. The recommended items will be targeted for a particular individual, but they will be derived by combining the user's behavior with that of similar users. This is the famous...
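
As an aside, the canonical collaborative filtering implementation that ships with Spark is alternating least squares (ALS). The chapter itself takes a content-based route built on audio signatures, but the following hedged sketch shows the explicit-rating flavour described above; the toy ratings, column names, and parameter values are invented for illustration.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CFExample").getOrCreate()
import spark.implicits._

// (userId, songId, rating): explicit ratings in this toy example;
// implicit feedback such as play counts would use setImplicitPrefs(true)
val ratings = Seq(
  (1, 10, 5.0), (1, 11, 3.0),
  (2, 10, 4.0), (2, 12, 2.0),
  (3, 11, 5.0), (3, 12, 4.0)
).toDF("userId", "songId", "rating")

val als = new ALS()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("songId")
  .setRatingCol("rating")

val model = als.fit(ratings)

// Predict how users would rate songs they have not listened to yet
val candidates = Seq((1, 12), (2, 11)).toDF("userId", "songId")
model.transform(candidates).show()
```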

Uninformed data


The following technique could be seen as something of a game changer in how most modern data scientists work. While it is common to work with structured and unstructured text, it is less common to work on raw binary data, the reason being the gap between computer science and data science. Textual processing is limited to a standard set of operations that most will be familiar with: acquiring, parsing, storing, and so on. Instead of restricting ourselves to these operations, we will work directly with audio, transforming and enriching the uninformed signal data into informed transcriptions. In doing this, we enable a new type of data pipeline that is analogous to teaching a computer to hear the voices in audio files.
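
As a rough illustration of turning uninformed bytes into something that can be analyzed, the sketch below decodes WAV bytes into numeric PCM samples using the standard javax.sound API. It assumes 16-bit little-endian PCM, the most common WAV layout; anything else would need the richer handling that the API provides.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import javax.sound.sampled.AudioSystem

// Decode a .wav byte array into signed 16-bit PCM samples (assumed little-endian);
// stereo files yield interleaved left/right samples
def wavBytesToSamples(bytes: Array[Byte]): Array[Double] = {
  val audio = AudioSystem.getAudioInputStream(new ByteArrayInputStream(bytes))
  val out = new ByteArrayOutputStream()
  val chunk = new Array[Byte](4096)
  var n = audio.read(chunk)
  while (n != -1) { out.write(chunk, 0, n); n = audio.read(chunk) }
  audio.close()

  // Combine each pair of bytes into one signed 16-bit sample
  out.toByteArray.grouped(2).collect { case Array(lo, hi) =>
    (((hi & 0xff) << 8) | (lo & 0xff)).toShort.toDouble
  }.toArray
}

// Applied to the (songId, bytes) pairs read from HDFS earlier:
// val samples = audioBytes.mapValues(wavBytesToSamples)
```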

A second (breakthrough) idea that we encourage here is a shift in thinking about how data scientists engage with Hadoop and big data today. While many still consider these technologies as just another database, we want to showcase the vast array...

Building a song analyzer


Before deep diving into the recommender itself, however, the reader may have noticed an important property that we were able to extract from the signal data. Since we generated audio signatures at regular time intervals, we can compare signatures and find potential duplicates. For example, given a random song, we should be able to guess its title based on previously indexed signatures. In fact, this is the exact approach taken by many companies when providing music recognition services. To take it one step further, we could potentially provide insight into a band's musical influences, or perhaps even identify song plagiarism, once and for all settling the Stairway to Heaven dispute between Led Zeppelin and the American rock band Spirit (http://consequenceofsound.net/2014/05/did-led-zeppelin-steal-stairway-to-heaven-legendary-rock-band-facing-lawsuit-from-former-tourmates/).
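
To make the idea of time-interval signatures concrete, here is a hedged sketch: window the decoded samples, take a Fourier transform of each window with Commons Math (a dependency Spark already ships with), and keep the dominant bin of each coarse frequency band as a hash-like token. The window size, band edges, and token encoding are illustrative choices, not the exact scheme used later in the chapter.

```scala
import org.apache.commons.math3.transform.{DftNormalization, FastFourierTransformer, TransformType}

// Window size must be a power of two for the Commons Math FFT;
// band edges (in FFT bins) are arbitrary illustrative choices
val windowSize = 1024
val bands = Seq((0, 40), (40, 120), (120, 300), (300, 512))

def signature(samples: Array[Double]): Seq[String] = {
  val fft = new FastFourierTransformer(DftNormalization.STANDARD)
  samples.grouped(windowSize).filter(_.length == windowSize).map { window =>
    // Magnitude spectrum of this window
    val spectrum = fft.transform(window, TransformType.FORWARD).map(_.abs())
    // Keep the dominant bin of each band and join them into a hash-like token
    bands.map { case (lo, hi) => (lo until hi).maxBy(i => spectrum(i)) }.mkString("-")
  }.toSeq
}

// Two songs sharing many of these per-window tokens are strong duplicate candidates
```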

With this in mind, we will take a detour from our recommendation use case by continuing...

Building a recommender


Now that we've explored our song analyzer, let's get back on track with the recommendation engine. As discussed earlier, we would like to recommend songs based on frequency hashes extracted from audio signals. Taking as an example the dispute between Led Zeppelin and Spirit, we would expect both songs to be relatively close to each other, as the allegation is that they share a melody. Using this thought as our main assumption, we could potentially recommend Taurus to someone interested in Stairway to Heaven.
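
One simple way to express "relatively close" is to count the signature hashes that two songs have in common. The sketch below assumes an RDD named hashes of (hashToken, songId) pairs, as could be produced by flat-mapping the signatures from the earlier sketch.

```scala
// `hashes` is assumed to be an RDD[(String, String)] of (hashToken, songId) pairs
val songPairs = hashes
  .join(hashes)                               // (hashToken, (songA, songB))
  .filter { case (_, (a, b)) => a < b }       // drop self-joins and mirrored pairs
  .map { case (_, pair) => (pair, 1L) }
  .reduceByKey(_ + _)                         // ((songA, songB), sharedHashCount)

// The most similar pairs, where we would expect Stairway to Heaven and Taurus to rank highly
songPairs.sortBy(_._2, ascending = false).take(10).foreach(println)
```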

The PageRank algorithm

Instead of recommending a specific song, we will recommend playlists. A playlist would consist of all our songs ranked from most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing...
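
A minimal GraphX sketch of that idea follows: songs become vertices, shared-hash pairs become edges, and PageRank scores every song for playlist ordering. The hashCode-based vertex IDs and the songPairs RDD are illustrative assumptions; note that GraphX's built-in pageRank treats every outgoing edge uniformly, so a weight-aware variant would need the Pregel API.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Songs become vertices, pairs sharing hashes become edges. Using hashCode as a
// vertex ID is a shortcut for illustration only.
val edges = songPairs.map { case ((a, b), shared) =>
  Edge(a.hashCode.toLong, b.hashCode.toLong, shared.toDouble)
}

val graph = Graph.fromEdges(edges, defaultValue = 0.0)

// resetProb = 0.15 is the classic teleport probability: the listener occasionally
// jumps to an unrelated song instead of following a similar one
val ranks = graph.pageRank(tol = 0.001, resetProb = 0.15).vertices

// Highest-ranked songs lead the playlist
ranks.sortBy(_._2, ascending = false).take(20).foreach(println)
```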

Summary


While our recommendation system may not have taken the typical textbook approach, nor may it be the most accurate recommender possible, it does represent a fully demonstrable and incredibly interesting approach to one of the most commonplace techniques in data science today. Further, with persistent data storage, a REST API interface, distributed shared memory caching, and a modern web 2.0-based user interface, it provides a reasonably complete and rounded candidate solution.

Of course, building a production-grade product out of this prototype would still require much effort and expertise. There are still improvements to be sought in the area of signal processing. For example, one could improve the sound pressure and reduce the signal noise by using a loudness filter (http://languagelog.ldc.upenn.edu/myl/StevensJASA1955.pdf), by extracting pitches and melodies, or, most importantly, by converting the stereo signal to mono.
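
Of those, the stereo-to-mono conversion is the simplest to sketch: assuming the samples were decoded as an interleaved [L, R, L, R, ...] array, averaging each left/right pair yields a mono signal.

```scala
// Fold interleaved stereo PCM samples [L, R, L, R, ...] down to mono by averaging
// each left/right pair; assumes the interleaved layout of the decoding sketch earlier
def stereoToMono(samples: Array[Double]): Array[Double] =
  samples.grouped(2).collect { case Array(left, right) =>
    (left + right) / 2.0
  }.toArray
```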

Note

All these processes are actually part of an active area of...
