You're reading from Hands-On Recommendation Systems with Python

Product typeBook

Published inJul 2018

Reading LevelExpert

PublisherPackt

ISBN-139781788993753

Edition1st Edition

Languages

Python

Tools

TensorFlow Scikit-learn

Concepts

Machine Learning

Author (1)

Rounak Banik

Manipulating Data with the Pandas Library

In the next few portions of the book, we are going to get our hands dirty by building the various kinds of recommender systems that were introduced in chapter one. However, before we do so, it is important that we know how to handle, manipulate, and analyze data efficiently in Python.

The datasets we'll be working with will be several megabytes in size. Historically, Python has never been well-known for its speed of execution. Therefore, analyzing such huge amounts of data using vanilla Python and the built-in data structures it provides us is simply impossible.

In this chapter, we're going to get ourselves acquainted with the pandas library, which aims to overcome the aforementioned limitations, making data analysis in Python extremely efficient and user-friendly. We'll also introduce ourselves to the Movies Dataset that...

Technical requirements

You will be required to have Python installed on a system. Finally, to use the Git repository of this book, the user needs to install Git.

The code files of this chapter can be found on GitHub:
https://github.com/PacktPublishing/Hands-On-Recommendation-Systems-with-Python .

Check out the following video to see the code in action:

http://bit.ly/2LoZEUj .

Setting up the environment

Before we start coding, we should probably set up our development environment. For data scientists and analysts using Python, the Jupyter Notebook is, by far, the most popular tool for development. Therefore, we strongly advise that you use this environment.

We will also need to download the pandas library. The easiest way to obtain both is to download Anaconda. Anaconda is a distribution that comes with the Jupyter software and the SciPy packages (which includes pandas).

You can download the distribution here: https://www.anaconda.com/download/.

The next step is to create a new folder (I'm going to name it RecoSys) in your desired location. This will be the master folder that contains all the code we write as part of this book. Within this folder, create another folder named Chapter2, which will contain all the code we write as part of this chapter...

The Pandas library

Pandas is a package that gives us access to high-performance, easy-to-use tools and data structures for data analysis in Python.

As we stated in the introduction, Python is a slow language. Pandas overcomes this by implementing heavy optimization using the C programming language. It also gives us access to Series and DataFrame, two extremely powerful and user-friendly data structures imported from the R Statistical Package.

Pandas also makes importing data from external files into the Python environment a breeze. It supports a wide variety of formats, such as JSON, CSV, HDF5, SQL, NPY, and XLSX.

As a first step toward working with pandas, let's import our movies data into our Jupyter Notebook. To do this, we need the path to where our dataset is located. This can be a URL on the internet or your local computer. We highly recommend downloading the data...

The Pandas DataFrame

As we saw in the previous section, the df.head() code outputted a table-like structure. In essence, the DataFrame is just that: a two-dimensional data structure with columns of different data types. You can think of it as an SQL Table. Of course, just being a table of rows and columns isn't what makes the DataFrame special. The DataFrame gives us access to a wide variety of functionality, some of which we're going to explore in this section.

Each row in our DataFrame represents a movie. But how many movies are there? We can find this out by running the following code:

#Output the shape of df
df.shape

OUTPUT:
(45466, 24)

The result gives us the number of rows and columns present in df. We can see that we have data on 45,466 movies.

We also see that we have 24 columns. Each column represents a feature or a piece of metadata about the movie. When we ran...

The Pandas Series

When we accessed the Jumanji movie using .loc and .iloc, the data structures returned to us were Pandas Series objects. You may have also noticed that we were accessing entire columns using df[column_name]. This, too, was a Pandas Series object:

type(small_df['year'])

OUTPUT:
pandas.core.series.Series

The Pandas Series is a one-dimensional labelled array capable of holding data of any type. You may think of it as a Python list on steroids. When we were using the .apply() and .astype() methods in the previous section, we were actually using them on these Series objects.

Therefore, like the DataFrame, the Series object comes with its own group of extremely useful methods that make data analysis a breeze.

First, let's check out the shortest- and longest-running movies of all time. We will do this by accessing the runtime column of the DataFrame as...

Summary

In this chapter, we gained an understanding of the limitations of using vanilla Python and its built-in data structures. We acquainted ourselves with the Pandas library and learned how it overcomes the aforementioned difficulties by giving us access to extremely powerful and easy-to-use data structures. We then explored the two main data structures, Series and DataFrame, by analyzing our movies-metadata dataset.

In the next chapter, we will use our newfound skills to build an IMDB Top 250 Clone and its variant, a type of knowledge-based recommender.

The rest of the chapter is locked

You have been reading a chapter from

Hands-On Recommendation Systems with Python

Published in: Jul 2018Publisher: PacktISBN-13: 9781788993753

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Rounak Banik

Rounak Banik is a Young India Fellow and an ECE graduate from IIT Roorkee. He has worked as a software engineer at Parceed, a New York start-up, and Springboard, an EdTech start-up based in San Francisco and Bangalore. He has also served as a backend development instructor at Acadview, teaching Python and Django to around 35 college students from Delhi and Dehradun. He is an alumni of Springboard's data science career track. He has given talks at the SciPy India Conference and published popular tutorials on Kaggle and DataCamp.
Read more about Rounak Banik

Other recommended products

Related to this chapter

Python Machine Learning Workbook for Beginners

Through a series of machine learning and data science projects, this book represents a beginner-friendly crash course to Python’s practical application in businesses and your own career.

BookMar 2021279 pages

Mastering Predictive Analytics with scikit-learn and TensorFlow

In this book, you will find a range of methods to improve the performance of almost any predictive model, from ensemble methods to dimensionality reduction and cross-validation. You will learn the tools to produce advanced predictive models. In addition, you will dive into the exiting field of Deep Learning using TensorFlow.

BookSep 2018154 pages

Machine Learning with Scala Quick Start Guide

Scala as a programming language is a highly scalable integration of object-oriented and functional programming, which makes it easy to build scalable and complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to train effective machine learning models using this popular language.

BookApr 2019220 pages

R Data Analysis Projects

R offers a large variety of packages and libraries for fast and accurate data analysis and visualization. As a result, it is one of the most popularly used languages by data scientists and analysts, or anyone who wants to perform data analysis. In this book, we show you just how to do that - with the help of practical implementations of real-world use cases.

BookNov 2017366 pages

Supervised Machine Learning with Python

A supervised learning task infers a function from flagged training data and maps an input to an output based on sample input-output pairs. In this book, you will learn various machine learning techniques (such as linear and logistic regression) and gain the practical knowledge you need to quickly and powerfully apply algorithms to new problems.

BookMay 2019162 pages

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

This book covers the theory and practice of building data-driven solutions. Includes the end-to-end process, using supervised and unsupervised algorithms. With each algorithm, you will learn the data acquisition and data engineering methods, the apt metrics, and the available hyper-parameters. You will learn how to deploy the models in production.

BookJul 2020384 pages

R Machine Learning Projects

The purpose of the book is to help a machine learning practitioner gets hands-on experience in working with real-world data and apply modern machine learning algorithms. You will learn to implement each algorithm to a specific industry problem. It covers projects involving both supervised as well as unsupervised learning approaches.

BookJan 2019334 pages

Hands-On Data Science and Python Machine Learning

This book will help you take your first steps in the world of data science. It will empower you to conduct data analysis and perform efficient machine learning using Python. You will gain value from your data using the various data mining and data analysis techniques in Python, and develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark.

BookJul 2017420 pages

Machine Learning Solutions

This book demonstrates a set of simple to complex problems you may encounter while building machine learning models. You'll not only learn the best possible solutions to these problems but also find out how to build projects based on each problem mentioned in the book, with a practical approach and easy-to-follow examples.

BookApr 2018566 pages

Feature Engineering Made Easy

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective.

BookJan 2018316 pages

Learning Data Mining with Python

The next step in the information age is to gain insights from the deluge of data coming our way. Data mining provides a way of finding these insights, and Python is one of the most popular languages for data mining because it provides both power and flexibility in analysis.

BookApr 2017358 pages

Machine Learning with Spark

Spark ML is the machine learning module of Spark. It uses in-memory RDDs to process machine learning models faster for clustering, classification, and regression.

BookApr 2017532 pages

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages