You're reading from Hands-On Data Analysis with Pandas - Second Edition

Product type: Book
Published in: Apr 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800563452
Edition: 2nd Edition
Author (1)
Stefanie Molin

Stefanie Molin is a data scientist and software engineer at Bloomberg LP in NYC, tackling tough problems in information security, particularly revolving around anomaly detection, building tools for gathering data, and knowledge sharing. She has extensive experience in data science, designing anomaly detection solutions, and utilizing machine learning in both R and Python in the AdTech and FinTech industries. She holds a B.S. in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, with minors in economics, and entrepreneurship and innovation. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.

Chapter 12: The Road Ahead

Throughout this book, we have covered a lot of material, and you are now capable of performing data analysis and machine learning tasks entirely in Python. We began our journey by learning about some introductory statistics and how to set up our environment for data science in Python. Then, we learned about the basics of using pandas and how to bring data into Python. With this knowledge, we were able to work with APIs, read from files, and query databases to grab data for our analyses.

After we collected our data, we learned how to perform data wrangling in order to clean up our data and get it into a usable format. Next, we learned how to work with time series and combine data from different sources as well as aggregate it. Once we had a good handle on data wrangling, we moved on to visualizations and used pandas, matplotlib, and seaborn to create a variety of plot types, and we also learned how to customize them.

Armed with this knowledge, we were...

Data resources

As with any skill, to get better we need to practice, which for us means we need to find data to practice on. There is no best dataset to practice with; rather, each person should find data that they are interested in exploring. While this section is by no means comprehensive, it contains resources for data from various topics in the hopes that everyone will find something they want to use.

Tip

Unsure of what kind of data to look for? What are some of the things you have wondered about related to a topic that you find interesting? Has data been collected on this topic, and can you access it? Let your curiosity guide you.

Python packages

Both seaborn and scikit-learn provide built-in sample datasets that you can experiment with in order to get more practice with the material we've covered in the book and to try out new techniques. These datasets are often very clean and thus easy to work with. Once you're comfortable with the techniques, you can...
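
As a minimal sketch of what loading one of these built-in datasets can look like, the snippet below pulls the iris dataset that ships with scikit-learn into a pandas DataFrame. The choice of iris and the variable names are assumptions made for illustration, not something prescribed by the chapter. The seaborn counterpart, sns.load_dataset(), fetches its data over the internet, so it is left out here to keep the example runnable offline.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus an integer 'target' column;
# copy it so we don't modify the object scikit-learn handed back
iris_df = iris.frame.copy()

# map the integer labels to the species names stored on the dataset
label_names = dict(enumerate(iris.target_names))
iris_df['species'] = iris_df['target'].map(label_names)

# quick look at the size of the data and the class balance
print(iris_df.shape)
print(iris_df['species'].value_counts())
```

From here, the DataFrame can be explored with the same pandas, matplotlib, and seaborn techniques used throughout the book.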

Practicing working with data

Throughout this book, we have worked with various datasets from different sources with step-by-step instructions. It doesn't have to stop here, though. This section is dedicated to some resources that can be used to continue with guided instruction and, eventually, work toward building a model for a predefined problem.

Kaggle (https://www.kaggle.com/) offers content for learning data science, datasets for exploration that are shared by members of the community, and competitions that have been posted by companies—perhaps the Netflix recommendation contest sounds familiar (https://www.kaggle.com/netflix-inc/netflix-prize-data)? These contests are a great way for you to practice your machine learning skills and become more visible in the community (especially to potential employers).

Important note

Kaggle isn't the only place you can participate in data science competitions. Some additional ones are listed at https://towardsdatascience...

Python practice

We have seen throughout this book that working with data in Python isn't just pandas, matplotlib, and numpy; there are many ways our workflow can benefit from us being strong Python programmers in general. With strong Python skills, we can build web applications with Flask, make requests of an API, efficiently iterate over combinations or permutations, and find ways to speed up our code. While this book didn't focus on honing these skills directly, here are some free resources for practicing with Python and thinking like a programmer:

While not free, Python Morsels (https://www.pythonmorsels.com/) provides weekly Python exercises that will help you learn to write more Pythonic code and get more familiar with the Python standard library. Exercises vary in difficulty but can be set to a higher...
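
To illustrate one of the general Python skills mentioned at the start of this section, iterating over combinations or permutations, here is a small sketch using the standard library's itertools module. The column names are hypothetical and stand in for whatever fields you might want to compare in your own data.

```python
from itertools import combinations, permutations

# hypothetical column names, used only for illustration
columns = ['open', 'high', 'low', 'close']

# unordered pairs: ('open', 'high') and ('high', 'open') count once
pairs = list(combinations(columns, 2))

# ordered pairs: both directions are produced
orderings = list(permutations(columns, 2))

print(len(pairs))      # 6
print(len(orderings))  # 12
```

Both functions return lazy iterators, so on larger inputs you can loop over them directly instead of building the full list in memory.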

Summary

This chapter provided you with many places where you can find datasets across myriad topics. In addition, you learned about various websites where you can take courses and work through tutorials, practice machine learning, and improve your Python skills. It's important to keep your skills sharp and stay curious, so, for whatever interests you, look for data and perform your own analyses. These are things you can put on your GitHub account as your data portfolio.

Thank you for reading this book! I hope you got just as much out of it as these two data-analyzing pandas.

Exercises

The exercises in this chapter are open-ended—no solutions are provided. They are meant to give you some ideas so that you can get started on your own:

  1. Practice machine learning classification by participating in the Titanic challenge on Kaggle at https://www.kaggle.com/c/titanic.
  2. Practice machine learning regression techniques by participating in the housing prices challenge on Kaggle at https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
  3. Perform an analysis on something that interests you. Some interesting ideas include the following:

    a) Predicting likes on Instagram: https://towardsdatascience.com/predict-the-number-of-likes-on-instagram-a7ec5c020203

    b) Analyzing delays of NJ transit trains: https://medium.com/@pranavbadami/how-data-can-help-fix-nj-transit-c0d15c0660fe

    c) Using visualizations to solve data science problems: https://towardsdatascience.com/solving-a-data-science-challenge-the-visual-way-355cfabcb1c5

  4. Complete five...

Further reading

You can consult the following blogs and articles to stay up to date with Python and data science:

The following resources contain information for learning how to build custom scikit-learn...
