
Practical Data Science with Python

By Nathan George
About this book
Practical Data Science with Python teaches you core data science concepts, with real-world and realistic examples, and strengthens your grip on both basic and advanced principles of data preparation and storage, statistics, probability theory, machine learning, and Python programming, helping you build a solid foundation for proficiency in data science. The book starts with an overview of basic Python skills and then introduces foundational data science techniques, followed by a thorough explanation of the Python code needed to execute them. You'll come to understand the code by working through the examples, which are broken into small chunks (a few lines or a function at a time) to enable thorough discussion. As you progress, you will learn how to perform data analysis while exploring the functionalities of key data science Python packages, including pandas, SciPy, and scikit-learn. Finally, the book covers ethics and privacy concerns in data science and suggests resources for improving data science skills, as well as ways to stay up to date on new developments. By the end of the book, you should be able to comfortably use Python for basic data science projects and have the skills to execute the data science process on any data source.
Publication date:
September 2021
Publisher
Packt
Pages
620
ISBN
9781801071970


Introduction to Data Science

Data science is a thriving and rapidly expanding field, as you probably already know. People are starting to come to a consensus that everyone should have some basic data science skills, sometimes called "data literacy." This book is intended to get you up to speed with the basics of data science using the most popular programming language for doing data science today: Python. In this first chapter, we will cover:

  • The history of data science
  • The top tools and skills used in data science, and why these are used
  • Specializations within and related to data science
  • Best practices for managing a data science project

Data science is used in a variety of ways. Some data scientists focus on the analytics side of things, pulling out hidden patterns and insights from data, then communicating these results with visualizations and statistics. Others work on creating predictive models in order to predict future events, such as predicting whether someone will put solar panels on their house. Yet others work on models for classification; for example, classifying the make and model of a car in an image. One thing ties all applications of data science together: the data. Anywhere you have enough data, you can use data science to accomplish things that seem like magic to the casual observer.


The data science origin story

There's a saying in the data science community that's been around for a while, and it goes: "A data scientist is better than any computer scientist at statistics, and better than any statistician at computer programming." This encapsulates the general skills of most data scientists, as well as the history of the field.

Data science combines computer programming with statistics; some even call data science applied statistics, and some statisticians contend that data science is nothing more than statistics. So, while we might say data science dates back to the roots of statistics in the 19th century, the roots of modern data science actually begin around the year 2000. At this time, the internet was beginning to bloom, and with it came the advent of big data. The amount of data generated from the web resulted in the new field of data science being born.

A brief timeline of key historical data science events is as follows:

  • 1962: John Tukey writes The Future of Data Analysis, where he envisions a new field for learning insights from data
  • 1977: Tukey publishes the book Exploratory Data Analysis, which is a key part of data science today
  • 1991: Guido Van Rossum publishes the Python programming language online for the first time, which goes on to become the top data science language used at the time of writing
  • 1993: The R programming language is publicly released, which goes on to become the second most-used data science general-purpose language
  • 1996: The International Federation of Classification Societies holds a conference titled "Data Science, Classification and Related Methods" – possibly the first time "data science" was used to refer to something similar to modern data science
  • 1997: Jeff Wu proposes renaming statistics "data science" in an inauguration lecture at the University of Michigan
  • 2001: William Cleveland publishes a paper describing a new field, "data science," which expands on data analysis
  • 2008: Jeff Hammerbacher and DJ Patil use the term "data scientist" in job postings after trying to come up with a good job title for their work
  • 2010: Kaggle.com launches as an online data science community and data science competition website
  • 2010s: Universities begin offering master's and bachelor's degrees in data science; data science job postings explode to new heights year after year; big breakthroughs are made in deep learning; the number of data science software libraries and publications burgeons
  • 2012: Harvard Business Review publishes the famous article Data Scientist: The Sexiest Job of the 21st Century, which adds fuel to the data science fire
  • 2015: DJ Patil becomes the Chief Data Scientist of the US for two years
  • 2015: TensorFlow (a deep learning and machine learning library) is released
  • 2018: Google releases Cloud AutoML, democratizing a new automatic technique for machine learning and data science
  • 2020: Amazon SageMaker Studio, a cloud tool for building, training, deploying, and analyzing machine learning models, is released

We can make a few observations from this timeline. For one, the idea of data science was around for several decades before it became wildly popular. People foresaw that future society would need something like data science, but it wasn't until the amount of digital data became so widespread and easily accessible that data science could actually be used productively. We also note that the two most widely used programming languages in data science, Python and R, existed for 15 years before the field of data science existed in earnest, after which they rapidly took off in use as data science languages.

There is another trend happening in data science: the rise of data science competitions. The first online data science competition platform, Kaggle.com, launched in 2010; it has since been acquired by Google and continues to grow. Kaggle offers cash prizes for machine learning competitions (often 10k USD or more) and hosts a large community of data science practitioners and learners. Several other websites also run data science competitions, often with cash prizes as well. Looking at other people's code (especially the winners' code, if available) can be a good way to learn new data science techniques and tricks. Here are some of the main websites hosting data science competitions:

  • Kaggle
  • Analytics Vidhya
  • HackerRank
  • DrivenData (focused on social good)
  • AIcrowd
  • CodaLab
  • Topcoder
  • Zindi
  • Tianchi
  • Several other specialized competitions, like Microsoft's COCO

A couple of websites that list data science competitions are:

ods.ai

www.mlcontests.com

Shortly after Kaggle was launched in 2010, universities started offering master's and then bachelor's degrees in data science. At the same time, a plethora of online resources and books have been released, teaching data science in a variety of ways.

As we can see, in the late 2010s and early 2020s, some aspects of data science started to become automated. This scares people who think data science might become fully automated soon. While some aspects of data science can be automated, it is still necessary to have someone with the data science know-how in order to properly use automated data science systems. It's also useful to have the skills to do data science from scratch by writing code, which offers ultimate flexibility. A data scientist is also still needed for a data science project in order to understand business requirements, implement data science products in production, and communicate the results of data science work to others.

Automated data science tools include automatic machine learning (AutoML) through Google Cloud, Amazon's AWS, Azure, H2O, and more. With AutoML, we can screen several machine learning models quickly in order to optimize predictive performance. Automated data cleaning is also being developed. At the same time that this automation is happening, we are also seeing a desire by companies to build "data literacy" among their employees. This "data literacy" means understanding some basic statistics and data science techniques, such as utilizing modern digital data and tools to benefit the organization by converting data into information. Practically speaking, this means we can take data from an Excel spreadsheet or database and create statistical visualizations and machine learning models to extract meaning from the data. In more advanced cases, this can mean creating predictive machine learning models that are used to guide decision making or can be sold to customers.

As we move into the future with data science, we will likely see an expansion of the toolsets available and automation of mundane work. We also anticipate organizations will increasingly expect their employees to have "data literacy" skills, including basic data science knowledge and techniques.

This should help organizations make better data-driven decisions, improve their bottom lines, and be able to utilize their data more effectively.

If you're interested in reading further on the history and composition of data science, along with others' thoughts on the field, David Donoho's paper 50 Years of Data Science is a great resource. The paper can be found here:

http://courses.csail.mit.edu/18.337/2016/docs/50YearsDataScience.pdf


The top data science tools and skills

Drew Conway is famous for his data science Venn diagram from 2010, postulating that data science is a combination of hacking skills (programming/coding), math and statistics, and domain expertise. I'd also add business acumen and communications skills to the mix, and state that sometimes, domain expertise isn't really required upfront. To utilize data science effectively, we should know how to program, know some math/statistics, know how to solve business problems with data science, and know how to communicate results.

Python

In the field of data science, Python is king. It's the main programming language and tool for carrying out data science. This is in large part due to network effects: the more people that use Python, the better a tool Python becomes. As the Python network and technology grow, the effect snowballs and becomes self-reinforcing. The network effects arise from the large number of libraries and packages, related uses of Python (for example, DevOps, cloud services, and serving websites), the large and growing community around Python, and Python's ease of use. Python and the Python-based data science libraries and packages are free and open source, unlike many GUI solutions (such as Excel or RapidMiner).

Python is a very easy-to-learn language and is easy to use. This is in large part due to the syntax of Python – there aren't a lot of brackets to keep track of (like in Java), and the overall style is clean and simple. The core Python team also published an official style guide, PEP 8, which states that Python is meant to be easy to read (and hence, easy to write). The ease of learning and using Python means more people can join the Python community faster, growing the network.

Since Python has been around a while, there has been sufficient time for people to build up convenient libraries to take care of tasks that used to be tedious and involve lots of work. An example is the Seaborn package for plotting, which we will cover in Chapter 5, Exploratory Data Analysis and Visualization. In the early 2000s, the primary way to make plots in Python was with the Matplotlib package, which can be a bit painstaking to use at times. Seaborn was created around 2013 and abstracts several lines of Matplotlib code into single commands. This has been the case across the board for Python in data science. We now have packages and libraries to do all sorts of things, like AutoML (H2O, AutoKeras), plotting (Seaborn, Plotly), interacting with the cloud via software development kits or SDKs (Boto3 for AWS, Microsoft's Azure SDKs), and more. Contrast this with another top data science language, R, which does not have quite as strong network effects. AWS does not offer an official R SDK, for example, although there is an unofficial R SDK.
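The condensing effect described above is easy to see in practice. The following is a minimal sketch (with made-up random data, and assuming Seaborn and Matplotlib are installed) of a Seaborn one-liner that would take several steps in raw Matplotlib: a scatter plot with a fitted regression line and confidence band.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import numpy as np
import seaborn as sns

# Synthetic data: y is roughly a linear function of x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# One Seaborn call draws the scatter, fits a line, and shades a 95% CI band
ax = sns.regplot(x=x, y=y)
ax.set(xlabel="x", ylabel="y")
```

Doing the same in plain Matplotlib would mean manually fitting the line (for example, with `np.polyfit`), computing the confidence band, and issuing separate plotting calls for each layer.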

Similar to the variety of packages and libraries are all the ways to use Python. This includes the many distributions for installing Python, like Anaconda (which we'll use in this book). These Python distributions make installing and managing Python libraries easy and convenient, even across a wide variety of operating systems. After installing Python, there are several ways to write and interact with Python code in order to do data science. This includes the well-known Jupyter Notebook, which was first created exclusively for Python (but can now be used with a plethora of programming languages). There are many choices of integrated development environments (IDEs) for writing code; in fact, we can even use the RStudio IDE to write Python code. Many cloud services also make it easy to use Python within their platforms.

Lastly, the large community makes learning Python and writing Python code much easier. There is a huge number of Python tutorials on the web, thousands of books involving Python, and you can easily get help from the community on Stack Overflow and other specialized online support communities. We can see from the 2020 Kaggle data scientist survey results below in Figure 1.1 that Python was found to be the most-used language for machine learning and data science. In fact, I've used it to create most of the figures in this chapter! Although Python has some shortcomings, it has enormous momentum as the main data science programming language, and this doesn't appear to be changing any time soon.

Figure 1.1: The results from the 2020 Kaggle data science survey show Python is the top programming language used for data science, followed by SQL, then R, then a host of other languages.

Other programming languages

Many other programming languages for data science exist, and sometimes they are best to use for certain applications. Much like choosing the right tool to repair a car or bicycle, choosing the correct programming tool can make life much easier. One thing to keep in mind is that programming languages can often be intermixed. For example, we can run R code from within Python, or vice versa.

Speaking of R, it's the next-biggest general-purpose programming language for data science after Python. The R language has been around for about as long as Python, but originated as a statistics-focused language rather than a general-purpose programming language like Python. This means with R, it is often easier to implement classic statistical methods, like t-tests, ANOVA, and other statistical tests. The R community is very welcoming and also large, and any data scientist should really know the basics of how to use R. However, we can see that the Python community is larger than R's community from the number of Stack Overflow posts shown below in Figure 1.2 – Python has about 10 times more posts than R. Programming in R is enjoyable, and there are several libraries that make common data science tasks easy.

Figure 1.2: The number of Stack Overflow questions by programming language over time. The y-axis is a log scale since the number of posts is so different between less popular languages like Julia and more popular languages like Python and R.

Another key programming language in data science is SQL. We can see from the Kaggle machine learning and data science survey results (Figure 1.1) that SQL is actually the second most-used language after Python. SQL has been around for decades and is necessary for retrieving data from SQL databases in many situations. However, SQL is specialized for use with databases, and can't be used for more general-purpose tasks like Python and R can. For example, you can't easily serve a website with SQL or scrape data from the web with SQL, but you can with R and Python.
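The SQL-plus-Python pairing mentioned above can be sketched with Python's built-in sqlite3 module. The table and values here are made up for illustration; the point is that SQL handles the retrieval and aggregation while Python handles everything around it.

```python
import sqlite3

# An in-memory database with a small hypothetical sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 150.0), ("west", 80.0)],
)

# SQL does what it's best at: filtering and aggregating rows
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('east', 250.0), ('west', 80.0)]
```

In a real project, the same query pattern would run against a production database (for example, via a client library for PostgreSQL or MySQL), with the results flowing into pandas for further analysis.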

Scala is another programming language sometimes used for data science and is most often used in conjunction with Spark, a big data processing and analytics engine. Another language to keep on your radar is Julia. This is a relatively new language, but it is gaining popularity rapidly. The goal of Julia is to overcome Python's shortcomings while remaining easy to learn and use. Even if Julia does eventually replace Python as the top data science language, it probably won't happen for several years or even decades. Julia runs calculations faster than Python, has built-in support for parallelism, and is useful for large-scale simulations such as global climate simulations. However, Julia lacks the robust infrastructure, network, and community that Python has.

Several other languages can be used for data science as well, like JavaScript, Go, Haskell, and others. All of these programming languages are free and open source, like Python. However, all of these other languages lack the large data science ecosystems that Python and R have, and some of them are difficult to learn. For certain specialized tasks, these other languages can be great. But in general, it's best to keep it simple at first and stick with Python.

GUIs and platforms

There are a plethora of graphical user interfaces (GUIs) and data science or analytics platforms. In my opinion, the biggest GUI used for data science is Microsoft Excel. It's been around for decades and makes analyzing data simple. However, as with all GUIs, Excel lacks flexibility. For example, you can't create a boxplot in Excel with a log scale on the y-axis (we will cover boxplots and log scales in Chapter 5, Exploratory Data Analysis and Visualization). This is always the trade-off between GUIs and programming languages – with programming languages, you have ultimate flexibility, but this usually requires more work. With GUIs, it can be easier to accomplish the same thing as with a programming language, but one often lacks the flexibility to customize techniques and results. Some GUIs like Excel also have limits to the amount of data they can handle – for example, Excel can currently only handle about 1 million rows per worksheet.
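The boxplot-with-a-log-scale example above, which Excel can't produce, takes only a few lines of code. Here's a minimal sketch with Matplotlib and synthetic skewed data (the data and labels are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt
import numpy as np

# Skewed synthetic data spanning several orders of magnitude
rng = np.random.default_rng(42)
data = [rng.lognormal(mean=m, sigma=1.0, size=200) for m in (0, 2, 4)]

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_xticklabels(["A", "B", "C"])
ax.set_yscale("log")  # the log-scale y-axis Excel boxplots lack
ax.set_ylabel("value (log scale)")
```

This flexibility-for-effort trade-off is exactly the point: the code is more work than clicking a chart button, but any detail of the plot can be customized.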

Excel is essentially a general-purpose data analytics GUI. Others have created similar GUIs, but more focused on data science or analytics tasks. For example, Alteryx, RapidMiner, and SAS are a few. These aim to incorporate statistical and/or data science processes within a GUI in order to make these tasks easier and faster to accomplish. However, we again trade customizability for ease of use. Most of these GUI solutions also cost money on a subscription basis, which is another drawback.

The last types of GUIs related to data science are visualization GUIs. These include tools like Tableau and QlikView. Although these GUIs can do a few other analytics and data science tasks, they are focused on creating interactive visualizations.

Many of the GUI tools have capabilities to interface with Python or R scripts, which enhances their flexibility. There is even a Python-based data science GUI called "Orange," which allows one to create data science workflows with a GUI.

Cloud tools

As with many things in technology today, some parts of data science are moving to the cloud. The cloud is most useful when we are working with big datasets or need to be able to rapidly scale up. Some of the major cloud providers for data science include:

  • Amazon Web Services (AWS) (general purpose)
  • Google Cloud Platform (GCP) (general purpose)
  • Microsoft Azure (general purpose)
  • IBM (general purpose)
  • Databricks (data science and AI platform)
  • Snowflake (data warehousing)

We can see from Kaggle's 2020 machine learning and data science survey results in Figure 1.3 that AWS, GCP, and Azure seem to be the top cloud resources used by data scientists.

Figure 1.3: The results from the 2020 Kaggle data science survey showing the most-used cloud services

Many of these cloud services have software development kits (SDKs) that allow one to write code to control cloud resources. Almost all cloud services have a Python SDK, as well as SDKs in other languages. This makes it easy to leverage huge computing resources in a reproducible way. We can write Python code to provision cloud resources (called infrastructure as code, or IaC), run big data calculations, assemble a report, and integrate machine learning models into a production product. Interacting with cloud resources via SDKs is an advanced topic, and one should ideally learn the basics of Python and data science before trying to leverage the cloud to run data science workflows. Even when using the cloud, it's best to prototype and test Python code locally (if possible) before deploying it to the cloud and spending resources.

Cloud tools can also be used with GUIs, such as Microsoft's Azure Machine Learning Studio and AWS's SageMaker Studio. This makes it easy to use the cloud with big data for data science. However, one must still understand data science concepts, such as data cleaning caveats and hyperparameter tuning, in order to properly use data science cloud resources for data science. Not only that, but data science GUI platforms on the cloud can suffer from the same problems as running a local GUI on your machine – sometimes GUIs lack the flexibility to do exactly what you want.

Statistical methods and math

As we learned, data science was born out of statistics and computer science. A good understanding of some core statistical methods is a must for doing data science. Some of these essential statistical skills include:

  • Exploratory analysis statistics (exploratory data analysis, or EDA), like statistical plotting and aggregate calculations such as quantiles
  • Statistical tests and their principles, like p-values, chi-squared tests, t-tests, and ANOVA
  • Machine learning modeling, including regression, classification, and clustering methods
  • Probability and statistical distributions, like Gaussian and Poisson distributions

With statistical methods and models, we can do amazing things like predict future events and uncover hidden patterns in data. Uncovering these patterns can lead to valuable insights that can change the way businesses operate and improve the bottom line, or improve medical diagnoses, among other things.
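As a small taste of the skills listed above, here is a hedged sketch (with made-up synthetic data, assuming NumPy and SciPy are installed) combining an EDA-style aggregate calculation (quartiles) with a classic statistical test (a two-sample t-test):

```python
import numpy as np
from scipy import stats

# Two synthetic groups with slightly different means
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=500)
group_b = rng.normal(loc=10.5, scale=2.0, size=500)

# EDA-style aggregates: the quartiles of group A
q1, median, q3 = np.quantile(group_a, [0.25, 0.5, 0.75])

# A two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, p-value={p_value:.4f}")
```

A small p-value would suggest the two groups genuinely differ in their means; we'll treat tests like this properly in the statistics chapters.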

Although an extensive mathematics background is not required, it's helpful to have an analytical mindset. A data scientist's capabilities can be improved by understanding mathematical techniques such as:

  • Geometry (for example, distance calculations like Euclidean distance)
  • Discrete math (for calculating probabilities)
  • Linear algebra (for neural networks and other machine learning methods)
  • Calculus (for training/optimizing some models, especially neural networks)
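For example, the Euclidean distance calculation mentioned above, and the matrix multiplication at the heart of neural networks, can be sketched with NumPy (the numbers here are purely illustrative):

```python
import numpy as np

# Euclidean distance: square root of the sum of squared differences
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
distance = np.linalg.norm(a - b)
print(distance)  # 5.0

# Linear algebra underlies neural networks: one layer is a matrix multiply
X = np.array([[1.0, 2.0]])      # one sample with two features
W = np.array([[0.5], [0.25]])   # weights for a single output neuron
output = X @ W                  # [[1*0.5 + 2*0.25]] = [[1.0]]
print(output)
```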

Many of the more difficult aspects of these mathematical techniques are not required for doing the majority of data science. For example, knowing linear algebra and calculus is most useful for deep learning (neural networks) and computer vision, but not required for most data science work.

Collecting, organizing, and preparing data

Most data scientists spend somewhere between 25% and 75% of their time cleaning and preparing data, according to a 2016 Crowdflower survey and a 2018 Kaggle survey. However, anecdotal evidence suggests many data scientists spend 90% or more of their time cleaning and preparing data. This varies depending on how messy and disorganized the data is, but the fact of the matter is that most data is messy. For example, working with thousands of Excel spreadsheets with different formats and lots of quirks takes a long time to clean up. But loading a CSV file that's already been cleaned is nearly instantaneous. Data loading, cleaning, and organizing are sometimes called data munging or data wrangling (also sometimes referred to as data janitor work). This is often done with the pandas package in Python, which we'll learn about in Chapter 4, Loading and Wrangling Data with Pandas and NumPy.
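As a small illustrative sketch of wrangling with pandas (the data here is made up), consider a CSV with inconsistent missing values and a numeric column containing text:

```python
import io
import pandas as pd

# A messy CSV: an empty field and a "N/A" string both mean "missing"
raw = io.StringIO(
    "name,age,salary\n"
    "Alice,34,70000\n"
    "Bob,,65000\n"
    "Carol,29,N/A\n"
)
df = pd.read_csv(raw, na_values=["N/A"])

# Basic wrangling: drop incomplete rows and enforce numeric types
clean = df.dropna().astype({"age": int, "salary": int})
print(len(clean))  # 1 (only Alice has complete data)
```

Real wrangling is rarely this tidy, but the pattern of loading, standardizing missing values, and enforcing types is the same; Chapter 4 covers it in depth.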

Software development

Programming skills like Python are encompassed by software development, but there is another set of software development skills that are useful to have. This includes code versioning with tools like Git and GitHub, creating reproducible and scalable software products with technologies such as Docker and Kubernetes, and advanced programming techniques. Some people say data science is becoming more like software engineering, since it has started to involve more programming and deployment of machine learning models at scale in the cloud. Software development skills are always good to have as a data scientist, and some of these skills are required for many data science jobs, like knowing how to use Git and GitHub.

Business understanding and communication

Lastly, our data science products and results are useless if we can't communicate them to others. Communication often starts with understanding the problem and audience, which involves business acumen. If you know what risks and opportunities businesses face, then you can frame your data science work through that lens. Communication of results can then be accomplished with classic business tools like Microsoft PowerPoint, although other new tools such as Jupyter Notebook (with add-ons such as reveal.js) can be used to create more interactive presentations as well. Using a Jupyter Notebook to create a presentation allows one to actively demo Python or other code during the presentation, unlike classic presentation software.

 

Specializations in and around data science

Although many people desire a job with the title "data scientist," there are several other jobs and functions out there that are related and sometimes almost the same as data science. An ideal data scientist would be a "unicorn" and encompass all of these skills and more.

Machine learning

Machine learning is a major part of data science, and there are even job titles for people specializing in machine learning, such as "machine learning engineer." Machine learning engineers still use other data science techniques like data munging but have extensive knowledge of machine learning methods. The machine learning field is also moving toward "deployment," meaning the ability to deploy machine learning models at scale. This most often uses the cloud with application programming interfaces (APIs), which allow software engineers or others to access machine learning models; this practice is often called MLOps. However, one cannot deploy machine learning models well without knowing the basics of machine learning first. A data scientist should have machine learning knowledge and skills as part of their core skillset.
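To illustrate the gist of preparing a model for deployment (this is a toy sketch with made-up data, not a full MLOps pipeline), a model is typically trained and serialized before being served behind an API:

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Train a toy classifier on clearly separable, made-up data
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Serialize the trained model to bytes, as you might before
# uploading it to cloud storage for a serving process to use
blob = pickle.dumps(model)

# A serving process would load the model and answer prediction requests
served_model = pickle.loads(blob)
print(served_model.predict([[3.0]]))  # [1]
```

In production, the loading and prediction steps would sit behind a web API, with the cloud handling scaling; the core train-serialize-load-predict pattern stays the same.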

Business intelligence

The business intelligence (BI) field is closely related to data science and shares many of the same techniques. BI is often less technical than other data science specializations. While a machine learning specialist might get into the nitty-gritty details of hyperparameter tuning and model optimization, a BI specialist will be able to utilize data science techniques like analytics and visualization, then communicate to an organization what business decisions should be made. BI specialists may use GUI tools in order to accomplish data science tasks faster and will utilize code with Python or SQL when more customization is needed. Many aspects of BI are included in the data science skillset.

Deep learning

Deep learning and neural networks are almost synonymous; "deep learning" simply means using large neural networks. For almost all applications of neural networks in the modern world, the size of the network is large and deep. These models are often used for image recognition, speech recognition, language translation, and modeling other complex data.

The deep learning boom took off in the 2000s and 2010s, when GPUs rapidly increased in computing power, following Moore's Law. This enabled more powerful software applications that harness GPUs, like computer vision, image recognition, and language translation. Software support for GPUs grew exponentially, such that in the 2020s, we have a plethora of Python and other libraries for running neural networks.

The field of deep learning has academic roots, and people spend four years or longer studying deep learning during their PhDs. Becoming an expert in deep learning takes a lot of work and a long time. However, one can also learn how to harness neural networks and deploy them using cloud resources, which is a very valuable skill. Many start-ups and companies need people who can create neural network models for image recognition applications. Basic knowledge of deep learning is necessary as a data scientist, although deep expertise is rarely required. Simpler models, like linear regression or boosted tree models, can often be better than deep learning models for reasons including computational efficiency and explainability.

Data engineering

Data engineers are like data plumbers, but if that sounds boring, don't let it fool you – data engineering is an enjoyable job. Data engineering encompasses skills often used in the first steps of the data science process: tasks like collecting, organizing, cleaning, and storing data in databases – the sorts of things data scientists spend a large fraction of their time on. Data engineers have skills in Linux and the command line, similar to DevOps folks. Data engineers are also able to deploy machine learning models at scale like machine learning engineers, but a data engineer usually doesn't have as extensive a knowledge of ML models as an ML engineer or general data scientist. As a data scientist, one should know basic data engineering skills, such as how to interact with different databases through Python and how to manipulate and clean data.
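For example, Python's built-in sqlite3 module can stand in for a production database in this toy sketch (the table and data are made up); the general pattern of querying a database from Python is the same with other drivers:

```python
import sqlite3

# An in-memory SQLite database stands in for a real database here
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0)],
)

# Aggregate in SQL, then pull the results into Python for analysis
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 250.0)]
conn.close()
```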

Big data

Big data and data engineering overlap somewhat. Specialists in both areas need to know about databases, how to interact with them and use them, as well as how to use various cloud technologies for working with big data. However, a big data specialist should be an expert in the Hadoop ecosystem, Apache Spark, and cloud solutions for big data analytics and storage. These are the top tools used for big data. Spark began to overtake Hadoop in the late 2010s, as Spark is better suited for the cloud technologies of today.

However, Hadoop is still used in many organizations, and aspects of Hadoop, like the Hadoop Distributed File System (HDFS), live on and are used in conjunction with Spark. In the end, a big data specialist and data engineer tend to do very similar work.

Statistical methods

Statistical methods, like the ones we will learn about in Chapters 8 and 9, can be a focus area for data scientists. As we already mentioned, statistics is one of the fields from which data science evolved. A specialization in statistics will likely utilize other software such as SPSS, SAS, and the R programming language to run statistical analyses.

Natural Language Processing (NLP)

Natural language processing (NLP) involves using computers to understand human language in the form of writing and speech. Usually, this involves processing and modeling text data, often sourced from social media or other large collections of documents. In fact, one subspecialization within NLP is chatbots. Other aspects of NLP include sentiment analysis and topic modeling. Modern NLP also overlaps with deep learning, since many NLP methods now use neural networks.
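As a tiny illustrative sketch (the sentence here is made up), the bag-of-words representation that many NLP methods build on can be computed with Python's standard library:

```python
import re
from collections import Counter

# A toy bag-of-words: lowercase, tokenize, and count word frequencies -
# the simplest text representation many NLP methods build on
text = "The food was great, the service was not great."
tokens = re.findall(r"[a-z]+", text.lower())
bag = Counter(tokens)
print(bag["great"], bag["food"])  # 2 1
```

Real NLP pipelines use far more sophisticated tokenization and representations (such as word embeddings), but counting words is where most text modeling starts.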

Artificial Intelligence (AI)

Artificial intelligence (AI) encompasses machine learning and deep learning, and often cloud technologies for deployment. Jobs related to AI have titles like "artificial intelligence engineer" and "artificial intelligence architect." This specialization overlaps with machine learning, deep learning, and NLP quite a lot. However, there are some specific AI methods, such as pathfinding, that are useful for fields such as robotics.
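For example, pathfinding can be sketched with breadth-first search on a toy grid (the grid and function name here are illustrative, and real robotics systems use more advanced algorithms such as A*):

```python
from collections import deque

# Breadth-first search on a small grid, a classic AI pathfinding method.
# 1 marks an obstacle; movement is up/down/left/right.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]

def shortest_path_length(start, goal):
    """Return the number of steps in the shortest path, or -1 if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nr, nc = r + dr, c + dc
            if (0 <= nr < 3 and 0 <= nc < 3
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return -1

print(shortest_path_length((0, 0), (2, 0)))  # 6: around the wall of obstacles
```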

Choosing how to specialize

First, realize that you don't need to choose a specialization – you can stick with the general data science track. However, having a specialization can make it easier to land a job in that field. For example, you'd have an easier time getting a job as a big data engineer if you spent a lot of time working on Hadoop, Spark, and cloud big data projects. In order to choose a specialization, it helps to first learn more about what the specialization entails, and then practice it by carrying out a project that uses that specialization.

It's a good idea to try out some of the tools and technologies in the different specializations, and if you like a specialization, you might stick with it. We will learn some of the tools and techniques for the specializations above except for deep learning and big data. So, if you find yourself enjoying the machine learning topic quite a bit, you might explore that specialization more by completing some projects within machine learning. For example, a Kaggle competition can be a good way to try out a machine learning focus within data science. You might also look into a specialized book on the topic to learn more, such as Interpretable Machine Learning with Python by Serg Masis from Packt. Additionally, you might read about and learn some MLOps.

If you know you like communicating with others and have experience and enjoy using GUI tools such as Alteryx and Tableau, you might consider the BI specialization. To practice this specialization, you might take some public data from Kaggle or a government website (such as data.gov) and carry out a BI project. Again, you might look into a book on the subject or a tool within BI, such as Mastering Microsoft Power BI by Brett Powell from Packt.

Deep learning is a specialization that many enjoy but is very difficult. Specializing in neural networks takes years of practice and study, although start-ups will hire people with less experience. Even within deep learning there are sub-specializations – image recognition, computer vision, sound recognition, recurrent neural networks, and more. To learn more about this specialization and see if you like it, you might start with some short online courses such as Kaggle's courses at https://www.kaggle.com/learn/. You might then look into further reading materials such as Deep Learning for Beginners by Pablo Rivas from Packt. Other learning and reading materials on deep learning exist for the specialized libraries, including TensorFlow/Keras, PyTorch, and MXNet.

Data engineering is a great specialization because it is expected to experience rapid growth in the near future, and people tend to enjoy the work. We will get a taste of data engineering when we deal with data in Chapters 4, 6, and 7, but you might want to learn more about the subject if you're interested from other materials such as Data Engineering with Python by Paul Crickard from Packt.

With big data specialization, you might look into more learning materials such as the many books within Packt that cover Apache Spark and Hadoop, as well as cloud data warehousing. As mentioned earlier, the big data and data engineering specializations have significant overlap. However, specialization in data engineering would likely be better for landing a job in the near future. Statistics as a specialization is a little trickier to try out, because it can rely on using specialized software such as SPSS and SAS. However, you can try out several of the statistics methods available in R for free, and can learn more about that specialization to see if you like it with one of the many R statistics books by Packt.

NLP is a fun specialization, but like deep learning, it takes a long time to learn. We will get a taste of NLP in Chapter 17, but you can also try the spaCy course here: https://course.spacy.io/en/. The book Hands-On Natural Language Processing with Python by Rajesh Arumugam and Rajalingappaa Shanmugamani is also a good resource to learn more about the subject.

Finally, AI is an interesting specialization that you might consider. However, it can be a broad specialization, since it can include aspects of machine learning, deep learning, NLP, cloud technologies, and more. If you enjoy machine learning and deep learning, you might look into learning more about AI to see if you'd be interested in specializing in it. Packt has several books on AI, and there is also the book Artificial Intelligence: Foundations of Computational Agents by David L. Poole and Alan K. Mackworth, which is free online at https://artint.info/2e/html/ArtInt2e.html.

If you choose to specialize in a field, realize that you can peel off into a parallel specialization. For example, data engineering and big data are highly related, and you could easily switch from one to another. On the other hand, machine learning, AI, and deep learning are rather related and could be combined or switched between. Remember that to try out a specialization, it helps to first learn about it from a course or book, and then try it out by carrying out a project in that field.

 

Data science project methodologies

When working on a large data science project, it's good to organize it into a process of steps. This especially helps when working as a team. We'll discuss a few data science project management strategies here. If you're working on a project by yourself, you don't necessarily need to exactly follow every detail of these processes. However, seeing the general process will help you think about what steps you need to take when undertaking any data science task.

Using data science in other fields

Instead of focusing primarily on data science and specializing there, one can also use these skills for their current career path. One example is using machine learning to search for new materials with exceptional properties, such as superhard materials (https://par.nsf.gov/servlets/purl/10094086) or using machine learning for materials science in general (https://escholarship.org/uc/item/0r27j85x). Again, anywhere we have data, we can use data science and related methods.

CRISP-DM

CRISP-DM stands for Cross-Industry Standard Process for Data Mining and has been around since the late 1990s. It's a six-step process (business understanding, data understanding, data preparation, modeling, evaluation, and deployment), illustrated in the diagram below.

Figure 1.4: A reproduction of the CRISP-DM process flow diagram

This was created before data science existed as its own field, although it's still used for data science projects. It's easy to implement roughly, although the official implementation requires a lot of documentation, and the official publication outlining the method is a 60-page read. However, it's at least worth knowing about and considering if you are undertaking a data science project.

TDSP

TDSP, or the Team Data Science Process, was developed by Microsoft and launched in 2016. It's much more modern than CRISP-DM, and is often a better choice for running a data science project today.

The five steps of the process (business understanding; data acquisition and understanding; modeling; deployment; and customer acceptance) are similar to CRISP-DM's, as shown in the figure below.

Figure 1.5: A reproduction of the TDSP process flow diagram

TDSP improves upon CRISP-DM in several ways, including defining roles for people within the process. It also has modern amenities, such as a GitHub repository with a project template and more interactive web-based documentation. Additionally, it allows more iteration between steps with incremental deliverables and uses modern software approaches to project management.

Further reading on data science project management strategies

There are other data science project management strategies out there as well. You can read about them at https://www.datascience-pm.com/.

You can find the official guide for CRISP-DM here:

https://www.the-modeling-agency.com/crisp-dm.pdf

And the guide for TDSP is here:

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

Other tools

Other tools used by data scientists include Kanban boards, Scrum, and the Agile software development framework. Since data scientists often work with software engineers to implement data science products, many of the organizational processes from software engineering have been adopted by data scientists.

 

Test your knowledge

To help you remember what you just learned, try answering the following questions. Try to answer the questions without looking back at the answers in the chapter at first. The answer key is included in the GitHub repository for this book (https://github.com/PacktPublishing/Practical-Data-Science-with-Python).

  1. What are the top three data science programming languages, in order, according to the 2020 Kaggle data science and machine learning survey?
  2. What is the trade-off between using a GUI versus using a programming language for data science? What are some of the GUIs for data science that we mentioned?
  3. What are the top three cloud providers for data science and machine learning according to the Kaggle 2020 survey?
  4. What percentage of time do data scientists spend cleaning and preparing data?
  5. What specializations in and around data science did we discuss?
  6. What data science project management strategies did we discuss, and which one is the most recent? What are their acronyms and what do the acronyms stand for?
  7. What are the steps in the two data science project management strategies we discussed? Try to draw the diagrams of the strategies from memory.
 

Summary

You should now have a basic understanding of how data science came to be, what tools and techniques are used in the field, specializations in data science, and some strategies for managing data science projects. We saw how the ideas behind data science have been around for decades, but data science didn't take off until the 2010s. It was in the 2000s and 2010s that the deluge of data from the internet coupled with high-powered computers enabled us to carry out useful analysis on large datasets.

We've also seen some of the skills we'll need to learn to do data science, many of which we will tackle throughout this book. Among those skills are Python and general programming skills, software development skills, statistics and mathematics for data science, business knowledge and communication skills, cloud tools, machine learning, and GUIs.

We've seen some specializations in data science as well, like machine learning and data engineering. Lastly, we looked at some data science project management strategies that can help organize a team data science project.

Now that we know a bit about data science, we can learn about the lingua franca of data science: Python.

About the Author
  • Nathan George

    Nathan George is a data scientist at Tink in Stockholm, Sweden, and taught data science as a professor at Regis University in Denver, CO for over 4 years. Nathan has created online courses on Pythonic data science and uses Python data science tools for electroencephalography (EEG) research with the OpenBCI headset and public EEG data. His education includes the Galvanize data science immersive, a PhD from UCSB in Chemical Engineering, and a BS in Chemical Engineering from the Colorado School of Mines.
