You're reading from Building Data Science Solutions with Anaconda

Product type: Book
Published in: May 2022
Publisher: Packt
ISBN-13: 9781800568785
Edition: 1st Edition
Author: Dan Meador

Dan Meador is an Engineering Manager at Anaconda and is the creator of Conda as well as a champion of open source at Anaconda. With a history of engineering and client-facing roles, he has the ability to jump into any position. He has a track record of delivering as a leader and a follower in companies from the Fortune 10 to startups.
Preface

When Marc Andreessen (https://www.crunchbase.com/person/marc-andreessen) wrote his famous Wall Street Journal article, Why Software Is Eating The World (https://bit.ly/MarcAndreessen), he described a reality in which every company would be required to become a software company. The power of software was too great, its reach too vast. Companies ignored it at their own peril. We are at the same inflection point now with Artificial Intelligence (AI).

The field of AI is complex enough to be daunting for newcomers, and even those already working in it find it challenging to keep every area covered. Bias in models and data, interpretability and explainability, and even the management of data science packages are often poorly understood, even though they are critical to building the AI systems that will power our world. These concepts and more are no longer going to be optional, yet too many resources leave them, and many other areas of practical data science, out.

After you are done reading this book, you'll wonder how anyone can be in this field and not have an understanding of core concepts such as proximity bias, using Anaconda Distribution, and how Shapley values tell you how features influence a model. All of this is knowledge that you will soon possess. We'll focus on the pragmatic and applicable as we use analogies to solidify your understanding. By the end, you'll be well positioned to take your knowledge of data science to the next level.

Who this book is for

This book is for anyone who wants to better understand the world of data science, as well as for those who already have a decent grasp of it and want to become more well rounded in areas such as Anaconda tools and Open Source Software (OSS). We assume that you don't yet have a grasp of areas such as bias or interpretability, and that you don't know all the various types of algorithms you can use to create AI/ML models. We've designed this book to be as self-contained as possible, so you'll only need outside resources when you want to go deeper.

Some basic technical knowledge is expected, but being a developer or even knowing much about data science is not a necessity. You can read this book from beginning to end, or you can jump to the chapters that seem most relevant to you. While each chapter does build on the previous ones, we have structured it in such a way that you won't be lost if you choose to navigate to a specific topic.

What this book covers

Chapter 1, Understanding the AI/ML Landscape, provides an overview of the current state of data science as well as what tools you'll need to succeed.

Chapter 2, Analyzing Open Source Software, delves into the role of OSS in data science and how to decide what new OSS tool to use. You'll get a systematic checklist to look for in the next tool you evaluate.

Chapter 3, Using Anaconda Distribution to Manage Packages, covers how to manage packages with conda and Navigator. This includes how to create environments and create channels.

Chapter 4, Working with Jupyter Notebooks and NumPy, covers how to successfully turn notebooks into your daily driver to create data science value. We'll also go deeper into the powerful NumPy library to vastly speed up our operations.

Chapter 5, Cleaning and Visualizing Data, looks at the core techniques you'll need to shape data coming in to prepare it for model training. We'll cover areas such as imputing and also how we can visualize our data to gain a greater understanding.

Chapter 6, Overcoming Bias in AI/ML, looks at the many ways that naive ignorance can be present in our data and what we can do to avoid or correct these issues. You'll see what the real-world impacts are of a biased AI model.

Chapter 7, Choosing the Best AI Algorithm, goes into some of the major problem families that AI/ML models can help with, including regression and anomaly detection. We'll check out the algorithms you can use as well as the comparative rating for each.

Chapter 8, Dealing with Common Data Problems, looks at how you can identify and correct errors in your datasets, such as incorrect data entries. You'll also see how to scale your data and encode categorical features.

Chapter 9, Building a Regression Model with scikit-learn, walks you through a complete flow of building a regression model and how you can evaluate the results.

Chapter 10, Explainable AI – Using LIME and SHAP, goes further into the results of a model so that you can interpret and explain how it arrived at them. Both models that are interpretable by design and black-box models are covered.

Chapter 11, Tuning Hyperparameters with scikit-learn Pipelines, takes a more holistic approach and shows you how to leverage pipelines to create a flexible and repeatable process for data preparation and model creation. We'll cover how to use these tools to tune your hyperparameters to create a better model.

To get the most out of this book

All the software used in this book is open source, meaning you will not have to pay for any of it. If you do find it useful, you are encouraged to find ways to give back to these communities, either financially or by contributing to the code base directly. NumFOCUS sponsors many of the tools used; you can find more about them at their website: https://numfocus.org/.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Data-Science-Solutions-with-Anaconda. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800568785_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

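from sklearn.model_selection import train_test_split
# cali_data is assumed to be a dataset loaded earlier,
# e.g. with sklearn.datasets.fetch_california_housing()
training_data = cali_data.data
target_value = cali_data.target
# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    training_data, target_value, test_size=0.2, random_state=5
)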

Any command-line input or output is written as follows:

conda install numpy

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "At the top right of the screen, there will be a Fork button."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Building Data Science Solutions with Anaconda, we'd love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
