You're reading from Building Data Science Solutions with Anaconda

Product type Book

Published in May 2022

Publisher Packt

ISBN-13 9781800568785

Pages 330 pages

Edition 1st Edition

Concepts

Data Science

Author (1):

Dan Meador

Table of Contents (16) Chapters

Preface

1. Part 1: The Data Science Landscape – Open Source to the Rescue

2. Chapter 1: Understanding the AI/ML landscape

3. Chapter 2: Analyzing Open Source Software

4. Chapter 3: Using the Anaconda Distribution to Manage Packages

5. Chapter 4: Working with Jupyter Notebooks and NumPy

6. Part 2: Data Is the New Oil, Models Are the New Refineries

7. Chapter 5: Cleaning and Visualizing Data

8. Chapter 6: Overcoming Bias in AI/ML

9. Chapter 7: Choosing the Best AI Algorithm

10. Chapter 8: Dealing with Common Data Problems

11. Part 3: Practical Examples and Applications

12. Chapter 9: Building a Regression Model with scikit-learn

13. Chapter 10: Explainable AI - Using LIME and SHAP

14. Chapter 11: Tuning Hyperparameters and Versioning Your Model

15. Other Books You May Enjoy

Understanding the massive generation of new data

Put simply, data is the fuel that powers all things in AI, and the amount of it is staggering. In 2018, it was calculated that 90% of the world's data had been created in the previous 2 years. There is no reason to think that stat won't hold no matter when you read this. But who cares? On its own, data means nothing unless you are able to use it.

Fracking, a relatively new technique for extracting oil, has opened up access to previously unreachable reserves. Without it, those energy reserves would have sat there doing nothing. This is exactly what AI does with data: it lets us tap into previously useless stores of information and unlock their value.

Data is just a recording of a specific state of the world or an event at a specific time. The ability to record it, and the cost of doing so, have followed the famous Moore's law, making it cheaper and quicker to store and retrieve huge amounts. Just look at the price of hard drive storage over the years: from $3.5 million per GB in 1964 to about 2 cents ($0.02) per GB in 2021.

Moore's Law

Named after Gordon Moore, the famed co-founder and former CEO of Intel, Moore's law states that the number of transistors that can fit on a chip will double about every 2 years. This is often misquoted as 18 months, but that is actually a separate prediction from a fellow Intel employee, David House, about chip performance rather than transistor count. The law may also have been a self-fulfilling prophecy, as hitting that pace became a goal and not just happenstance.

It turned out that this law applies to many things beyond compute speed. The cost of many goods (especially tech) follows a similar curve: you will find it in automotive, in TV cost and resolution, and in many other fields.
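As a rough sanity check of that claim, the hard drive prices quoted earlier can be turned into an implied halving time. This is only a sketch: it assumes the 2021 figure is about $0.02 per GB and that prices fell at a steady exponential rate, which real prices did not do exactly. The variable names are illustrative.

```python
import math

# Figures quoted in the text (2021 price assumed to be about $0.02 per GB).
price_1964_usd_per_gb = 3_500_000.0
price_2021_usd_per_gb = 0.02
years_elapsed = 2021 - 1964

# How many times the price must halve to get from the old price to the new one.
halvings = math.log2(price_1964_usd_per_gb / price_2021_usd_per_gb)

# Average number of years per halving, assuming a steady exponential decline.
years_per_halving = years_elapsed / halvings

print(f"Price halvings: {halvings:.1f}")
print(f"Years per halving: {years_per_halving:.2f}")
```

Under these assumptions, the price per GB halves roughly every 2 years, which is close to the doubling period in Moore's law.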

If you heard that your used Coke cans would be worth $10 once a new recycling factory was built, would you throw them in the garbage? That is similar to what all companies are hearing about their data. Even though they might not be making use of it today, they are still collecting and storing everything they can in the hope that someday, it can be used. Storage is also cheap, and getting cheaper. For both of these reasons, saving the data is seen as the better move, since it could be worth far more than the cost of storing it.

What data do you have that could be valuable? Consider HR hiring reports, the exact time of each customer purchase, or keywords in searches. Each piece of data on its own might not give you much insight, but combined with the power ML gives you to find patterns, that data could have incredible value. Even if you don't see what could be done now, the message to companies is "Just hang on to it, maybe there is a use for it." Because of this, companies have become data pack rats.

One movement that has led to a huge increase in data is the rapid growth in the number of IoT devices. IoT stands for Internet of Things, and it is the concept that everyday, normal devices are connected to the same internet that you get your email, YouTube, and Facebook from. Light switches, vacuums, and even fridges can be connected and send data, which is collected by the manufacturer in order to improve their functionality (hopefully).

These seemingly tiny pinpricks of data combined to create 13.6 zettabytes in 2019, and it's not slowing down. By 2025, the total is expected to reach 79.4 zettabytes! You will be hard-pressed to find new devices that aren't IoT-ready, as companies are always looking to add that new feature to the latest offering. From a physical perspective, if each gigabyte in a zettabyte were a brick, you would have 1 trillion bricks, enough to build the Great Wall of China, at about 3,873,000,000 bricks, 258 times over. That's a lot of data to take care of and process!
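The brick comparison is simple arithmetic, and a short sketch makes it easy to check. It assumes decimal (SI) units and treats the 3,873,000,000 figure as the brick count of one Great Wall; the constant names are illustrative.

```python
# Decimal (SI) units: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21

# Assumption: the brick count quoted in the text is for one Great Wall.
BRICKS_PER_GREAT_WALL = 3_873_000_000

# One brick per gigabyte, so this is also the number of bricks.
gigabytes_per_zettabyte = BYTES_PER_ZB // BYTES_PER_GB
walls_per_zettabyte = gigabytes_per_zettabyte / BRICKS_PER_GREAT_WALL

print(f"Gigabytes (bricks) in one zettabyte: {gigabytes_per_zettabyte:,}")
print(f"Great Walls per zettabyte: {walls_per_zettabyte:.0f}")
```

Note that this covers a single zettabyte; the yearly totals quoted above are many times larger.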

New technologies and even software architecture patterns have been developed to handle all this data. Event-based architecture is a way to handle data that turns the traditional database model inside out. Instead of storing everything in a database, it keeps a stream of events, and anything that needs that data can reach into the stream and grab what it needs. There is so much data that companies don't even bother putting it in one place!
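To make the idea of reaching into a stream concrete, here is a minimal in-memory sketch. It is not how any particular streaming product works: `EventStream`, `Consumer`, and the event fields are hypothetical names chosen for illustration, and a real system would add persistence, partitioning, and delivery guarantees.

```python
from dataclasses import dataclass, field


@dataclass
class EventStream:
    """Append-only, in-memory list of events (a stand-in for a real stream)."""
    events: list = field(default_factory=list)

    def publish(self, event: dict) -> None:
        # Producers only ever append; nothing is updated in place.
        self.events.append(event)

    def read_from(self, offset: int) -> list:
        # Consumers read everything at or after their own position.
        return self.events[offset:]


@dataclass
class Consumer:
    """Reads the stream at its own pace and keeps only the event type it needs."""
    wanted_type: str
    offset: int = 0
    seen: list = field(default_factory=list)

    def poll(self, stream: EventStream) -> None:
        new_events = stream.read_from(self.offset)
        self.offset += len(new_events)
        self.seen.extend(e for e in new_events if e["type"] == self.wanted_type)


stream = EventStream()
stream.publish({"type": "purchase", "amount": 12.50})
stream.publish({"type": "search", "keywords": "tiger plush"})
stream.publish({"type": "purchase", "amount": 40.00})

# Two independent consumers take different things from the same stream.
billing = Consumer(wanted_type="purchase")
analytics = Consumer(wanted_type="search")
billing.poll(stream)
analytics.poll(stream)

print(len(billing.seen), "purchase events;", len(analytics.seen), "search event")
```

Each consumer tracks its own offset, so adding a new consumer later does not require changing the producers or the other consumers.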

But more data isn't always the answer. There are many times when more data is the enemy. Datasets that contain a large amount of incorrectly labeled data can produce a much more poorly trained model. This, of course, makes intuitive sense. Trying to teach a person or a computer something while giving them examples that aren't valid isn't going to get the output that you are looking for.
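The effect of bad labels can be seen even in a toy setting. The sketch below uses made-up one-feature data and a very simple nearest-centroid classifier written from scratch (an illustrative choice, not a method from this book); the function and variable names are hypothetical.

```python
from statistics import mean


def fit_centroids(examples):
    """Return the mean feature value for each label (a nearest-centroid model)."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Pick the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def accuracy(centroids, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(centroids, value) == label for value, label in examples)
    return correct / len(examples)


# Made-up data: low values belong to class 0, high values to class 1.
clean_train = [(1, 0), (2, 0), (3, 0), (4, 0), (7, 1), (8, 1), (9, 1), (10, 1)]
# Same features, but the first two examples are incorrectly labeled as class 1.
noisy_train = [(1, 1), (2, 1), (3, 0), (4, 0), (7, 1), (8, 1), (9, 1), (10, 1)]
test_set = [(2, 0), (3, 0), (5, 0), (6, 1), (8, 1), (9, 1)]

clean_accuracy = accuracy(fit_centroids(clean_train), test_set)
noisy_accuracy = accuracy(fit_centroids(noisy_train), test_set)

print(f"Accuracy with clean labels: {clean_accuracy:.2f}")
print(f"Accuracy with mislabeled examples: {noisy_accuracy:.2f}")
```

The two mislabeled examples drag the class 1 centroid toward the class 0 region, so a borderline test point that the clean model gets right is now misclassified.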

Let's look at a simple example of explaining to your child what a tiger is. You point out two large orange cats with black stripes and tell them, "Look, that's what a tiger looks like." You then point to an all-black cat and say, "And look! There is another one," incorrectly telling your child that this other animal is also a tiger. You have now created a dataset that contains a false positive, and it could make it challenging for your child to learn what an actual tiger is. With this three-example dataset, false positives might be the issue, whereas if your child had seen just one tiger, false negatives might be the issue.

Important Note

This is for example purposes only. Any model trained on just one data point is almost guaranteed to not provide very accurate end results. There is a field of study known as one-shot learning that attempts to work with just one data point, but this is generally found in vision problems.

You might also have an issue where the data being fed in during training doesn't resemble the live production data. Keeping training data as close as possible to the data the model will see in production is critical, so if, in our tiger example from before, you pointed out a swimming tiger from 300 ft away, your child might find it very challenging to identify one when they see one walking from 10 ft away in the zoo. More doesn't always equal better.
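One simple way to notice this kind of mismatch is to compare a feature's summary statistics between the training data and live data. The sketch below is a deliberately crude check using made-up viewing distances; the threshold and all names are hypothetical, and real drift detection usually looks at whole distributions rather than just the mean.

```python
from statistics import mean, stdev

# Made-up viewing distances in feet, echoing the tiger example.
training_distances_ft = [280, 300, 310, 295, 305]  # tigers seen from far away
production_distances_ft = [8, 10, 12, 9, 11]       # tigers seen up close at the zoo


def shift_score(train, live):
    """Gap between the two means, measured in training standard deviations."""
    return abs(mean(live) - mean(train)) / stdev(train)


SHIFT_THRESHOLD = 3.0  # hypothetical cutoff; tune it for your own data

score = shift_score(training_distances_ft, production_distances_ft)
if score > SHIFT_THRESHOLD:
    print(f"Shift score {score:.1f}: production data looks unlike the training data")
else:
    print(f"Shift score {score:.1f}: no large shift detected")
```

A large score is a hint that the model is being asked about situations it never saw during training, which is a prompt to collect more representative examples.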

Having data is critical to the success of AI, but the true driving force behind its adoption is what it can do for the world of business, such as Netflix recommending shows you will like, Google letting advertisers get their business in front of the right people, and Amazon showing you other products to fill your cart. This all allows businesses to scale like no other technique or approach out there and helps them continue to dominate in their space.
