You're reading from  Artificial Intelligence with Python - Second Edition

Product type: Book
Published in: Jan 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781839219535
Edition: 2nd Edition
Author (1)
Prateek Joshi

Prateek Joshi is the founder of Plutoshift and a published author of 9 books on Artificial Intelligence. He has been featured on Forbes 30 Under 30, NBC, Bloomberg, CNBC, TechCrunch, and The Business Journals. He has been an invited speaker at conferences such as TEDx, Global Big Data Conference, Machine Learning Developers Conference, and Silicon Valley Deep Learning. Apart from Artificial Intelligence, some of the topics that excite him are number theory, cryptography, and quantum computing. His greater goal is to make Artificial Intelligence accessible to everyone so that it can impact billions of people around the world.

Artificial Intelligence and Big Data

In this chapter, we are going to learn what big data is and how big data technologies can be used in the context of artificial intelligence. We will discuss how big data can help accelerate machine learning pipelines. We will also discuss when it is a good idea to use big data techniques and when they are overkill, using some examples to further our understanding. We will learn about the building blocks of a machine learning pipeline that uses big data and the various challenges involved, and we will create an environment in Python to see how it works in practice. By the end of this chapter, we will have covered:

  • Big data basics
  • The three V's of big data
  • Big data as it applies to artificial intelligence and machine learning
  • A machine learning pipeline using big data
  • Apache Hadoop
  • Apache Spark
  • Apache Impala
  • NoSQL databases

Let's begin with the basics of big data.

Big data basics

There is an activity that you regularly perform today that you rarely did ten years ago, and certainly never did twenty years ago. Yet, if you were told you could never do this again, you would feel completely hamstrung. You probably already did it a few times today, or at least this week. What am I talking about? Want to take a guess? I am talking about performing a Google search.

Google hasn't been around that long, yet we are so dependent on it now. It has disrupted and upended a big swath of industries including magazine publishing, phone directories, newspapers, and so on. Nowadays, whenever we have a knowledge itch, we use Google to scratch it. This is especially the case due to our permanent link to the internet via cell phones. Google has almost become an extension of us.

Yet, have you stopped to think about the mechanics behind all of this incredible knowledge-finding? We literally have billions of documents at our fingertips that can...

The three V's of big data

Not long ago, it was customary to purge any log produced by an application after 90 days or so. Companies have since realized that they were throwing away gold nuggets of information. Additionally, storage has become cheap enough that keeping these logs is a no-brainer. The cloud, the internet, and general advances in technology also create even more data now. The number of devices storing and transmitting data, from smartphones to IoT gadgets, industrial sensors, and surveillance cameras, has grown rapidly around the world, contributing to an explosion in the volume of data.

Figure 4: The three V's of big data

According to IBM, 2.5 exabytes of data were generated every day in 2012. That's a big number by any measure. Also, about 75% of data is unstructured, coming from sources such as text, voice, and video.
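To get a feel for the scale of that figure, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes decimal units (1 exabyte = 10^18 bytes), and applying the 75% unstructured share to the 2012 daily volume is an assumption made purely for the example.

```python
# Back-of-the-envelope arithmetic for the 2.5 exabytes/day figure.
# Assumption: decimal units, so 1 exabyte = 10**18 bytes.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_TERABYTE = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

daily_exabytes = 2.5
daily_bytes = daily_exabytes * BYTES_PER_EXABYTE

# Average rate at which data was generated, in terabytes per second.
terabytes_per_second = daily_bytes / SECONDS_PER_DAY / BYTES_PER_TERABYTE

# Assumption: the ~75% unstructured share applies to the daily volume.
unstructured_exabytes_per_day = 0.75 * daily_exabytes

print(f"Average rate: {terabytes_per_second:.1f} TB/s")
print(f"Unstructured: {unstructured_exabytes_per_day} EB/day")
```

Under these assumptions, the daily figure works out to roughly 29 terabytes generated every second.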

The fact that so much of the new data being created is...

Big data and machine learning

Big data technologies are leveraged successfully by technology companies around the world. Today's enterprises understand the power of big data, and they realize that it can be even more powerful when used in conjunction with machine learning.

Machine learning systems coupled with big data technology help businesses in many ways, including managing, analyzing, and using captured data far more strategically than before.

As companies capture and generate ever-increasing volumes of data, they face both a challenge and a great opportunity. Fortunately, big data and machine learning complement each other. Businesses are constantly coming up with new models that increase the computational requirements of the resulting workloads. New advances in big data enable and facilitate the processing of these new use cases. Data scientists are seeing that current architectures can handle this increased workload. They are therefore...

NoSQL databases

Before we delve deeper into specific types of NoSQL databases, let's first understand what a NoSQL database is. It's not a great name, but it is hard to come up with anything better. As the name implies, a NoSQL database is any database that is not a SQL database. The term covers a variety of database technologies that were built in response to market demand for products that could handle bigger workloads and larger, more diverse datasets.

Data is the new oil, and it exists in a wide variety of places. Log files, audio, video, click streams, IoT data, and emails are some examples of the data that needs to be processed and analyzed. Traditional SQL databases require a structured schema before the data can be used. Additionally, they were not built to take advantage of the commodity storage and processing power that is easily available today.
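The schema requirement can be illustrated with a minimal sketch that uses only the Python standard library. The `events` table, its columns, and the sample records are hypothetical names invented for this example, and the list of JSON strings merely stands in for a schemaless store; it is not any particular NoSQL product.

```python
import json
import sqlite3

# Relational side: the table schema must be declared before any row is stored.
# The "events" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.execute("INSERT INTO events VALUES (?, ?)", (1, "login"))

# A record carrying an extra field does not fit the declared schema.
try:
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", (2, "click", "button-7"))
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# Schemaless side: each record is a self-describing JSON document,
# so records with different fields can sit next to each other.
documents = [
    json.dumps({"user_id": 1, "action": "login"}),
    json.dumps({"user_id": 2, "action": "click", "target": "button-7"}),
]

# Readers handle missing fields when the data is read, not when it is written.
parsed = [json.loads(doc) for doc in documents]
targets = [doc.get("target") for doc in parsed]

print("SQL insert rejected:", schema_rejected)
print("Targets found in documents:", targets)
conn.close()
```

The trade-off in this sketch is that the relational table rejects the unexpected field up front, while the document approach accepts it and leaves it to the reader to cope with fields that may be absent.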

Types of NoSQL databases

Document databases – Document databases are used to store semi-structured...

Summary

In this chapter, we first laid a foundation of the core concepts around big data. We then learned about several big data technologies, starting with the "granddaddy" of them all, Hadoop. We also learned about Spark, perhaps the most popular big data tool on the market today.

Finally, we learned about another technology that is commonly used in big data implementations: NoSQL databases. NoSQL database engines power many of the biggest workloads in Fortune 500 companies and serve millions of pages on the most popular websites that exist today.

For all the amazing and exciting applications that exist for machine learning today, it is our firm belief that we are only scratching the surface of what is possible. It is our sincere hope that you feel you have a better grasp of the concepts involved in doing machine learning, but more importantly...
