You're reading from Practical Big Data Analytics

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783554393
Edition: 1st Edition
Languages: Java
Tools: Hadoop, Apache Spark
Concepts: Big Data
Author: Nataraj Dasgupta

Chapter 5. Big Data Mining with NoSQL

The term NoSQL was first used by Carlo Strozzi, who, in 1998, released the Strozzi NoSQL open source relational database. In the late 2000s, new paradigms in database architecture emerged, many of which did not adhere to the strict constraints required of relational database systems. These databases, due to their non-conformity with standard database conventions such as ACID compliance, were soon grouped under a broad category known as NoSQL.

Each NoSQL database claims to be optimal for certain use cases. Although few of them would fit the requirements to be a general-purpose database management system, they all leverage a few common themes across the spectrum of NoSQL systems.

In this chapter, we will visit some of the broad categories of NoSQL database management systems. We will discuss the primary drivers that initiated the migration to NoSQL database systems and how such databases solved specific business needs that led to their widespread adoption...

Why NoSQL?

The term NoSQL generally means Not Only SQL: that is, the underlying database has properties that differ from those of common and traditional database systems. There is no clear distinction that qualifies a database as NoSQL, other than the fact that such databases do not provide the full guarantees of ACID compliance. It is therefore helpful to understand the nature of the ACID properties that have been the mainstay of database systems for many decades, and to discuss, in brief, the significance of BASE and CAP, two other terms central to databases today.

The ACID, BASE, and CAP properties

Let's first proceed with ACID and SQL.

ACID and SQL

ACID stands for atomicity, consistency, isolation, and durability:

Atomicity: This indicates that database transactions either execute in full or do not execute at all. In other words, either all the operations in a transaction are committed, that is, persisted in their entirety, or none of them are committed. There is no scope for a partial execution...
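
As a minimal sketch of atomicity, the following example uses Python's built-in sqlite3 module with a hypothetical accounts table (both are assumptions for illustration and are not taken from the book). A transfer consists of two updates. If the second one fails, the first is rolled back as well, so the database never reflects a partial transfer:

```python
import sqlite3

# Hypothetical schema for illustration: balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one transaction; return True on commit."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception is raised inside it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            # An overdraft violates the CHECK constraint and raises
            # IntegrityError, which also undoes the credit above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # both updates are rolled back
print(ok, failed, balances(conn))
```

After the failed transfer, the balances are the same as they were after the successful one: the credit to the destination account was not persisted on its own.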

NoSQL databases

In our discussion of NoSQL types and databases, we will primarily focus on the following categories of NoSQL databases:

In-memory databases
Columnar databases
Document-oriented databases
Key-value databases
Graph databases
Other NoSQL types and summary

Most types of NoSQL used in the industry today fall into one or more of these categories. The next few sections will discuss the high-level properties of each of these NoSQL offerings, their main advantages, and products in the market that fall into the respective categories.

In-memory databases

In-memory databases, as the name implies, leverage the computer's memory, that is, the RAM, to store datasets. Before we look into how in-memory databases work, it is worthwhile to recall how data transfer happens in a typical computer:

Simple Data Flow Computer Hierarchy

As shown in the preceding image, data traverses from disk to memory to the CPU. This is a very high-level generalization of the exact process as there are conditions...
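
As a small illustration of keeping a dataset entirely in RAM, the sketch below uses SQLite's in-memory mode through Python's built-in sqlite3 module. This is a stand-in chosen for convenience, not one of the in-memory products discussed in this chapter, and the readings table is hypothetical:

```python
import sqlite3

# ":memory:" asks SQLite to hold the whole database in RAM;
# no database file is created on disk.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
mem.executemany("INSERT INTO readings VALUES (?, ?)",
                [("a", 1.5), ("a", 2.5), ("b", 4.0)])
mem.commit()

# Queries are answered from memory, without reading from disk.
avg_a = mem.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = 'a'").fetchone()[0]

# Each ":memory:" connection is a private database, so a second
# connection starts empty and sees none of the tables created above.
other = sqlite3.connect(":memory:")
tables_in_other = other.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()

print(avg_a, tables_in_other)
```

The second connection also hints at the usual trade-off: data held only in memory is lost when the owning process or connection goes away, unless it is persisted separately.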

Analyzing Nobel Laureates data with MongoDB

In the first exercise, we will use MongoDB, one of the leading document-oriented databases, to analyze Nobel Laureates from 1902-present. MongoDB provides a simple and intuitive interface to work with JSON files. As discussed earlier, JSON is a flexible format that allows representing data using a structured approach.

JSON format

Consider the following table:

Firstname	Age	Information
John	15	Subject: History, Grade B
Jack	18	Subject: Physics, Grade A
Jill	17	Subject: Physics, Grade A+

The Information column holds multiple values in each row, categorized under Subject and Grade. Columns that hold multiple values in this way are also known as columns with nested data.
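
The nested column above can be sketched in JSON using Python's built-in json module. The key names are assumptions made for illustration. In particular, the numeric second column is treated here as an age:

```python
import json

# The three rows of the table, with the nested column split into its parts.
rows = [
    ("John", 15, "History", "B"),
    ("Jack", 18, "Physics", "A"),
    ("Jill", 17, "Physics", "A+"),
]

# Each row becomes one document; Information is itself an object,
# which is how JSON represents nested data.
records = [
    {
        "Firstname": first,
        "Age": age,
        "Information": {"Subject": subject, "Grade": grade},
    }
    for first, age, subject, grade in rows
]

text = json.dumps(records, indent=2)  # serialize to a JSON string
restored = json.loads(text)           # parse it back into Python objects

print(text)
```

Because the nesting is part of the format, the Subject and Grade values can be addressed directly, for example `restored[0]["Information"]["Subject"]`, without first parsing a combined text field.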

Portability has been an important aspect of transferring data from one system to another. In general, ODBC connectors are used to transfer data between database systems. Another common format is CSV files with the data represented as comma-separated values. CSV files are optimal for structured...

Tracking physician payments with real-world data

Physicians and hospitals alike receive payments from various external organizations, such as pharmaceutical companies that engage sales representatives not only to educate practitioners on their products, but also to provide gifts or payments in kind or otherwise. In theory, gifts or payments made to physicians are not intended to influence their prescribing behavior, and pharmaceutical companies adopt careful measures to maintain checks and balances on payments being made to healthcare providers.

In 2010, President Obama's signature Affordable Care Act (ACA), also known in popular parlance as Obamacare, was signed into law. Enacted as part of the ACA, a provision known as the Sunshine Act made reporting items of monetary value (directly or indirectly) mandatory for pharmaceutical companies and other organizations. While such rules existed in the past, the resulting records were rarely available in the public domain. By making detailed payment records made...

The CMS Open Payments Portal

In this section, we will begin developing our application for CMS Open Payments.

The Packt Data Science VM contains all the necessary software for this tutorial. To download the VM, please refer to Chapter 3, The Analytics Toolkit.

Downloading the CMS Open Payments data

The CMS Open Payments data is available directly as a web-based download from the CMS website. We'll download the data using the Unix wget utility, but first we have to register with the CMS website to get our own API key:

Go to https://openpaymentsdata.cms.gov and click on the Sign In link at the top-right of the page:

Homepage of CMS OpenPayments

Click on Sign Up:

Sign-Up Page on CMS OpenPayments

Enter your information and click on the Create My Account button:

Sign-Up Form for CMS OpenPayments

Sign In to your account:

Signing into CMS OpenPayments

Click on Manage under Packt Developer's Applications. Note that Applications here refers to apps that you may create that will query data available on the...

R Shiny platform for developers

R Shiny introduced a platform for R developers to create JavaScript-based web applications without having to get involved with, or for that matter even be proficient in, JavaScript.

In order to build our application, we will leverage R Shiny and create an interface to connect to the CMS Open Payments data we set up in the prior section.

If you are using your own R installation (locally), you'll need to install a few R packages. Note that if you are using a Linux workstation, you may need to install some additional Linux packages. For example, in Ubuntu Linux, you'll need to install the following. You may already have some of the packages, in which case you'll receive a message indicating that no further changes were needed for the respective package:

sudo apt-get install software-properties-common libssl-dev libcurl4-openssl-dev gdebi-core rlwrap

Note

If you are using the Packt Data Science VM, you can proceed directly to developing the application as these Linux packages...

Summary

This chapter introduced the concept of NoSQL. The term has gained popularity in recent years, especially due to its relevance and direct application to big data analytics. We discussed the core terminology of NoSQL, the various types of NoSQL databases, and popular software used in the industry for such capabilities. We concluded with a couple of tutorials using MongoDB and kdb+.

We also built an application using R and R Shiny to create a dynamic web interface to interact with the data loaded in kdb+.

The next chapter will introduce another common technology in data science today, known as Spark. It is yet another toolkit that empowers data scientists across the world today.

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.