
You're reading from  Practical Big Data Analytics

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783554393
Edition: 1st Edition
Author: Nataraj Dasgupta

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.

Chapter 9. Enterprise Data Science

We have thus far discussed various topics in both data mining and machine learning. Most of the examples shown were designed so that anyone with a standard computer could run them and complete the exercises. In real-world enterprise settings, datasets are typically much larger than those encountered in general home use.

Traditionally, we have relied on well-known database technologies such as SQL Server, Oracle, and others for organizational data warehousing and data management. The advent of NoSQL and Hadoop-based solutions significantly changed this model of operation. Although companies were at first reluctant, the popular appeal of these tools became too large to ignore, and today most, if not all, large organizations leverage one or more non-traditional, contemporary solutions for their enterprise data requirements.

Furthermore, the advent of cloud computing has transformed most businesses, and in-house data centers are being rapidly replaced...

Enterprise data science overview


Data science is a relatively new topic in terms of enterprise IT and analytics. Traditionally, researchers and analysts belonged broadly to one of two categories:

  • Highly technical researchers who used complex computing languages and/or hardware for their professional tasks
  • Analysts who could use tools such as Excel and BI platforms in order to perform both simple and complex data analysis

Organizations started looking into Big Data and, more generally, data science platforms in the late 2000s. The field had gained immense momentum by 2013, by which time solutions such as Hadoop and NoSQL platforms had been released. The following table shows the developments in data science:

A roadmap to enterprise analytics success


In our experience, analytics, a fairly recent term compared to well-established ones such as data warehousing, requires a careful approach to ensure both the immediate success and the longevity of the initiative.

Organizations that attempt their first analytics project prematurely, as a large-scale, high-budget engagement, run the risk of jeopardizing the entire initiative if the project does not turn out as expected.

Moreover, in such projects, the outcome measures are often not clearly defined. In other words, measuring the value of the outcome is ambiguous, and sometimes it cannot be quantified at all. This arises because the benefits of a successful analytics initiative extend beyond immediate monetary or technical gains. A successful analytics project often helps to foster executive confidence in the department's ability to conduct such projects, which in turn may lead to bigger endeavors.

The general...

Data science solutions in the enterprise


As discussed before, we can broadly categorize data science into two primary areas:

  • Enterprise data warehouse and data mining
  • Enterprise data science: machine learning, artificial intelligence

In this section, we will look at each of these individually and discuss both the software and hardware solutions used in the industry for delivering these capabilities.

Enterprise data warehouse and data mining

Today, there are scores of databases available in the industry that are marketed as NoSQL systems capable of running complex analytical queries. Most of them have one or more characteristics of typical NoSQL systems: they may be columnar, in-memory, key-value, document-oriented, graph-based, and so on. The next section highlights some of the key enterprise NoSQL systems in use today.

Traditional data warehouse systems

Traditional data warehouses might be a misnomer, since most of the traditional systems have also incorporated core concepts of NoSQL. However...

Enterprise data science – machine learning and AI


Data science solutions have matured rapidly over the past 4 to 5 years, in step with related areas such as NoSQL, Hadoop, and other data mining solutions. Although many of the database systems discussed earlier also incorporate key data science features, such as machine learning, this section highlights, at a high level, some of the solutions that are used primarily for machine learning and/or AI, as opposed to data management.

Indeed, the distinction between Big Data products and data science products has become blurred, since products that were originally intended for Big Data handling have incorporated key features of data science, and vice versa.

The R programming language

R, as we have seen in prior chapters, is an environment originally designed for statistical programming. It emerged out of a project at the University of Auckland, New Zealand, where Ross Ihaka and Robert Gentleman developed R as a variation of the S...

Enterprise infrastructure solutions


The proper choice of infrastructure also plays a key role in determining the efficiency of the organization's data science platform. With too little, the algorithms will take too long to execute; with too much, a lot of resources may remain unutilized. Even so, having too much is preferable to having too little, which thwarts progress and prevents machine learning researchers from performing their tasks efficiently.

Cloud computing

Over the past 5 to 7 years, organizations have gradually shifted their resources to cloud-based platforms such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform. Today, all of these offer sophisticated and extensive services that support machine learning, data mining, and data science in general at an enterprise level, meeting the needs of organizations of all sizes.

In addition, the concept of images, such as Amazon Machine Images (AMIs) in AWS, allows users to initiate a pre-built snapshot of...

Tutorial – using RStudio in the cloud


The following tutorial will demonstrate how to create an account on AWS (Amazon Web Services), load an AMI for RStudio, and thereafter use RStudio, all at no charge. Readers who are experienced in using cloud platforms may find the instructions quite basic. For other users, the tutorial should provide helpful initial guidance on using AWS.

Please read the warning below before proceeding.

Warning: AWS requires a credit card for signup. Users must be careful to select only the options for the FREE TIER. The AWS agreement permits Amazon to bill users for incurred charges. For this reason, users should use the platform judiciously to avoid unexpected and potentially expensive charges from servers or services that are left running.

Note

As of this time, Azure and Google Cloud offer user signups with provisions to avoid inadvertent charges. However, AWS has the highest market share among all cloud vendors, and users are likely to encounter...

Summary


In this chapter, we discussed the requirements for deploying enterprise-scale data science infrastructures at both the software and the hardware level. We shared common questions that arise around such initiatives at a management level. This was followed by an extensive section on key enterprise solutions that are being used for data mining and machine learning in large organizations.

The tutorial involved launching an RStudio Server on Amazon Web Services (a cloud-based system). AWS has become the leading provider of cloud services in the world today, and the exercise showed how simple it can be to launch entire machines in a few seconds. We also noted the need to use AWS judiciously and carefully in order to avoid very expensive charges.

The next and final chapter will include some closing thoughts, the next steps, and links to useful resources you can use to learn more about the topics that have been discussed in this book.


Year | Developments
1970s to late 1990s | Widespread use of relational database management systems. The entity-relationship model, Structured Query Language (SQL), and other developments eventually led to a rapid expansion of databases in the late 90s.
Early 2000s | The anticlimactic, yet expensive, non-event of Y2K, coupled...