Data Analysis with Python

By David Taieb

About this book

Data Analysis with Python offers a modern approach to data analysis so that you can work with the latest and most powerful Python tools, AI techniques, and open source libraries. Industry expert David Taieb shows you how to bridge data science with the power of programming and algorithms in Python. You'll be working with complex algorithms, and cutting-edge AI in your data analysis. Learn how to analyze data with hands-on examples using Python-based tools and Jupyter Notebook. You'll find the right balance of theory and practice, with extensive code files that you can integrate right into your own data projects.

Explore the power of this approach to data analysis by working with it across key industry case studies. Four complete projects connect you to the most critical data analysis challenges you're likely to meet today. The first is an image recognition application with TensorFlow, reflecting the importance of AI in data analysis today. The second analyzes social media trends, exploring big data issues and AI approaches to natural language processing. The third is a financial portfolio analysis application that engages you with time series analysis, which is pivotal to many data science applications. The fourth takes you into graph algorithms and the power of programming in modern data science. You'll wrap up with a thoughtful look at the future of data science and how it will harness the power of algorithms and artificial intelligence.

Publication date:
December 2018
Publisher
Packt
Pages
490
ISBN
9781789950069

 

Chapter 1. Programming and Data Science – A New Toolset

"Data is a precious thing and will last longer than the systems themselves."

Tim Berners-Lee, inventor of the World Wide Web

(https://en.wikipedia.org/wiki/Tim_Berners-Lee)

In this introductory chapter, I'll start the conversation by attempting to answer a few fundamental questions that will hopefully provide context and clarity for the rest of this book:

  • What is data science and why is it on the rise

  • Why is data science here to stay

  • Why do developers need to get involved in data science

Using my experience as a developer and recent data science practitioner, I'll then discuss a concrete data pipeline project that I worked on and the data science strategy derived from this work, which rests on three pillars: data, services, and tools. I'll end the chapter by introducing Jupyter Notebooks, which are at the center of the solution I'm proposing in this book.

 

What is data science


If you search the web for a definition of data science, you will certainly find many. This reflects the reality that data science means different things to different people. There is no real consensus on what data scientists exactly do and what training they must have; it all depends on the task they're trying to accomplish, for example, data collection and cleaning, data visualization, and so on.

For now, I'll try to use a universal and, hopefully, consensual definition: data science refers to the activity of analyzing a large amount of data in order to extract knowledge and insight leading to actionable decisions. It's still pretty vague though; one can ask: what kind of knowledge, insight, and actionable decisions are we talking about?

To orient the conversation, let's reduce the scope to three fields of data science:

  • Descriptive analytics: Data science is associated with information retrieval and data collection techniques with the goal of reconstituting past events to identify patterns and find insights that help understand what happened and what caused it to happen. An example of this is looking at sales figures and demographics by region to categorize customer preferences. This part requires being familiar with statistics and data visualization techniques.

  • Predictive analytics: Data science is a way to predict the likelihood that some events are currently happening or will happen in the future. In this scenario, the data scientist looks at past data to find explanatory variables and build statistical models that can be applied to other data points for which we're trying to predict the outcome, for example, predicting the likelihood that a credit card transaction is fraudulent in real-time. This part is usually associated with the field of machine learning.

  • Prescriptive analytics: In this scenario, data science is seen as a way to make better decisions, or perhaps I should say data-driven decisions. The idea is to look at multiple options and, using simulation techniques, quantify and maximize the outcome, for example, optimizing the supply chain by minimizing operating costs.

In essence, descriptive data science answers the question of what (does the data tell me), predictive data science answers the question of why (is the data behaving a certain way), and prescriptive data science answers the question of how (do we optimize the data toward a specific goal).

 

Is data science here to stay?


Let's get straight to the point from the start: I strongly think that the answer is yes.

However, that was not always the case. A few years back, when I first started hearing about data science as a concept, I initially thought that it was yet another marketing buzzword to describe an activity that already existed in the industry: Business Intelligence (BI). As a developer and architect working mostly on solving complex system integration problems, it was easy to convince myself that I didn't need to get directly involved in data science projects, even though it was obvious that their numbers were on the rise, the reason being that developers traditionally deal with data pipelines as black boxes that are accessible with well-defined APIs. However, in the last decade, we've seen exponential growth in data science interest both in academia and in the industry, to the point it became clear that this model would not be sustainable.

As data analytics play a bigger and bigger role in a company's operational processes, the developer's role has expanded to get closer to the algorithms and build the infrastructure that runs them in production. Another piece of evidence that data science has become the new gold rush is the extraordinary growth of data scientist jobs, which have been ranked number one for 2 years in a row on Glassdoor (https://www.prnewswire.com/news-releases/glassdoor-reveals-the-50-best-jobs-in-america-for-2017-300395188.html) and are consistently posted the most by employers on Indeed. Headhunters are also on the prowl on LinkedIn and other social media platforms, sending tons of recruiting messages to whoever has a profile showing any data science skills.

One of the main reasons behind all the investment being made in these new technologies is the hope that they will yield major improvements and greater efficiencies in the business. However, even though it is a growing field, data science in the enterprise today is still confined to experimentation instead of being a core activity, as one would expect given all the hype. This has led a lot of people to wonder whether data science is a passing fad that will eventually subside, yet another technology bubble that will eventually pop, leaving a lot of people behind.

These are all good points, but I quickly realized that it was more than just a passing fad; more and more of the projects I was leading included the integration of data analytics into the core product features. Finally, it was when the IBM Watson Question Answering system won a game of Jeopardy! against two experienced champions that I became convinced that data science, along with the cloud, big data, and Artificial Intelligence (AI), was here to stay and would eventually change the way we think about computer science.

 

Why is data science on the rise?


There are multiple factors involved in the meteoric rise of data science.

First, the amount of data being collected keeps growing at an exponential rate. According to recent market research from the IBM Marketing Cloud (https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=WRL12345GBEN), something like 2.5 quintillion bytes are created every day (to give you an idea of how big that is, that's 2.5 billion billion bytes), yet only a tiny fraction of this data is ever analyzed, leaving tons of missed opportunities on the table.

Second, we're in the midst of a cognitive revolution that started a few years ago; almost every industry is jumping on the AI bandwagon, which includes natural language processing (NLP) and machine learning. Even though these fields have existed for a long time, they have recently enjoyed renewed attention, to the point that they are now among the most popular courses in colleges and get the lion's share of open source activity. It is clear that, if they are to survive, companies need to become more agile, move faster, and transform into digital businesses, and as the time available for decision-making is shrinking to near real-time, they must become fully data-driven. If you also include the fact that AI algorithms need high-quality data (and a lot of it) to work properly, we can start to understand the critical role played by data scientists.

Third, with advances in cloud technologies and the development of Platform as a Service (PaaS), access to massive compute engines and storage has never been easier or cheaper. Running big data workloads, once the purview of large corporations, is now available to smaller organizations or any individuals with a credit card; this, in turn, is fueling the growth of innovation across the board.

For these reasons, I have no doubt that, similar to the AI revolution, data science is here to stay and that its growth will continue for a long time. But we also can't ignore the fact that data science hasn't yet realized its full potential and produced the expected results, in particular helping companies in their transformation into data-driven organizations. Most often, the challenge is achieving that next step, which is to transform data science and analytics into a core business activity that ultimately enables clear-sighted, intelligent, bet-the-business decisions.

 

What does that have to do with developers?


This is a very important question that we'll spend a lot of time developing in the coming chapters. Let me start by looking back at my professional journey; I have spent most of my career, which dates back over 20 years, as a developer working on many aspects of computer science.

I started by building various tools that helped with software internationalization by automating the process of translating the user interface into multiple languages. I then worked on a LotusScript (scripting language for Lotus Notes) editor for Eclipse that would interface directly with the underlying compiler. This editor provided first-class development features, such as content assist (which provides suggestions), real-time syntax error reporting, and so on. I then spent a few years building middleware components based on Java EE and OSGi (https://www.osgi.org) for the Lotus Domino server. During that time, I led a team that modernized the Lotus Domino programming model by bringing it to the latest technologies available at the time. I was comfortable with all aspects of software development: frontend, middleware, backend data layer, tooling, and so on; I was what some would call a full-stack developer.

That was until I saw a demo of the IBM Watson Question Answering system that beat longtime champions Brad Rutter and Ken Jennings at a game of Jeopardy! in 2011. Wow! This was groundbreaking, a computer program capable of answering natural language questions. I was very intrigued and, after doing some research, meeting with a few researchers involved in the project, and learning about the techniques used to build this system, such as NLP, machine learning, and general data science, I realized how much potential this technology would have if applied to other parts of the business.

A few months later, I got an opportunity to join the newly formed Watson Division at IBM, leading a tooling team with the mission to build data ingestion and accuracy analysis capabilities for the Watson system. One of our most important requirements was to make sure the tools were easy to use by our customers, which is why, in retrospect, giving this responsibility to a team of developers was the right move. From my perspective, stepping into that job was both challenging and enriching. I was leaving a familiar world where I excelled at designing architectures based on well-known patterns and implementing frontend, middleware, or backend software components to a world focused mostly on working with a large amount of data; acquiring it, cleansing it, analyzing it, visualizing it, and building models. I spent the first six months drinking from the firehose, reading, and learning about NLP, machine learning, information retrieval, and statistical data science, at least enough to be able to work on the capabilities I was building.

It was at that time, interacting with the research team to bring these algorithms to market, that I realized how important it was for developers and data scientists to collaborate better. The traditional approach of having data scientists solve complex data problems in isolation and then throw the results "over the wall" for developers to operationalize is not sustainable and doesn't scale, considering that the amount of data to process keeps growing exponentially and the required time to market keeps shrinking.

Instead, their roles need to shift toward working as one team, which means that data scientists must work and think like software developers and vice versa. Indeed, this looks very good on paper: on the one hand, data scientists will benefit from tried-and-true software development methodologies such as Agile, with its rapid iterations and frequent feedback, but also from a rigorous software development life cycle that brings compliance with enterprise needs, such as security, code reviews, source control, and so on. On the other hand, developers will start thinking about data in a new way: as analytics meant to discover insights instead of just a persistence layer with queries and CRUD (short for create, read, update, delete) APIs.

 

Putting these concepts into practice


After 4 years as the Watson Core Tooling lead architect building self-service tooling for the Watson Question Answering system, I joined the Developer Advocacy team of the Watson Data Platform organization which has the expanded mission of creating a platform that brings the portfolio of data and cognitive services to the IBM public cloud. Our mission was rather simple: win the hearts and minds of developers and help them be successful with their data and AI projects.

The work had multiple dimensions: education, evangelism, and activism. The first two are pretty straightforward, but the concept of activism is relevant to this discussion and worth explaining in more detail. As the name implies, activism is about bringing change where change is needed. For our team of 15 developer advocates, this meant walking in the shoes of developers as they try to work with data (whether they're only getting started or already operationalizing advanced algorithms), feeling their pain, and identifying the gaps that should be addressed. To that end, we built and open sourced numerous sample data pipelines with real-life use cases.

At a minimum, each of these projects needed to satisfy three requirements:

  • The raw data used as input must be publicly available

  • Clear instructions must be provided for deploying the data pipeline on the cloud in a reasonable amount of time

  • Developers should be able to use the project as a starting point for similar scenarios, that is, the code must be highly customizable and reusable

The experience and insights we gained from these exercises were invaluable:

  • Understanding which data science tools are best suited for each task

  • Best practice frameworks and languages

  • Best practice architectures for deploying and operationalizing analytics

The metrics that guided our choices were multiple: accuracy, scalability, code reusability, but most importantly, improved collaboration between data scientists and developers.

 

Deep diving into a concrete example


Early on, we wanted to build a data pipeline that extracted insights from Twitter by doing sentiment analysis of tweets containing specific hashtags and to deploy the results to a real-time dashboard. This application was a perfect starting point for us, because the data science analytics were not too complex, and the application covered many aspects of a real-life scenario:

  • High volume, high throughput streaming data

  • Data enrichment with sentiment analysis NLP

  • Basic data aggregation

  • Data visualization

  • Deployment into a real-time dashboard

To try things out, the first implementation was a simple Python application that used the tweepy library (a popular open source Twitter library for Python: https://pypi.python.org/pypi/tweepy) to connect to Twitter and get a stream of tweets, and textblob (a simple Python library for basic NLP: https://pypi.python.org/pypi/textblob) for sentiment analysis enrichment.

The results were then saved into a JSON file for analysis. This prototype was a great way to get things started and experiment quickly, but after a few iterations we realized that we needed to get serious and build an architecture that satisfied our enterprise requirements.
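
The shape of that prototype can be sketched in a few lines. This is a minimal illustration, not the original code: a hard-coded list of sample tweets stands in for the tweepy stream, and a toy word-list scorer stands in for textblob's sentiment analysis so that the sketch runs with the standard library only. The names SAMPLE_TWEETS, POSITIVE, NEGATIVE, score_sentiment, enrich, and save_as_json are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for a live stream of tweets.
SAMPLE_TWEETS = [
    {"id": 1, "text": "I love the new release, great work! #python"},
    {"id": 2, "text": "Terrible outage again, I hate waiting #cloud"},
    {"id": 3, "text": "Conference starts tomorrow #python"},
]

# Toy word lists (assumption): a real prototype would delegate to an NLP library.
POSITIVE = {"love", "great", "good", "awesome"}
NEGATIVE = {"hate", "terrible", "bad", "awful"}


def score_sentiment(text):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = [w.strip(".,!?#").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def enrich(tweets):
    """Yield a copy of each tweet with a 'sentiment' field added."""
    for tweet in tweets:
        yield {**tweet, "sentiment": score_sentiment(tweet["text"])}


def save_as_json(records, path):
    """Write the enriched records to a JSON file for later analysis."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(records), f, indent=2)


if __name__ == "__main__":
    out_path = os.path.join(tempfile.gettempdir(), "tweets_sentiment.json")
    save_as_json(enrich(SAMPLE_TWEETS), out_path)
    print("saved", out_path)
```

Keeping the enrichment step as a generator means the same function would work unchanged on an unbounded stream, which is one reason this kind of prototype is quick to iterate on.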

 

Data pipeline blueprint


At a high level, data pipelines can be described using the following generic blueprint:

Data pipeline workflow

The main objective of a data pipeline is to operationalize (that is, provide direct business value) the data science analytics outcome in a scalable, repeatable process, and with a high degree of automation. Examples of analytics could be a recommendation engine to entice consumers to buy more products, for example, the Amazon recommended list, or a dashboard showing Key Performance Indicators (KPIs) that can help a CEO make future decisions for the company.

Several roles are involved in building a data pipeline:

  • Data engineers: They are responsible for designing and operating information systems. In other words, data engineers are responsible for interfacing with data sources to acquire the data in its raw form and then massage it (some call this data wrangling) until it is ready to be analyzed. In the Amazon recommender system example, they would implement a streaming processing pipeline that captures and aggregates specific consumer transaction events from the e-commerce system of records and stores them into a data warehouse.

  • Data scientists: They analyze the data and build the analytics that extract insight. In our Amazon recommender system example, they could use a Jupyter Notebook that connects to the data warehouse to load the dataset and build a recommendation engine using, for example, a collaborative filtering algorithm (https://en.wikipedia.org/wiki/Collaborative_filtering).

  • Developers: They are responsible for operationalizing the analytics into an application targeted at line of business users (business analysts, C-Suite, end users, and so on). Again, in the Amazon recommender system, the developer will present the list of recommended products after the user has completed a purchase or via a periodic email.

  • Line of business users: This encompasses all users that consume the output of data science analytics, for example, business analysts analyzing dashboards to monitor the health of a business or the end user using an application that provides a recommendation as to what to buy next.
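
To make the data scientist's task in the recommender example more concrete, here is a minimal sketch of item-based collaborative filtering using cosine similarity. It is one simple variant of the technique among many, not the method of any actual recommender system; the ratings data and all function names are made up for illustration.

```python
from math import sqrt

# Hypothetical purchase history: user -> {item: rating}.
RATINGS = {
    "alice": {"book": 5, "lamp": 3, "pen": 4},
    "bob": {"book": 4, "pen": 5},
    "carol": {"lamp": 4, "desk": 5},
    "dave": {"book": 5, "desk": 2, "pen": 4},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    vectors = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by similarity-weighted score."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        # Weight each already-rated item by how similar it is to the candidate.
        scores[candidate] = sum(
            cosine(vectors[candidate], vectors[item]) * rating
            for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Calling recommend("bob", RATINGS) ranks the items bob has not yet rated; in a real pipeline, the ratings would be loaded from the data warehouse populated by the data engineers.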

Note

In real life, it is not uncommon for the same person to play more than one of the roles described here; this may mean that one person has multiple, different needs when interacting with a data pipeline.

As the preceding diagram suggests, building a data science pipeline is iterative in nature and adheres to a well-defined process:

  1. Acquire Data: This step includes acquiring the data in its raw form from a variety of sources: structured (RDBMS, system of records, and so on) or unstructured (web pages, reports, and so on):

    • Data cleansing: Check for integrity, fill missing data, fix incorrect data, and data munging

    • Data prep: Enrich, detect/remove outliers, and apply business rules

  2. Analyze: This step combines descriptive (understand the data) and prescriptive (build models) activities:

    • Explore: Find statistical properties, for example, central tendency, standard deviation, distribution, and variable identification, such as univariate and bivariate analysis, the correlation between variables, and so on.

    • Visualization: This step is extremely important to properly analyze the data and form hypotheses. Visualization tools should provide a reasonable level of interactivity to facilitate understanding of the data.

    • Build model: Apply inferential statistics to form hypotheses, such as selecting features for the models. This step usually requires expert domain knowledge and is subject to a lot of interpretation.

  3. Deploy: Operationalize the output of the analysis phase:

    • Communicate: Generate reports and dashboards that communicate the analytic output clearly for consumption by the line of business user (C-Suite, business analyst, and so on)

    • Discover: Set a business outcome objective that focuses on discovering new insights and business opportunities that can lead to a new source of revenue

    • Implement: Create applications for end-users

  4. Test: This activity should really be included in every step, but here we're talking about creating a feedback loop from field usage:

    • Create metrics that measure the accuracy of the models

    • Optimize the models, for example, get more data, find new features, and so on
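
As a rough illustration of the data cleansing, data prep, and explore activities listed above, the following sketch fills missing values, drops outliers, and computes a few statistical properties using only the Python standard library. The dataset, the choice of median imputation, and the median-absolute-deviation (MAD) outlier rule are assumptions made for this example; real projects would typically rely on a dedicated data analysis library.

```python
import statistics

# Hypothetical raw records: daily sales with a missing value and an outlier.
RAW = [
    {"visitors": 100, "sales": 20.0},
    {"visitors": 120, "sales": None},
    {"visitors": 90, "sales": 18.0},
    {"visitors": 110, "sales": 22.0},
    {"visitors": 105, "sales": 500.0},  # suspicious spike
    {"visitors": 130, "sales": 26.0},
]


def fill_missing(rows, field):
    """Data cleansing: replace missing values with the field's median."""
    median = statistics.median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: median if r[field] is None else r[field]} for r in rows]


def remove_outliers(rows, field, threshold=5.0):
    """Data prep: drop rows far from the median, measured in MADs."""
    values = [r[field] for r in rows]
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return list(rows)
    return [r for r in rows if abs(r[field] - median) / mad <= threshold]


def pearson(xs, ys):
    """Explore: Pearson correlation between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


clean = remove_outliers(fill_missing(RAW, "sales"), "sales")
sales = [r["sales"] for r in clean]
visitors = [r["visitors"] for r in clean]
print("mean:", statistics.mean(sales))
print("stdev:", statistics.stdev(sales))
print("correlation:", pearson(visitors, sales))
```

Note that each choice here (how to impute, what counts as an outlier) is exactly the kind of decision where domain knowledge matters, since a spike may be an error in one dataset and a genuine event in another.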

 

What kind of skills are required to become a data scientist?


In the industry, the reality is that data science is so new that companies do not yet have a well-defined career path for it. How do you get hired for a data scientist position? How many years of experience are required? What skills do you need to bring to the table? Math, statistics, machine learning, information technology, computer science, and what else?

Well, the answer is probably a little bit of everything plus one more critical skill: domain-specific expertise.

There is a debate going on around whether applying generic data science techniques to any dataset, without an intimate understanding of its meaning, leads to the desired business outcome. Many companies are leaning toward making sure data scientists have a substantial amount of domain expertise, the rationale being that without it you may unknowingly introduce bias at any step, such as when filling the gaps in the data cleansing phase or during the feature selection process, and ultimately build models that may well fit a given dataset but still end up being worthless. Imagine a data scientist with no chemistry background studying unwanted molecule interactions for a pharmaceutical company developing new drugs. This is also probably why we're seeing a multiplication of statistics courses specialized in a particular domain, such as biostatistics for biology, or supply chain analytics for analyzing operation management related to supply chains, and so on.

To summarize, a data scientist should be in theory somewhat proficient in the following areas:

  • Data engineering / information retrieval

  • Computer science

  • Math and statistics

  • Machine learning

  • Data visualization

  • Business intelligence

  • Domain-specific expertise

Note

If you are thinking about acquiring these skills but don't have the time to attend traditional classes, I strongly recommend using online courses.

I particularly recommend this course on Coursera (https://www.coursera.org/): https://www.coursera.org/learn/data-science-course.

The classic Drew Conway Venn diagram provides an excellent visualization of what data science is and why data scientists are a bit of a unicorn:

Drew Conway's Data Science Venn Diagram

By now, I hope it has become pretty clear that the perfect data scientist who fits the preceding description is more the exception than the norm and that, most often, the role involves multiple personas. Yes, that's right: the point I'm trying to make is that data science is a team sport, and this idea will be a recurring theme throughout this book.

 

IBM Watson DeepQA


One project that exemplifies the idea that data science is a team sport is the IBM DeepQA research project, which originated as an IBM grand challenge to build an artificial intelligence system capable of answering natural language questions against predetermined domain knowledge. The Question Answering (QA) system had to be good enough to compete with human contestants on the popular television game show Jeopardy!.

As is widely known, this system, dubbed IBM Watson, went on to win the competition in 2011 against two of the most seasoned Jeopardy! champions: Ken Jennings and Brad Rutter. The following photo was taken from the actual game that aired in February 2011:

IBM Watson battling Ken Jennings and Brad Rutter at Jeopardy!

Source: https://upload.wikimedia.org/wikipedia/e

It was during the time that I was interacting with the research team that built the IBM Watson QA computer system that I got to take a closer look at the DeepQA project architecture and realized first-hand how many data science fields were actually put to use.

The following diagram depicts a high-level architecture of the DeepQA data pipeline:

Watson DeepQA architecture diagram

Source: https://researcher.watson.ibm.com/researcher/files/us-mi

As the preceding diagram shows, the data pipeline for answering a question is composed of the following high-level steps:

  1. Question & Topic Analysis (natural language processing): This step uses a deep parsing component that detects dependency and hierarchy between the words that compose the question. The goal is to gain a deeper understanding of the question and extract fundamental properties, such as the following:

    • Focus: What is the question about?

    • Lexical Answer Type (LAT): What is the type of the expected answer (for example, a person, a place, and so on)? This information is very important during the scoring of candidate answers as it provides an early filter for answers that don't match the LAT.

    • Named-entity resolution: This resolves an entity into a standardized name, for example, "Big Apple" to "New York".

    • Anaphora resolution: This links pronouns to previous terms in the question, for example, in the sentence "On Sept. 1, 1715 Louis XIV died in this city, site of a fabulous palace he built," the pronoun "he" refers to Louis XIV.

    • Relations detection: This detects relations within the question, for example, "She divorced Joe DiMaggio in 1954" where the relation is "Joe DiMaggio Married X." These types of relations (Subject->Predicate->Object) can be used to query triple stores and yield high-quality candidate answers.

    • Question class: This maps the question to one of the predefined types used in Jeopardy!, for example, factoid, multiple-choice, puzzle, and so on.

  2. Primary search and Hypothesis Generation (information retrieval): This step relies heavily on the results of the question analysis step to assemble a set of queries adapted to the different answer sources available. Examples of answer sources include a variety of full-text search engines, such as Indri (https://www.lemurproject.org/indri.php) and Apache Lucene/Solr (http://lucene.apache.org/solr), document-oriented and title-oriented search (Wikipedia), triple stores, and so on. The search results are then used to generate candidate answers. For example, title-oriented results will be directly used as candidates while document searches will require more detailed analysis of the passages (again using NLP techniques) to extract possible candidate answers.

  3. Hypothesis and Evidence scoring (NLP and information retrieval): For each candidate answer, another round of search is performed to find additional supporting evidence using different scoring techniques. This step also acts as a prescreening test where some of the candidate answers are eliminated, such as the answers that do not match the LAT computed from step 1. The output of this step is a set of machine learning features corresponding to the supporting evidence found. These features will be used as input to a set of machine learning models for scoring the candidate answers.

  4. Final merging and scoring (machine learning): During this final step, the system identifies variants of the same answer and merges them together. It also uses machine learning models to select the best answers ranked by their respective scores, using the features generated in step 3. These machine learning models have been trained on a set of representative questions with the correct answers against a corpus of documents that has been pre-ingested.
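
To give a feel for how relations, the LAT filter, and the final merging step fit together, here is a toy sketch. It is in no way the DeepQA implementation: the small triple list, the candidate answers with their scores, and the alias table are illustrative values chosen for this example, and a simple sum of scores stands in for the trained machine learning models. All names (TRIPLES, query, CANDIDATES, ALIASES, rank) are hypothetical.

```python
from collections import defaultdict

# Hypothetical triple store: (subject, predicate, object).
TRIPLES = [
    ("Joe DiMaggio", "married", "Marilyn Monroe"),
    ("Joe DiMaggio", "married", "Dorothy Arnold"),
    ("Arthur Miller", "married", "Marilyn Monroe"),
]


def query(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Hypothetical candidates: (answer text, answer type, evidence score).
CANDIDATES = [
    ("Marilyn Monroe", "person", 0.6),
    ("Norma Jeane Mortenson", "person", 0.3),
    ("Hollywood", "place", 0.5),
    ("Dorothy Arnold", "person", 0.4),
]

# Hypothetical alias table used to merge variants of the same answer.
ALIASES = {"Norma Jeane Mortenson": "Marilyn Monroe"}


def rank(candidates, lat):
    """Filter by LAT, merge variants, and sort by combined score."""
    merged = defaultdict(float)
    for text, answer_type, score in candidates:
        if answer_type != lat:
            continue  # early filter: wrong lexical answer type
        merged[ALIASES.get(text, text)] += score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

Here, query(TRIPLES, subject="Joe DiMaggio", predicate="married") plays the role of candidate generation from a triple store, and rank(CANDIDATES, "person") plays the role of the prescreening and final merging steps.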

As we continue the discussion on how data science and AI are changing the field of computer science, I thought it was important to look at the state of the art. IBM Watson is one of the flagship projects that paved the way for the many advances we've seen since it beat Ken Jennings and Brad Rutter at the game of Jeopardy!.

 

Back to our sentiment analysis of Twitter hashtags project


The quick data pipeline prototype we built gave us a good understanding of the data, but then we needed to design a more robust architecture and make our application enterprise ready. Our primary goal was still to gain experience in building data analytics, and not spend too much time on the data engineering part. This is why we tried to leverage open source tools and frameworks as much as possible:

  • Apache Kafka (https://kafka.apache.org): This is a scalable streaming platform for processing the high volume of tweets in a reliable and fault-tolerant way.

  • Apache Spark (https://spark.apache.org): This is an in-memory cluster-computing framework. Spark provides a programming interface that abstracts the complexity of parallel computing.

  • Jupyter Notebooks (http://jupyter.org): These interactive web-based documents (Notebooks) let users remotely connect to a computing environment (Kernel) to create advanced data analytics. Jupyter Kernels support a variety of programming languages (Python, R, Java/Scala, and so on) as well as multiple computing frameworks (Apache Spark, Hadoop, and so on).

For the sentiment analysis part, we decided to replace the code we wrote using the textblob Python library with the Watson Tone Analyzer service (https://www.ibm.com/watson/services/tone-analyzer), which is a cloud-based REST service that provides advanced sentiment analysis, including detection of emotional, language, and social tone. Even though the Tone Analyzer is not open source, a free version that can be used for development and trial is available on IBM Cloud (https://www.ibm.com/cloud).

Our architecture now looks like this:

Twitter sentiment analysis data pipeline architecture

In the preceding diagram, we can break down the workflow into the following steps:

  1. Produce a stream of tweets and publish them into a Kafka topic, which can be thought of as a channel that groups events together. In turn, a receiver component can subscribe to this topic/channel to consume these events.

  2. Enrich the tweets with emotional, language, and social tone scores: use Spark Streaming to subscribe to Kafka topics from component 1 and send the text to the Watson Tone Analyzer service. The resulting tone scores are added to the data for further downstream processing. This component was implemented using Scala and, for convenience, was run using a Jupyter Scala Notebook.

  3. Data analysis and exploration: For this part, we decided to go with a Python Notebook simply because Python offers a more attractive ecosystem of libraries, especially around data visualizations.

  4. Publish results back to Kafka.

  5. Implement a real-time dashboard as a Node.js application.
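To make the flow of steps 1 and 2 more concrete, here is a minimal, purely in-memory sketch of the publish, subscribe, and enrich pattern. It is not the code we ran: the InMemoryBroker class, the topic names, and the fake_tone_score function are hypothetical stand-ins for Kafka, our topics, and the Watson Tone Analyzer call, respectively.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a message broker: one FIFO queue per topic."""

    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, event):
        """Append an event to the given topic."""
        self._topics[topic].append(event)

    def consume(self, topic):
        """Yield and remove every event currently queued on the topic."""
        queue = self._topics[topic]
        while queue:
            yield queue.popleft()


def fake_tone_score(text):
    """Hypothetical placeholder for the remote tone service (step 2)."""
    positive = {"love", "great", "happy"}
    words = [w.strip(".,!?#").lower() for w in text.split()]
    hits = sum(1 for w in words if w in positive)
    return {"joy": hits / len(words) if words else 0.0}


def enrich(broker, source_topic, target_topic):
    """Read raw tweets, attach a tone score, and publish them downstream."""
    for tweet in broker.consume(source_topic):
        enriched = dict(tweet, tone=fake_tone_score(tweet["text"]))
        broker.publish(target_topic, enriched)


broker = InMemoryBroker()

# Step 1: produce tweets into a topic
broker.publish("tweets", {"id": 1, "text": "I love this great library!"})
broker.publish("tweets", {"id": 2, "text": "Build failed again."})

# Step 2: subscribe, enrich, and publish to a second topic
enrich(broker, "tweets", "enriched_tweets")

# Step 3 onward would consume the enriched events
results = list(broker.consume("enriched_tweets"))
for r in results:
    print(r["id"], r["tone"])
```

The point of the sketch is only the decoupling: the producer and the enrichment step never call each other directly, they communicate through named topics, which is what allowed us to implement each component in a different language.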

With a team of three people, it took us about 8 weeks to get the dashboard working with real-time Twitter sentiment data. There are multiple reasons for this seemingly long time:

  • Some of the frameworks and services, such as Kafka and Spark Streaming, were new to us and we had to learn how to use their APIs.

  • The dashboard frontend was built as a standalone Node.js application using the Mozaïk framework (https://github.com/plouc/mozaik), which made it easy to build powerful live dashboards. However, we found a few limitations with the code, which forced us to dive into the implementation and write patches, hence adding delays to the overall schedule.

The results are shown in the following screenshot:

Twitter sentiment analysis real-time dashboard

 

Lessons learned from building our first enterprise-ready data pipeline


Leveraging open source frameworks, libraries, and tools definitely helped us be more efficient in implementing our data pipeline. For example, Kafka and Spark were pretty straightforward to deploy and easy to use, and when we were stuck, we could always rely on the developer community for help by using, for example, question and answer sites, such as https://stackoverflow.com.

Using a cloud-based managed service for the sentiment analysis step, such as the IBM Watson Tone Analyzer (https://www.ibm.com/watson/services/tone-analyzer), was another positive. It allowed us to abstract out the complexity of training and deploying a model, making the whole step more reliable and certainly more accurate than if we had implemented it ourselves.

It was also super easy to integrate as we only needed to make a REST request (also known as an HTTP request, see https://en.wikipedia.org/wiki/Representational_state_transfer for more information on REST architecture) to get our answers. Most modern web services now conform to the REST architecture; however, we still need to know the specification for each of the APIs, which can take a long time to get right. This step is usually made simpler by using an SDK library, which is often provided for free and in most popular languages, such as Python, R, Java, and Node.js. SDK libraries provide higher-level programmatic access to the service by abstracting out the code that generates the REST requests. The SDK would typically provide a class to represent the service, where each method would encapsulate a REST API while taking care of user authentication and other headers.
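The SDK pattern described above can be sketched with the Python standard library alone. The class below is a hypothetical illustration, not the Watson SDK: the ToneServiceClient name, the /v1/tone path, the bearer-token header, and the payload shape are all assumptions made for the example.

```python
import json
import urllib.request


class ToneServiceClient:
    """Hypothetical SDK-style wrapper: one method per REST endpoint."""

    def __init__(self, base_url, api_key):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def _build_request(self, path, payload):
        """Create the HTTP request, adding authentication and content headers."""
        return urllib.request.Request(
            url=f"{self.base_url}{path}",
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
            method="POST",
        )

    def analyze_tone_request(self, text):
        """Build, but do not send, the request for the (assumed) tone endpoint."""
        return self._build_request("/v1/tone", {"text": text})

    def analyze_tone(self, text):
        """Send the request and decode the JSON response."""
        with urllib.request.urlopen(self.analyze_tone_request(text)) as response:
            return json.loads(response.read().decode("utf-8"))


# The URL is a placeholder, so we only inspect the request instead of sending it
client = ToneServiceClient("https://example.invalid/api", api_key="my-key")
request = client.analyze_tone_request("I love this library")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

The caller only sees analyze_tone(text); the URL construction, JSON encoding, and authentication header stay inside the class, which is the convenience a real SDK provides.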

On the tooling side, we were very impressed with Jupyter Notebooks, which provided excellent features, such as collaboration and full interactivity (we'll cover Notebooks in more detail later on).

Not everything was smooth though, as we struggled in a few key areas:

  • Which programming language to choose for some of the key tasks, such as data enrichment and data analysis. We ended up using Scala and Python, even though the team had little experience with them, mostly because they are very popular among data scientists and also because we wanted to learn them.

  • Creating visualizations for data exploration was taking too much time. Writing a simple chart with a visualization library, such as Matplotlib or Bokeh, required writing too much code. This, in turn, got in the way of our need for fast experimentation.

  • Operationalizing the analytics into a real-time dashboard was too hard to be scalable. As mentioned before, we needed to write a full-fledged standalone Node.js application that consumed data from Kafka and had to be deployed as a Cloud Foundry application (https://www.cloudfoundry.org) on the IBM Cloud. Understandably, this task took quite a long time to complete the first time, but we also found the application difficult to update: changes in the analytics that write data to Kafka needed to be synchronized with changes in the dashboard application.

 

Data science strategy


If data science is to continue to grow and graduate into a core business activity, companies must find a way to scale it across all layers of the organization and overcome all the difficult challenges we discussed earlier. To get there, we identified three important pillars that architects planning a data science strategy should focus on, namely, data, services, and tools:

Three pillars of data science at scale

  • Data is your most valuable resource: You need a proper data strategy to make sure data scientists have easy access to the curated contents they need. Properly classifying the data, setting appropriate governance policies, and making the metadata searchable will reduce the time data scientists spend acquiring the data and then asking for permission to use it. This will not only increase their productivity, it will also improve their job satisfaction as they will spend more time doing actual data science.

    Setting a data strategy that enables data scientists to easily access high-quality data that's relevant to them increases productivity and morale and ultimately leads to a higher rate of successful outcomes.

  • Services: Every architect planning for data science should be thinking about a service-oriented architecture (SOA). Contrary to traditional monolithic applications where all the features are bundled together into a single deployment, a service-oriented system breaks down functionalities into services that are designed to do a few things but to do them very well, with high performance and scalability. These services are then deployed and maintained independently from each other, giving scalability and reliability to the whole application infrastructure. For example, you could have a service that runs algorithms to create a deep learning model, another that persists the models and lets applications run them to make predictions on customer data, and so on.

    The advantages are obvious: high reusability, easier maintenance, reduced time to market, scalability, and much more. In addition, this approach would fit nicely into a cloud strategy giving you a growth path as the size of your workload increases beyond existing capacities. You also want to prioritize open source technologies and standardize on open protocols as much as possible.

    Breaking processes into smaller functions infuses scalability, reliability, and repeatability into the system.

  • Tools do matter! Without the proper tools, some tasks become extremely difficult to complete (at least that's the rationale I use to explain why I fail at fixing stuff around the house). However, you also want to keep the tools simple, standardized, and reasonably integrated so they can be used by less skilled users (even if I were given the right tool, I'm not sure I would be able to complete the house-fixing task unless the tool were simple enough to use). Once you decrease the learning curve to use these tools, non-data scientist users will feel more comfortable using them.

    Making the tools simpler to use contributes to breaking the silos and increases collaboration between data science, engineering, and business teams.

 

Jupyter Notebooks at the center of our strategy


In essence, Notebooks are web documents composed of editable cells that let you run commands interactively against a backend engine. As their name indicates, we can think of them as the digital version of a paper scratch pad used to write notes and results about experiments. The concept is very powerful and simple at the same time: a user enters code in the language of their choice (most implementations of Notebooks support multiple languages, such as Python, Scala, R, and many more), runs the cell, and gets the results interactively in an output area below the cell that becomes part of the document. Results could be of any type: text, HTML, and images, which is great for graphing data. It's like working with a traditional REPL (short for Read-Eval-Print Loop) program on steroids, since the Notebook can be connected to powerful compute engines (such as Apache Spark (https://spark.apache.org) or Python Dask (https://dask.pydata.org) clusters), allowing you to experiment with big data if needed.

Within Notebooks, any classes, functions, or variables created in a cell are visible in the cells below, enabling you to write complex analytics piece by piece, iteratively testing your hypotheses and fixing problems before moving on to the next phase. In addition, users can also write rich text using the popular Markdown language or mathematical expressions using LaTeX (https://www.latex-project.org/), to describe their experiments for others to read.
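The way state flows from one cell to the next can be imitated in a few lines of plain Python. This is only a rough simulation under the assumption that each cell is a string of source code executed against one shared namespace; it is not how a Jupyter Kernel is actually implemented, and the cells list and its contents are made up for the example.

```python
import contextlib
import io

# Each string plays the role of one Notebook cell
cells = [
    "rate = 0.5",
    "def discount(price):\n    return price * (1 - rate)",
    "print(discount(80))",
]

namespace = {}   # shared state, visible to every later cell
outputs = []     # captured output area of each cell

for source in cells:
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Run the cell against the shared namespace
        exec(source, namespace)
    outputs.append(buffer.getvalue())

print(outputs)
```

The third cell can call discount, which in turn reads rate, even though both were defined in earlier cells. Rerunning only the first cell with a different rate and then the third cell would change the result, which is the piece-by-piece iteration described above.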

The following figure shows parts of a sample Jupyter Notebook with a Markdown cell explaining what the experiment is about, a code cell written in Python to create 3D plots, and the actual 3D charts results:

Sample Jupyter Notebook

Why are Notebooks so popular?

In the last few years, Notebooks have seen a meteoric growth in popularity as the tool of choice for data science-related activities. There are multiple reasons that can explain it, but I believe the main one is their versatility, which makes them an indispensable tool not just for data scientists but also for most of the personas involved in building data pipelines, including business analysts and developers.

For data scientists, Notebooks are ideal for iterative experimentation because they enable them to quickly load, explore, and visualize data. Notebooks are also an excellent collaboration tool; they can be exported as JSON files and easily shared across the team, allowing experiments to be identically repeated and debugged when needed. In addition, because Notebooks are also web applications, they can be easily integrated into a multiuser cloud-based environment, providing an even better collaborative experience.
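Because a Notebook is just a JSON document, it can be created, saved, and inspected with the standard json module. The sketch below assumes the version 4 notebook format (the field names should be checked against the nbformat documentation for the version you use), and the cell contents are invented for the example.

```python
import json

# A minimal two-cell notebook, assuming the version 4 notebook format
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Experiment\n", "Load and inspect the data."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file
serialized = json.dumps(notebook, indent=1)

# A teammate receiving the file can load it back and walk through the cells
restored = json.loads(serialized)
cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)
```

Since the code, the narrative, and the stored outputs all travel in the same file, sharing the JSON is enough for someone else to rerun the same experiment.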

These environments can also provide on-demand access to large compute resources by connecting the Notebooks with clusters of machines using frameworks such as Apache Spark. Demand for these cloud-based Notebook servers is rapidly growing and, as a result, we're seeing an increasing number of SaaS (short for Software as a Service) solutions, both commercial with, for example, IBM Data Science Experience (https://datascience.ibm.com) or Databricks (https://databricks.com/try-databricks), and open source with JupyterHub (https://jupyterhub.readthedocs.io/en/latest).

For business analysts, Notebooks can be used as presentation tools that in most cases provide enough capabilities with their Markdown support to replace traditional PowerPoints. Charts and tables generated can be directly used to effectively communicate results of complex analytics; there's no need to copy and paste anymore, plus changes in the algorithms are automatically reflected in the final presentation. For example, some Notebook implementations, such as Jupyter, provide an automated conversion of the cell layout to a slideshow, making the whole experience even more seamless.

Note

For reference, here are the steps to produce these slides in Jupyter Notebooks:

  • Using the View | Cell Toolbar | Slideshow menu, first annotate each cell by choosing between Slide, Sub-Slide, Fragment, Skip, or Notes.

  • Use the jupyter nbconvert command to convert the Notebook into a Reveal.js-powered HTML slideshow:

      jupyter nbconvert <pathtonotebook.ipynb> --to slides

  • Optionally, you can fire up a web application server to access these slides online:

      jupyter nbconvert <pathtonotebook.ipynb> --to slides --post serve
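The annotation chosen in the Cell Toolbar ends up in the metadata of each cell, so it can also be set programmatically. The sketch below assumes that the slide type is stored under a slideshow entry with a slide_type key in the cell metadata; the set_slide_type helper and the two sample cells are hypothetical.

```python
ALLOWED_SLIDE_TYPES = {"slide", "subslide", "fragment", "skip", "notes"}


def set_slide_type(cell, slide_type):
    """Annotate a notebook cell (a plain dict) with a slideshow role."""
    if slide_type not in ALLOWED_SLIDE_TYPES:
        raise ValueError(f"unsupported slide type: {slide_type!r}")
    # Assumed location of the annotation: cell["metadata"]["slideshow"]
    cell.setdefault("metadata", {})["slideshow"] = {"slide_type": slide_type}
    return cell


title_cell = {"cell_type": "markdown", "metadata": {}, "source": ["# Results"]}
scratch_cell = {
    "cell_type": "code",
    "execution_count": None,
    "metadata": {},
    "outputs": [],
    "source": ["x = 1"],
}

set_slide_type(title_cell, "slide")    # starts a new slide
set_slide_type(scratch_cell, "skip")   # left out of the slideshow
print(title_cell["metadata"])
```

This can be handy when the same set of Notebooks has to be turned into slides repeatedly, since the annotations no longer need to be clicked through by hand.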

For developers, the situation is much less clear-cut. On the one hand, developers love REPL programming, and Notebooks offer all the advantages of an interactive REPL with the added bonus that they can be connected to a remote backend. By virtue of running in a browser, results can contain graphics and, since they can be saved, all or part of the Notebook can be reused in different scenarios. So, for a developer, provided that your language of choice is available, Notebooks offer a great way to try and test things out, such as fine-tuning an algorithm or integrating a new API. On the other hand, there is little Notebook adoption by developers for data science activities that can complement the work being done by data scientists, even though they are ultimately responsible for operationalizing the analytics into applications that address customer needs.

To improve the software development life cycle and reduce time to value, they need to start using the same tools, programming languages, and frameworks as data scientists, including Python with its rich ecosystem of libraries and Notebooks, which have become such an important data science tool. Granted, developers have to meet the data scientists in the middle and get up to speed on the theory and concepts behind data science. Based on my experience, I highly recommend using MOOCs (short for Massive Open Online Courses) such as Coursera (https://www.coursera.org) or EdX (http://www.edx.org), which provide a wide variety of courses for every level.

However, having used Notebooks quite extensively, it is clear to me that, while being very powerful, they are primarily designed for data scientists, leaving developers with a steep learning curve. They also lack application development capabilities that are so critical for developers. As we've seen in the Sentiment analysis of Twitter Hashtags project, building an application or a dashboard based on the analytics created in a Notebook can be very difficult, requiring an architecture that is hard to implement and that has a heavy footprint on the infrastructure.

It is to address these gaps that I decided to create the PixieDust (https://github.com/ibm-watson-data-lab/pixiedust) library and open source it. As we'll see in the next chapters, the main goal of PixieDust is to lower the cost of entry for new users (whether it be data scientists or developers) by providing simple APIs for loading and visualizing data. PixieDust also provides a developer framework with APIs for easily building applications, tools, and dashboards that can run directly in the Notebook and also be deployed as web applications.

 

Summary


In this chapter, I gave my perspective on data science as a developer, discussing the reasons why I think that data science along with AI and Cloud has the potential to define the next era of computing. I also discussed the many problems that must be addressed before it can fully realize its potential. While this book doesn't pretend to provide a magic recipe that solves all these problems, it does try to answer the difficult but critical question of democratizing data science and more specifically bridging the gap between data scientists and developers.

In the next few chapters, we'll dive into the PixieDust open source library and learn how it can help Jupyter Notebooks users be more efficient when working with data. We'll also take a deep dive into the PixieApp application development framework that enables developers to leverage the analytics implemented in the Notebook to build applications and dashboards.

In the remaining chapters, we will deep dive into many examples that show how data scientists and developers can collaborate effectively to build end-to-end data pipelines, iterate on the analytics, and deploy them to end users in a fraction of the time. The sample applications will cover many industry use cases, such as image recognition, social media, and financial data analysis, which include data science use cases like descriptive analytics, machine learning, natural language processing, and streaming data.

We will not discuss in depth the theory behind all the algorithms covered in the sample applications (which is beyond the scope of this book and would take more than one book to cover), but we will instead emphasize how to leverage the open source ecosystem to rapidly complete the task at hand (model building, visualization, and so on) and operationalize the results into applications and dashboards.

Note

The provided sample applications are written mostly in Python and come with complete source code. The code has been extensively tested and is ready to be re-used and customized in your own projects.

About the Author

  • David Taieb

    David Taieb is the Distinguished Engineer for the Watson and Cloud Platform Developer Advocacy team at IBM, leading a team of avid technologists on a mission to educate developers on the art of the possible with data science, AI and cloud technologies. He's passionate about building open source tools, such as the PixieDust Python Library for Jupyter Notebooks, which help improve developer productivity and democratize data science. David enjoys sharing his experience by speaking at conferences and meetups, where he likes to meet as many people as possible.


Latest Reviews

(4 reviews total)
Nearly half of this book is devoted to Pixie tools and should have been mentioned in the title. Otherwise, some useful examples in NLP.
Practical in Learning data science.
The book is full of valuable insight and tools for productivity. I'd recommend it to anyone looking to dive into data analysis.
