Reproducible Data Science with Pachyderm

By Svetlana Karslioglu

About this book

Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale. You’ll begin your journey by exploring the importance of data reproducibility and comparing different data science platforms. Next, you’ll explore how Pachyderm fits into the picture and its significance, followed by learning how to install Pachyderm locally on your computer or a cloud platform of your choice. You’ll then discover the architectural components and Pachyderm's main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and advances to cover common data operations, such as moving data to and from Pachyderm, to create more complex pipelines. Based on what you've learned, you'll develop an end-to-end ML workflow, before trying out the hyperparameter tuning technique and the different supported Pachyderm language clients. Finally, you’ll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks. By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and be able to manage them on a day-to-day basis.

Publication date: March 2022
Publisher: Packt
Pages: 364
ISBN: 9781801074483
Download code from GitHub

Chapter 1: The Problem of Data Reproducibility

Today, machine learning algorithms are used everywhere. They are integrated into our day-to-day lives, and we use them without noticing. While we are rushing to work, planning a vacation, or visiting a doctor's office, the models are at work, at times making important decisions about us. If we are unsure what the model is doing and how it makes decisions, how can we be sure that its decisions are fair and just? Pachyderm profoundly cares about the reproducibility of data science experiments and puts data lineage, reproducibility, and version control at its core. But before we proceed, let's discuss why reproducibility is so important.

This chapter explains the concepts of reproducibility, ethical Artificial Intelligence (AI), and Machine Learning Operations (MLOps), as well as providing an overview of the existing data science platforms and how they compare to Pachyderm.

In this chapter, we're going to cover the following main topics:

Why is reproducibility important?
The reproducibility crisis in science
Demystifying MLOps
Types of data science platforms
Explaining ethical AI

Why is reproducibility important?

First of all, let's define AI, ML, and data science.

Data science is a field of study that involves collecting and preparing large amounts of data to extract knowledge and produce insights.

AI is more of an umbrella term for technology that enables machines to mimic the behavior of human beings. Machine learning is a subset of AI that is based on the idea that an algorithm can learn based on past experiences.

Now, let's define reproducibility. A data science experiment is considered reproducible if other data scientists can repeat it with a comparable outcome on a similar dataset and problem. And although reproducibility has been a pillar of scientific research for decades, it has only recently become an important topic in data science.

Not only is a reproducible experiment more likely to be free of errors, but it also takes the experiment further and allows others to build on top of it, contributing to knowledge transfer and speeding up future discoveries.

It's no secret that data science has become one of the hottest topics of the last 10 years. Many big tech companies have opened dozens of high-paying data scientist, data engineer, and data analyst positions. With that, interest in joining the profession has risen sharply. According to the AI Index 2019 Annual Report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the number of AI papers has grown threefold in the last 20 years. You can read more about this report on the Stanford University HAI website: https://hai.stanford.edu/blog/introducing-ai-index-2019-report.

Figure 1.1 – AI publications trend, from the AI Index 2019 Annual Report (p. 5)

Almost every learning platform and university now offers a data science or AI program, and these programs never lack students. Thousands of people of all backgrounds, from software developers to CEOs, take ML classes to keep up with the rapidly growing industry.

The number of AI conferences has been steadily growing as well. Even in the pandemic world, where in-person events have become impossible, the AI community has continued to meet in a virtual format. Such flagship conferences as Neural Information Processing Systems (NeurIPS) and International Conference on Machine Learning (ICML), which typically attract more than 10,000 visitors, took place online with significant attendance.

According to some predictions, the AI market size will increase to more than $350 billion by 2025. The market grew from $12 billion to $58 billion from 2020 to 2021 alone. The Silicon Valley tech giants are fiercely battling to achieve dominance in the space, while smaller players emerge to get their share of the market. The number of AI start-ups worldwide is steadily growing, with billions being invested in them each year.

The following graph shows the growth of private investment in AI-related start-ups in recent years:

Figure 1.2 – Total private investment in AI-related start-ups worldwide, from the AI Index 2019 Annual Report (p. 88)

The total private investment in AI start-ups grew by more than 30 times in the last 10 years.

And another interesting metric from the same source is the number of AI patents published between 2015 and 2018:

Figure 1.3 – Total number of AI patents (2015-2018), from the AI Index 2019 Annual Report (p. 32)

The United States is leading in the number of published patents among other countries.

These trends boost the economy and industry but inevitably affect the quality of submitted AI papers, processes, practices, and experiments. That's why a proper process is needed to ensure the validation of data science models. The replication of experiments is an important part of a data science model's quality control.

Next, let's learn what a model is.

What is a model?

Let's define what a model is. A data science or AI model is a simplified representation of a process that also suggests possible results. Whether it is a weather-prediction algorithm or a website attendance calculator, a model provides the most probable outcome and helps us make informed decisions. When a data scientist creates a model, they need to make decisions about the critical parameters that must be included in that model because they cannot include everything. Therefore, a model is a simplified version of a process. And that's when sacrifices are made based on the data scientist's or organization's definition of success.

The following diagram demonstrates a data model:

Figure 1.4 – Data science model

Every model needs a continuous data flow to improve and perform correctly. Consider the Amazon Go stores where shoppers' behavior is analyzed by multiple cameras inside the store. The models that ensure safety in the store are trained continuously on real-life customer behavior. These models had to learn that sometimes shoppers might pick up an item and then change their mind and put it back; sometimes shoppers can drop an item on the floor, damaging the product, and so on. The Amazon Go store model is likely good because it has access to a lot of real data, and it improves over time. However, not all models have access to real data, and that's when a synthetic dataset can be used.

A synthetic dataset is a dataset that was generated artificially by a computer. The problem with synthetic data is that it is only as good as the algorithm that generated it. Often, such data misrepresents the real world. In some cases, such as when users' privacy prevents data scientists from using real data, usage of a synthetic dataset is justified; in other cases, it can lead to negative results.

IBM's Watson was an ambitious project that promised to revolutionize healthcare by diagnosing patients from a provided list of symptoms in a matter of seconds. Such a system could greatly speed up the diagnosis process, and in places where people have no access to healthcare, it could save many lives. Unfortunately, although the original promise was to replace doctors, Watson remains a recommendation system that can assist in diagnosing, but nothing more than that. One of the reasons is that Watson was trained on a synthetic dataset and not on real data.

There are cases when detecting issues in a trained model can be especially difficult. Take the example of an image recognition algorithm developed by the University of Washington that was built to identify whether an image had a husky portrayed in it or a wolf. The model was seemingly working really well, predicting the correct result with almost 90% accuracy. However, when the scientists dug a bit deeper into the algorithm and data, they learned that the model was basing its predictions on the background. The majority of images with huskies had grass in the background, while the majority of images with wolves had snow in the background.

The main principles of reproducibility

How can you ensure that a data science process in your company adheres to the principles of reproducibility? Here is a list of the main principles of reproducibility:

Use open data: The data that is used for training models should not be a black box. It has to be available to other data scientists in an unmodified state.
Train the model on many examples: Information about the experiments, including the number of examples the model was trained on, must be available for review.
Rigorously document the process: The process of data modifications, statistical failures, and experiment performance must be thoroughly documented so that the author and other data scientists can reproduce the experiment in the future.
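One lightweight way to act on these principles is to store a small manifest next to every experiment run. The following is only a minimal sketch using the Python standard library: the function names, the manifest fields, and the toy "experiment" are hypothetical and are not part of Pachyderm or any other tool discussed here.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a SHA-256 digest of the dataset so others can check it is unmodified."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(rows, seed):
    """Toy stand-in for an experiment: sample half the rows and average them."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    sample = rng.sample(rows, k=len(rows) // 2)
    return sum(sample) / len(sample)


def build_manifest(rows, seed):
    """Collect the details a reviewer would need to repeat the run."""
    return {
        "data_sha256": fingerprint(rows),  # open data: verifiable input
        "n_examples": len(rows),           # how many examples were used
        "seed": seed,                      # documented source of randomness
        "result": run_experiment(rows, seed),
    }


data = list(range(100))
manifest = build_manifest(data, seed=42)
print(json.dumps(manifest, indent=2))
```

Because the seed and the data digest are recorded, another data scientist who runs the same code on the same data should get an identical manifest, and any silent change to the dataset shows up as a different digest.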

Let's consider a few examples where reproducibility, collaboration, and open data principles were not part of the experiment process.

A few years ago, a group of scientists at Duke University became wildly popular after making an ambitious claim that they could predict the course of lung cancer based on data collected from patients. The medical community was very excited about the prospect of such a discovery. However, a group of other scientists at the MD Anderson Cancer Center in Houston found severe errors in that research when they tried to reproduce the original result. They discovered mislabeling in the chemotherapy prediction model, mismatches between genes and gene-expression data, and other issues that would make correct treatment prescription based on the model calculations significantly less likely. While the flaws were eventually uncovered, it took almost 3 years and more than 2,000 working hours for the researchers to get to the bottom of the problem, which could have been easily avoided if proper research practices had been established in the first place.

Now let's look at how AI can go wrong based on a chatbot example. You might remember the infamous Microsoft chatbot called Tay. Tay was a bot that could learn from its conversations with internet users. When Tay went live, its first conversations were friendly, but overnight its language changed, and it started to post harmful, racist, and overall inappropriate responses. It learned from the users who taught it to be rude, and as the bot was designed to mirror human behavior, it did what it was created for. Why was it not racist from the very beginning, you might ask? The answer is that it was trained on clean, cherry-picked data that did not include vulgar and abusive language. But we cannot control the web and what people post, and the bot did not have any sense of morals programmed into it. This experiment raised many questions about AI ethics and how we can ensure that the AI that we build does not turn on us one day.

The new generation of chatbots is built on the recently released GPT-3 language model. These chatbots are trained with neural networks that, during training, form associations that are difficult to undo afterward. Although they use seemingly more advanced technology than their predecessors, they can still easily become racist and hateful depending on the data they are trained on. If a bot is trained on misogynist and hateful conversations, it will be offensive and will likely reply inappropriately.

As you can see, data science, AI, and machine learning are powerful technologies that help us solve many difficult problems, but at the same time, they can endanger their users and have devastating consequences. The data science community needs to work on devising better ways of minimizing adverse outcomes by establishing proper standards and processes to ensure the quality of data science experiments and AI software.

Now that we've seen why reproducibility is so important, let's look at what consequences it has on the scientific community and data science.

The reproducibility crisis in science

The reproducibility crisis is a problem that has been around for more than a decade. Because data science is closely related to the sciences, it is important to review the issues many scientists have outlined in the past and correlate them with similar problems the data science space is facing today.

One of the most important issues is replication. The ability to reproduce the results of a scientific experiment has long been one of the founding principles of good research. In other words, if an experiment can be reproduced, it is valid, and if not, it could be a one-time occurrence that does not represent real phenomena. Unfortunately, in recent years, more and more research papers in sociology, medicine, biology, and other areas of science cannot withstand retesting against an increased number of samples, even if these papers were published in well-known and trustworthy scientific journals, such as Nature. This tendency could lead to public mistrust in science and AI as part of it.

As was mentioned previously, because of the popularity and growth of the AI industry, the number of AI papers has increased multiple times. Unfortunately, the quality of these papers does not grow with the number of papers itself.

Nature magazine recently conducted a survey asking scientists whether they feel that there is a reproducibility crisis in science. The majority agreed that false-positive results, driven by the pressure to publish frequently, definitely exist. Researchers need sponsorship, and sponsors need to see results before investing additional money in the research, which leads to many published papers with declining credibility. Ultimately, the fight for grants and bureaucracy are often named as the main causes of the lack of a reproducibility process in labs.

The research papers that were questioned for integrity have the following common attributes:

No code or data were publicly shared for other researchers to attempt to replicate the results.
The scientists who attempted to replicate the results failed completely or partially to do it by following the provided instructions.

Even the papers published by Nobel laureates can sometimes be questioned due to an inability to reproduce the results. For example, in 2014, Science magazine retracted a paper published by Nobel Prize winner and immunologist Bruce Beutler. His paper was about the response to pathogens by virus-like organisms in the human genome. This paper was cited over 50 times before it was retracted.

When COVID-19 became a major topic of 2020, multiple papers were published on it. According to Retraction Watch, a blog that tracks retracted scientific papers, more than 86 of them had been retracted as of March 2021.

In 2019, more than 1,400 science papers were retracted by multiple publishers. This number is huge and has been growing steadily, compared to only 50 papers in the early 2000s. This trend has raised awareness of the so-called reproducibility crisis in science. While not all papers are retracted because their results cannot be reproduced, that is often the reason.

Data fishing

Data fishing, or data dredging, is a method of achieving a statistically significant result by running the computation repeatedly until the desired result appears, then reporting only that result and ignoring the inconvenient ones. Sometimes, scientists unintentionally dredge the data to achieve the result they think is most probable and confirm their hypothesis. A more sinister scenario is possible too: a scientist might intentionally manipulate the result of the experiment to reach a predefined conclusion.

An example of such a misuse of data analysis would be if you decided to prove that there is a correlation between banana consumption and an increased level of IQ in children of age 10 and older. This is a completely made-up example, but say you wanted to establish this connection. You would need to get information about IQ level and banana consumption of a big enough sample of children – let's say 5,000.

Then, you would run tests, such as: do kids who eat bananas and exercise have a higher IQ level than the ones who only exercise? Do kids who watch TV and eat bananas have a higher level of IQ compared to the ones who do not? After conducting enough of these tests, you would most likely get some kind of correlation purely by chance. However, this result would not be meaningful, and data dredging is considered extremely unethical by the scientific community. Similar problems are now being seen in data science specifically.
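The effect described in the banana example can be simulated. The sketch below is an illustration under stated assumptions, not a reconstruction of any real study: it draws IQ scores for 5,000 imaginary children, invents 200 yes/no "habits" that are independent of IQ by construction, and tests each habit at the conventional 0.05 threshold using a normal approximation. All names, the number of habits, and the threshold are hypothetical choices.

```python
import random
from statistics import NormalDist


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    z = (mean_a - mean_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_children=5000, n_habits=200, alpha=0.05, seed=0):
    """Count how many unrelated habits look 'significant' purely by chance."""
    rng = random.Random(seed)
    iq = [rng.gauss(100, 15) for _ in range(n_children)]
    hits = 0
    for _ in range(n_habits):
        # Each habit is a coin flip per child, so it cannot influence IQ.
        has_habit = [rng.random() < 0.5 for _ in range(n_children)]
        with_habit = [s for s, h in zip(iq, has_habit) if h]
        without_habit = [s for s, h in zip(iq, has_habit) if not h]
        if two_sample_p_value(with_habit, without_habit) < alpha:
            hits += 1
    return hits


false_positives = dredge()
print(f"{false_positives} of 200 unrelated habits passed p < 0.05")
```

With a 0.05 threshold, roughly 5% of such tests are expected to pass even though no real effect exists, which is why reporting only the "successful" comparisons is misleading.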

Without conducting a full investigation, detecting data dredging might be difficult. Possible factors to look for include the following:

Was the research conducted by a reputable institution or group of scientists?
What does other research in similar areas suggest?
Is financial interest involved?
Is the claim sensational?

Without a proper process, data dredging and unreliable research will continue to be published. Recently, Nature magazine surveyed around 1,500 researchers from different areas of science, and more than 50% of respondents reported that they had tried and failed to reproduce the results of published research. Even more shockingly, in many cases, they failed to reproduce the results of their own experiments.

Out of all respondents, only 24% were able to successfully publish their reproduction attempts and the majority were never contacted with a request to reproduce someone else's research.

Of course, increasing the reproducibility of experiments is a costly problem and can double the time required to conduct an experiment, which many research laboratories might not be able to afford. But if it's added to the originally planned time for the research and has a proper process, it should not be as difficult or rigorous as adding it midway in the research lifecycle.

Even worse, retracting a paper after it was published can be a tedious task. Some publishers even charge researchers a significant amount of money if a paper is retracted. Such practices are truly discouraging.

All of this negatively impacts research all over the world and results in growing mistrust in science. Organizations must take steps to improve processes in their scientific departments and scientific journals must raise the bar of publishing research.

Now that we have learned about data fishing, let's review better reproducibility guidelines.

Better reproducibility in science research guidelines

The Center for Open Science (COS), a non-profit organization that focuses on supporting and promoting open-science initiatives, reproducibility, and integrity of scientific research, has published the Guidelines for Transparency and Openness Promotion (TOP) in Journal Policies and Practices, or the TOP Guidelines. These guidelines emphasize the importance of transparency in published research papers. Researchers can use them to justify the necessity of sharing research artifacts publicly to avoid any possible inquiries regarding the integrity of their work.

The main principles of the TOP Guidelines include the following:

Proper citation and credit to original authors: All text, code, and data artifacts that belong to other authors must be outlined in the paper and credit given as needed.
Data, methodology, and research material transparency: The authors of the paper must share the written code, methodology, and research materials in a publicly accessible location with instructions on how to access and use them.
Design and analysis transparency: The authors should be transparent about the methodology as much as possible, although this might vary by industry. At a minimum, they must disclose the standards that have been applied during the research.
Preregistrations of the research and analysis plans: Even if research does not get published, preregistration makes it more discoverable.
Reproducibility of obtained results: The authors must include sufficient details on how to reproduce the original results.

Each of these standards is rated either as not implemented or at one of three levels:

Not implemented—information is not included in the report
Level 1—available upon request
Level 2—access before publication
Level 3—verification before publication

Level 3 is the highest level of transparency that a metric can achieve. Having this level of transparency justifies the quality of submitted research. COS applies the TOP factor to rate a journal's efforts to ensure transparency and ultimately the quality of the published research.

Apart from data and code reproducibility, the environment and software used during the research often play a big role. New technologies, such as containers and virtual and cloud environments, make it easier to achieve uniformity in conducted research. Of course, in biochemistry or other fields that require precise lab conditions, achieving uniformity might be even more complex.
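Even without containers, a report can at least record the software environment in which the results were produced. The following sketch uses only the Python standard library; the function name and the list of packages are hypothetical, and a real project would list the packages it actually depends on.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Record interpreter, OS, and package versions for a research report."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
        "packages": versions,
    }


# The second name is deliberately fake to show how missing packages are reported.
env = snapshot_environment(["pip", "a-package-that-is-not-installed"])
print(json.dumps(env, indent=2))
```

A snapshot like this does not make the environment reproducible by itself, but it tells a reviewer which versions to install, or which container image to build, before attempting to repeat the experiment.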

Now let's learn about common practices that help improve reproducibility.

Common practices to improve reproducibility

Thanks to the work of reproducibility advocates and the problem being widely discussed in scientific communities in recent years, some positive tendencies in increasing reproducibility seem to be emerging. These practices include the following:

Request a colleague to reproduce your work.
Develop extensive documentation.
Standardize research methodology.
Preregister your research before publication to avoid data cherry-picking.

There are scientific groups that make it their mission to reproduce published results and notify researchers about mistakes in their papers. Their typical process is to try to reproduce the result of a paper and write a letter to the researchers or lab to request a correction or retraction. Some researchers willingly collaborate and correct the mistakes in the paper, but in other cases, the process is unclear and difficult. One such group has identified the following problems in the 25 papers that they analyzed:

Lack of process or point of contact regarding to whom they should address feedback on a paper. Scientific journals do not provide a clear statement on whether feedback can be addressed to the chief editor or whether there is a feedback submission form of some sort.
Scientific journal editors accept and act on submissions unwillingly. In some cases, it might take up to a year to publish a warning on a paper that has received critical feedback, even if it was provided by a reputable institution.
Some publishers expect you to pay if you want to publish a correction letter and delay retractions.
Raw data is not always available publicly. In many cases, publishers did not have a unified process around a shared location for the raw data used in the research. If you have to directly contact an author, you might not be able to get the requested information and it might significantly delay the process. Moreover, they can simply deny such a request.

The lack of a standard for submitting corrections and retracting research papers contributes to the overall reproducibility crisis and hinders knowledge sharing. Papers that used data dredging and other techniques to manipulate results will become a source of information for future researchers, spreading misinformation further. Researchers, publishers, and editors should work together on establishing unified post-publication review guidelines that encourage other scientists to participate in testing and providing feedback.

We've learned how reproducibility affects the quality of research. Now, let's review how organizations can establish a process that keeps their data science experiments in line with industry best practices.

Demystifying MLOps

This section defines Machine Learning Operations (MLOps) and describes why it is crucial to establish a reliable MLOps process within your data science department.

In many organizations, data science departments have been created fairly recently, in the last few years. The profession of data scientist is fairly new as well. Therefore, many of these departments have to find a way to integrate into the existing corporate process and devise ways to ensure the reliability and scalability of data science deliverables.

In many cases, the burden of building a suitable infrastructure falls on the shoulders of the data scientists themselves, who are often not as familiar with the latest infrastructure trends. Another problem is how to make it all work for different languages, platforms, and environments. In the end, data scientists spend more time building the infrastructure than working on the model itself. This is where a new discipline has emerged to help bridge the gap between data science and infrastructure.

MLOps is a set of practices that defines the machine learning development process as a lifecycle, identifying its stages and ensuring the reliability of the data science process. Although the term was coined fairly recently, most data scientists agree that a successful MLOps process should adhere to the following principles:

Collaboration: This principle implies that everything that goes into developing an ML model must be shared among data scientists to preserve knowledge.
Reproducibility: This principle implies that not only the code but also the datasets, metadata, and parameters should be versioned and reproducible for all production models.
Continuity: This principle implies that the lifecycle of a model is a continuous process in which the lifecycle stages repeat and the model improves with each iteration.
Testability: This principle implies that the organization implements ML testing and monitoring practices to ensure the model's quality.
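
To make the Reproducibility principle more concrete, here is a minimal sketch in Python. It assumes that a training run can be identified by hashing its code version, its dataset bytes, and its parameters together. The run_fingerprint helper and all sample values are hypothetical and do not come from any particular MLOps tool.

```python
import hashlib
import json


def run_fingerprint(code_version: str, dataset: bytes, params: dict) -> str:
    """Return a stable ID for one training run (hypothetical helper)."""
    digest = hashlib.sha256()
    digest.update(code_version.encode("utf-8"))
    # Hash the dataset on its own first so it contributes a fixed-length value.
    digest.update(hashlib.sha256(dataset).digest())
    # sort_keys makes the JSON text independent of dict insertion order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


# Illustrative values only.
data_v1 = b"age,income\n34,52000\n"
params = {"learning_rate": 0.01, "epochs": 10}

same_a = run_fingerprint("abc123", data_v1, params)
same_b = run_fingerprint("abc123", data_v1, {"epochs": 10, "learning_rate": 0.01})
changed = run_fingerprint("abc123", data_v1 + b"41,61000\n", params)

print(same_a == same_b)   # identical inputs give the same ID
print(same_a == changed)  # a change to the data gives a different ID
```

Storing an ID like this next to each production model is one way to tell whether two results came from exactly the same code, data, and parameters.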

Before we dive into the MLOps process stages, let's take a look at more established software development practices. DevOps is a software development practice that is used in many enterprise-level software projects. A typical DevOps lifecycle includes the following stages that continuously repeat, ensuring product improvement:

Planning: In this stage, the overall vision for the software is developed, and a more detailed design is devised.
Development: In this stage, the code is written, and the planned functionality is implemented. The code is shared through version control systems, such as Git, which ensures collaboration between software developers.
Testing: In this stage, the developed code is tested for defects through an automated or manual process.
Deployment: In this stage, the code is released to production servers, and the users have a chance to test it and provide feedback.
Monitoring: In this stage, the DevOps engineers focus on software performance and causes of outages, identifying possible areas of improvement.
Operations: This stage ensures the automated release of software updates.

The following diagram illustrates the DevOps lifecycle:

Figure 1.5 – DevOps Lifecycle

All these phases are continuously repeated, enabling communication between departments and a customer feedback loop. This practice has brought enterprises such benefits as a faster development cycle, better products, and continuous innovation. Better teamwork enabled by the close relationships between departments is one of the key factors that make this process efficient.

Data scientists deserve a process that brings the same level of reliability. One of the biggest problems of enterprise data science is that very few machine learning models make it to production. Many companies are just starting to adopt data science, and the new departments face unprecedented challenges. Often, the teams lack an understanding of the workflows that need to be implemented in order to make enterprise-level data science work.

Another important challenge is that, unlike in traditional software development, data scientists operate not only with code but also with data and parameters. Data is taken from the real world, while the code is carefully developed in a controlled environment. The only place they meet is the data model.

The challenges that all data science departments face include the following:

Inconsistent or totally absent data science processes
No way to track data changes and reproduce past results
Slow performance

In many enterprises, data science departments are still small and struggle to create a reliable workflow. Building such a process requires a mix of expertise: an understanding of traditional software practices, such as DevOps, and an understanding of data science challenges. That is where MLOps started to emerge, combining data science with best practices of software development.

If we try to apply similar DevOps practices to data science, here is what we might see:

Design: In this phase, data scientists work on acquiring the data and designing a data pipeline, also known as an Extract, Transform, Load (ETL) pipeline. A data pipeline is a sequence of transformation steps data goes through, which ends with an output result.
Development: In this stage, data scientists work on writing the algorithmic code for the previously developed data pipeline.
Training: In this stage, the model is trained with the selected or autogenerated data. During this stage, such techniques as hyperparameter tuning can be used.
Validation: In this stage, the trained model is validated to work with the rest of the data pipeline.
Deployment: In this stage, the trained and validated model is deployed into production.
Monitoring: In this stage, the model is constantly monitored for performance and possible flaws, and feedback is delivered directly to the data scientist for further improvement.
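
The stages above can be sketched as a toy Python loop. This is only an illustration under simplifying assumptions: the data is hard-coded, the "model" is a single least-squares coefficient, and every function name is hypothetical. The Development stage corresponds to writing the functions themselves.

```python
from statistics import mean


def design():
    """Design: acquire data and split it (toy, hard-coded values)."""
    points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
    return points[:4], points[4:]


def train(train_set):
    """Training: fit y = w * x by least squares."""
    return sum(x * y for x, y in train_set) / sum(x * x for x, _ in train_set)


def validate(w, holdout, max_error=0.5):
    """Validation: accept the model only if its mean absolute error is small."""
    return mean(abs(w * x - y) for x, y in holdout) <= max_error


def deploy(w):
    """Deployment: return a callable that stands in for a production endpoint."""
    return lambda x: w * x


def monitor(predict, live_points, max_error=0.5):
    """Monitoring: report whether the live error still meets the threshold."""
    return mean(abs(predict(x) - y) for x, y in live_points) <= max_error


train_set, holdout = design()
w = train(train_set)
if validate(w, holdout):
    predict = deploy(w)
    # A live observation that no longer follows the training trend.
    healthy = monitor(predict, [(6.0, 15.0)])
    print("retrain needed:", not healthy)
```

When monitoring reports that the model is no longer healthy, the loop starts again from the design stage with fresh data, which is the feedback loop the lifecycle describes.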

Similar to DevOps, the stages of MLOps are constantly repeated. The following diagram shows the stages of MLOps:

Figure 1.6 – MLOps Lifecycle

As you can see, the two practices are very similar, and the latter borrows the main concepts from the former. Using MLOps in practice has brought the following advantages to enterprise-level data science:

Faster go-to-market delivery: A data science model only has value when it is successfully deployed in production. With so many companies struggling to implement a proper process in their data science departments, an MLOps solution can genuinely make a difference.
Cross-team collaboration and communication: Software-development practices applied to data science create a common ground for developers, data scientists, and IT operations to work together and speak the same language.
Reproducibility and knowledge transfer: Keeping the code, the datasets, and the history of changes plays a big role in the improvement of overall model quality and enables data scientists to learn from each other's examples, contributing to innovation and feature development.
Automation: Automating a data pipeline helps to keep the process consistent across multiple releases and speeds up the promotion of a Proof of Concept (POC) model to a production-grade pipeline.

In this section, we've learned about the important stages of the MLOps process. In the next section, we will learn more about the types of data science platforms that can help you implement MLOps in your organization.

Types of data science platforms

This section walks you through the data science platforms that are available in the open source world and on the market today and will help you understand the difference between them.

As new fields of AI and machine learning emerge, more and more engineers are working on new ways of solving data science problems, creating an infrastructure for better, faster AI adoption. Some platforms provide end-to-end capabilities for data from a data warehouse all the way to production, while others offer partial functionality and work in combination with other tools. Generally, there is no solution that fits all use cases, and certainly not every budget.

However, all of these solutions completely or partially facilitate the following stages of a data science lifecycle:

Data Engineering
Data Acquisition and Transformation
Model Training
Model Deployment
Monitoring and Improvement

The following diagram shows the types of data science tools:

Figure 1.7 – Types of data science tools

Let's take a look at the existing data science platforms that can help you to build your data science workflow at scale.

End-to-end platforms

An end-to-end data science solution should be able to provide the tooling for all the stages of the ML lifecycle listed in the previous section. However, in some use cases, the definition of the end-to-end workflow could be different and might mostly work with the ML pipelines and projects, excluding the data engineering part. Since the definition may still fluctuate, it is likely that the end-to-end tools will continue to provide different functionalities as the field evolves.

If such a platform does exist, it should bring the following benefits:

A unified user interface that eliminates the need to stitch multiple interfaces together
Collaboration for all involved individuals, including data scientists, data engineers, and IT operations
The convenience of infrastructure support being offloaded to the solution provider, which offers the team additional time to focus on data models rather than on infrastructure problems

However, you might find the following disadvantages of an end-to-end platform to be inconsistent with your organization's goals:

Portability: Such a platform would likely be proprietary, and migration to a different platform would be difficult.
Price: An end-to-end platform will likely be subscription-based, which many data science departments might not be able to afford. If GPU-based workflows are involved, the price increases even more.
Bias: When you are using a proprietary solution that offers built-in pipelines, your models are bound to inherit bias from these automated tools. The problem is that bias might be difficult to recognize and address in automated ML solutions, which could potentially have negative consequences for your business.

Now that we are aware of the advantages and disadvantages of end-to-end data science platforms, let's consider the ones that are available on the market today. Because the AI field is developing rapidly, new platforms emerge every year. We'll look into the top five such platforms.

Big tech giants, such as Microsoft, Google, and Amazon, all offer automated ML features that a lot of users might find useful. Google's AI Platform offers Kubeflow pipelines to help manage ML workflows. Amazon offers tools that assist with hyperparameter tuning and labeling. Microsoft offers Azure Machine Learning services that support GPU-based workflows and are functionally similar to Amazon's services.

However, as stated previously, all these automated ML features are prone to bias and require the data science team to build additional tools that can verify model performance and reliability. For many organizations, automated ML is not the right answer. Another issue is vendor lock-in, as you will have to keep all your data in the underlying cloud storage.

The Databricks solution provides a more flexible approach as it can be deployed on any cloud. Databricks is based on Apache Spark, one of the most popular tools for AI and ML workflows, and offers end-to-end ML pipeline management through a platform called MLflow. MLflow enables data scientists to track their pipeline progress from model development to deployment to production. Many users enjoy the built-in notebook interface. One disadvantage is the lack of data visualization tools, which might be added in the future.

Algorithmia is another proprietary solution that can be deployed on any cloud platform and that provides an end-to-end ML workflow with model training, deployment, versioning, and other built-in functionality. It supports batch processing and can be integrated with GitHub Actions. While Algorithmia is a great tool, it also has some traditional software development tools built in, which some engineering teams might find redundant.

Pluggable solutions

While end-to-end platforms might sound like the right solution for your data science department, in reality, it is not always the case. Big companies often have requirements that end-to-end platforms cannot meet. These requirements might include the following:

Data security: Some companies might have privacy limitations on storing their data in the cloud. These limitations also apply to the use of automated ML features.
Pipeline outputs: Often, the final product of a pipeline is a library that is packaged and used in other projects within the organization.
Existing infrastructure constraints: Some existing infrastructure components might prevent the integration of an end-to-end platform. Some parts of the infrastructure might already exist and satisfy the user's needs.

Pluggable solutions give data infrastructure teams the flexibility to build their own solution, which also comes with the need to support it. However, most of the big companies end up doing just that.

Pluggable solutions can be divided into the following categories:

Data ingestion tools
Data transformation tools
Model serving tools
Data visualization and monitoring tools

Let's consider some of these tools, which can be combined together to build a data science solution.

Data ingestion tools

Data ingestion is the process of collecting data from all sources in your company, such as databases, social media, and other platforms, into a centralized location for further consumption by machine learning pipelines and other AI processes.

One of the most popular open source tools to ingest data is Apache NiFi, which can ingest data into Apache Kafka, an open source streaming platform. From there, data pipelines can consume the data for processing.

Among commercial cloud-hosted platforms, we can name Wavefront, which enables not only ingestion but data processing as well. Wavefront is notable for its ability to scale and support high query loads.

Data transformation tools

Data transformation is the process of running your code against the data you have. This includes training and testing models as part of a data pipeline. The tool should be able to consume the data from a centralized location. Tools such as TensorFlow and Keras provide extended functionality for this type of operation.

Pachyderm is a data transformation and pipeline tool as well, although its main value is in version control for large datasets. Unlike other transformation tools, Pachyderm gives data scientists the freedom to define their own pipelines and supports any language and library.

If you have taken any data science classes, chances are you have used MATLAB or Octave for model training. These tools provide a great playground to start exploring machine learning. However, when it comes to production-grade data science that requires continuous training, collaboration, version control, and model productization, these tools might not be the best choice. MATLAB and Octave are mainly for numerical computing for academic purposes. Another issue with platforms such as MATLAB is that they often use proprietary languages, while tools like Pachyderm support any language, including the most popular ones in the data science community.

Model serving tools

After you train your model and it gives satisfactory results, you need to think about moving that model into production, which often is convenient to do in the form of a REST API or through a table that is ingested into a database. Depending on the language that is used in your model, serving a REST API can be as easy as using a web framework such as Flask.
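
As a rough illustration of serving a model behind a REST endpoint, the following sketch uses only Python's standard library WSGI support instead of Flask, so that it runs without extra packages. With Flask, the routing and JSON parsing shown here would be handled by the framework. The predict function is a hypothetical stand-in for a trained model, and the /predict path, port, and feature names are assumptions.

```python
import io
import json
from wsgiref.simple_server import make_server


def predict(features):
    """Stand-in for a trained model: a fixed linear scoring rule."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]


def app(environ, start_response):
    """WSGI app that answers POST /predict with a JSON prediction."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"prediction": predict(payload)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def serve(port=8000):
    """Run the app on localhost until interrupted (not called here)."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()


def call(wsgi_app, path, payload):
    """Invoke the app in-process, without opening a network socket."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    statuses = []
    chunks = wsgi_app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], json.loads(b"".join(chunks))


status, result = call(app, "/predict", {"x1": 2, "x2": 3})
print(status, result)  # 200 OK {'prediction': 7.0}
```

Calling serve() would expose the same app over HTTP on localhost. The in-process call helper is only there so that the request and response shapes can be checked without starting a server.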

However, there are more advanced tools that give data scientists end-to-end control over the machine learning process. One such open source tool is Seldon. Seldon converts REST API endpoints into a production microservice, where you can easily promote each version of your model from staging to production.

Another tool that provides similar functionality is KFServing. Both solutions use Kubernetes' Custom Resource Definition (CRD) to define a Deployment class for model serving.

Often, in big companies, different teams are responsible for training models and serving models, and therefore, decisions can be made based on the team's familiarity and preference for one or the other solution.

Data monitoring tools

After the model is deployed in production, data scientists need to continue to receive feedback about model performance, possible bias, and other metrics. For example, if you have an e-commerce website with a recommendation system that suggests to users what to buy with the current order based on their past choices, you need to make sure that the system is still on track with the latest fashion trends. You might not know the trends, but the feedback loop should signal a decrease in model performance when it occurs.

Often, enterprises fail to employ a good monitoring solution for ML workflows, which can have a potentially devastating outcome for your business. Seldon Alibi is one of the tools that provide model inspection functionality, which enables data scientists to monitor models running in production and identify areas of improvement. Seldon Alibi provides outlier detection, which helps to discover anomalies; drift detection, which helps monitor changes in correlation between input and output data; and adversarial detection, which exposes malicious changes in the original data inputs.
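
To give a feel for what drift detection does, here is a deliberately simple sketch in Python that flags a shift in the mean of one input feature. It is not the Seldon Alibi API, and production detectors rely on more robust statistical tests. The function name, the threshold, and the sample values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_shift_detected(reference, live, z_threshold=3.0):
    """Flag drift when the live mean is far from the reference mean.

    Computes a z-score for the live sample mean, using the spread of the
    reference data to estimate the standard error.
    """
    standard_error = stdev(reference) / sqrt(len(live))
    z = abs(mean(live) - mean(reference)) / standard_error
    return z > z_threshold


# Feature values seen during training (illustrative numbers).
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1]
# Two windows of production data: one similar, one clearly shifted.
stable = [10.3, 9.7, 10.0, 10.4]
shifted = [13.0, 12.5, 13.4, 12.9]

print(mean_shift_detected(reference, stable))   # False
print(mean_shift_detected(reference, shifted))  # True
```

A check like this could run on each new window of production inputs, with a positive result feeding back to the data scientists as a signal to investigate or retrain.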

Fiddler is another popular tool that monitors a production model for integrity, bias, performance, and outlier anomalies.

Putting it all together

As you can see, there are multiple ways to create a production-grade data science solution, and one size likely will not fit all. Although end-to-end solutions provide the convenience of using one vendor, they also have multiple disadvantages and are likely equipped with inferior domain functionality compared to pluggable tools. Pluggable tools, on the other hand, require certain expertise and culture to be present in your organization, which will allow different departments, such as DevOps engineers, to collaborate with data scientists to build an optimal solution and workflow.

The next section will walk us through the ethical problems that plague modern AI applications, how they might affect your business, and what you can do about them.

Explaining ethical AI

This section describes aspects of ethical problems in AI and what organizations need to be aware of when they build artificial intelligence applications.

With AI and machine learning technologies becoming more widespread and accepted, it is easy to lose track of the data and the decision-making process origins. When an AI algorithm suggests which pair of shoes to buy based on your recent searches, it might not be a big deal. But suppose an AI algorithm is used to decide whether you qualify for a job, how likely you are to commit a crime, or whether you qualify for mortgage approval. In that case, it is essential to know how the algorithm was created, on which data it was trained, what was included in the dataset, and, more importantly, what was not. At a minimum, we need to question whether a proper process existed to validate the data used for producing the model. Not only is this the right thing to do, but it could also save your organization from undesirable legal consequences.

While AI applications bring certain advantages and improve the quality of our lives, they can make mistakes that sometimes can have adverse, and even devastating, effects on people's lives. These tendencies resulted in the emergence of ethical AI teams and ethical AI advocates in leading AI companies and big tech.

Ethical AI has been an increasingly discussed topic in the data science community over the last few years. According to the Artificial Intelligence Index Report 2019, ethics has been a steadily growing keyword in the total number of AI papers at leading AI conferences.

Figure 1.8 – Number of AI conference papers mentioning Ethics since 1970, from the AI Index 2019 Annual Report (p. 44)

Let's consider one of the most widely criticized AI technologies: facial recognition. A face recognition application can identify a person in an image or video. In recent years, this technology has become widespread and is now used in home security, authentication, and other areas. In 2018-2019, more than 1,000 newspaper articles worldwide mentioned facial recognition and data privacy. One such cloud-based facial recognition technology called Rekognition, developed by Amazon, has been used by police departments in a few states. The law enforcement departments used the software to search for suspects in a database and to analyze video surveillance, including the feed from police body cameras. An independent test showed that the software was biased against people of color: it falsely matched 28 Members of Congress to mugshots of people who had been arrested, and people of color were disproportionately represented among the false matches. The tool performed especially poorly on identifying women of color.

The problem with this and other facial recognition technologies is that they were trained on non-inclusive datasets that contained photographs of mostly white men. Such outcomes are difficult to predict, but predicting them is what ethical AI is trying to do. Implementing such a surveillance system in public places would negatively affect thousands of people. Advances in AI have produced facial recognition technology that requires little to no human involvement in subject identification. This raises the problem of total control and privacy. While a system like that could help identify criminals, possibly prevent crimes, and make our society safer, it needs to be thoroughly audited for potential errors and protected from misuse. With great power comes great responsibility.

Another interesting example is the use of Natural Language Processing (NLP) applications. NLP is an ML technology that enables machines to automatically interpret human language and, for example, translate text from one language to another. In recent years, NLP applications have seen major advances. Tools such as Google Translate solved a problem that was unsolvable even 20 years ago. NLP breaks down a sentence into chunks and tries to make connections between those chunks to provide a meaningful interpretation. NLP applications deal not only with translations but can also summarize what is written in a lengthy research paper or convert text to speech.

But these applications can make mistakes as well. One example was discovered in translations from Turkish to English. In the Turkish language, there is a single gender-neutral third-person pronoun, o, which can mean either she/her or he/his. It was discovered that Google Translate was discriminating based on gender, diminishing women's roles based on common stereotypes. For example, it would produce the translations "She is a secretary" and "He is a doctor", although in Turkish, both of these sentences could be written about a man or a woman.

From these examples, you can see that bias is one of the biggest problems of AI applications. A biased dataset is a dataset that does not include enough samples of a studied phenomenon to output an objective result, like in the facial-recognition example above, which did not have enough representatives of people of color to make a correct prediction.
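
One simple, partial check for this kind of bias is to measure how each group is represented in the training data before a model is trained. The sketch below assumes every sample carries a group label. The group names, the 20% threshold, and both function names are hypothetical, and a balanced dataset alone does not guarantee an unbiased model.

```python
from collections import Counter


def group_shares(labels):
    """Return the fraction of samples that belongs to each group."""
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def underrepresented(labels, expected_groups, min_share=0.2):
    """List expected groups whose share is below min_share.

    Groups that are missing from the data entirely count as a share of 0.
    """
    shares = group_shares(labels)
    return sorted(g for g in expected_groups if shares.get(g, 0.0) < min_share)


# Illustrative dataset: 85% group_a, 10% group_b, 5% group_c, no group_d.
sample = ["group_a"] * 85 + ["group_b"] * 10 + ["group_c"] * 5
expected = ["group_a", "group_b", "group_c", "group_d"]

print(group_shares(sample))
print(underrepresented(sample, expected))  # ['group_b', 'group_c', 'group_d']
```

Passing the list of expected groups explicitly matters because a group that is absent from the data would otherwise never show up in the counts at all.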

While many companies are becoming aware of the adverse effects and risks of bias in datasets, few of them are taking steps to mitigate the possible negative consequences. According to the Artificial Intelligence Index Report 2019, only 13% of organizations that responded were working toward improving the equity and fairness of the datasets used:

Figure 1.9 – Types of organizations taking steps to mitigate the risks of AI, from the AI Index 2019 Annual Report (p. 102)

Another aspect of bias is financial inequality. It's no secret that people from less economically advantaged backgrounds have a harder time getting credit deals than those from more fortunate backgrounds. Credit reports are known to have errors that cause higher borrowing rates.

Companies whose business is creating customer profiles, or personalization, go even further, collecting intimate information about users and their behavior from public records, credit card transactions, sweepstakes, and other sources. These reports can be sold to marketers and even law enforcement organizations. Individuals are categorized according to their sex, age, marital status, wealth, medical conditions, and other factors. Sometimes these reports have outdated information about things such as criminal records. In one case, an elderly woman could not get into a senior living facility because of an arrest on her record. Although she had been arrested, the incident was a case of domestic violence by her partner, and she was never prosecuted. She was able to correct her police records, but not the report created by a profiling company. Correcting mistakes in the reports created by these companies is extremely difficult, and they can affect people's lives for decades.

Sometimes, people get flagged because of a misidentified profile. Imagine that you are applying for a job and are denied because, according to your profile, you were prosecuted for theft or burglary in the past. This could come as a shock and might not make any sense, but cases like that do happen to people who have common names. To clear a mistake like that, you need the intervention of a person who is willing to spend time correcting it for you. But do you meet people like that often?

With machine learning now being used in customer profiling, many data privacy advocates question the methods being used in these algorithms. Because these algorithms learn from past experiences, they assume that anything you've done in the past, you are likely to repeat in the future. According to these algorithms, criminals will commit more crimes and the poor will get poorer. There is no room for mistakes in their reality. This means that people with prior convictions will likely be flagged as future offenders, which gives law enforcement a basis for discrimination. The opposite is also true: those with a perfect record, from a better neighborhood, are deemed unlikely to commit a crime. This does not sound fair.

The problem with recidivism models, which predict how likely a person is to reoffend, is that most of them are proprietary black boxes. A black-box model is an end-to-end model that is created by an algorithm directly from the provided data, so that even a data scientist cannot explain how it makes decisions. Since AI algorithms learn similarly to humans, a machine learning algorithm that evolves over time picks up the same biases that we have.

Figure 1.10 – Black-box model

Let's move on to the next section!

Trustworthy AI

While a few years ago, ethical AI was something only a few groups of independent advocates and academics were working on, today more and more big tech companies have established ethical AI departments to protect the companies from reputational and legal risks.

Establishing standards for trustworthy AI models is an ambitious task and one size does not fit all. However, the following principles apply to most cases:

Create an ethical AI committee that works on discussing the AI-associated risks in alignment with the overall company strategy.
Raise awareness of the dangers of non-transparent machine learning algorithms and the potential risks they pose to society and your organization.
Create a process for identifying, communicating, and evaluating biased models and privacy concerns. For example, in healthcare, protecting patient personal information is vitally important. Create ownership around ethical risk in the product management department.
Establish a process of notifying users about how their data will be used, explaining the risk of bias and other concepts in plain English. The earlier the user becomes aware of the implications of using your application, the less legal risk this will pose in the future.
Build a culture around praising efforts to promote ethical programs and initiatives to motivate employees to contribute to those efforts. Engage employees from different departments, including engineering, data science, product management, and others, to contribute to those efforts.

According to the Artificial Intelligence Index Report 2019, the top AI ethics challenges include fairness, interpretability and explainability, and transparency.

The following figure shows a more complete list of challenges present in the ethical AI space:

Figure 1.11 – Ethical AI challenges, from the AI Index 2019 Annual Report (p. 149)

The following is a list of issues that non-transparent machine learning algorithms may cause:

Disproportional spread of economic and financial opportunities, including credit discrimination and unequal access to discounts and promotions based on predefined buying habits
Access to information and social circles, such as algorithms that promote news based on socio-economic groups and suggestions to join specific groups or circles
Employment discrimination, including algorithms that filter candidates based on their race, religion, or gender
Unequal use of police force and punishment, including algorithms that predict the possibility of an individual committing a crime in the future based on social status and race
Housing discrimination, including the denial of equal rental and mortgage opportunities to people of color, LGBT groups, and other minorities

AI has brought unprecedented benefits to our society similar to what the industrial revolution did. But with all these benefits, we should be aware of the societal changes that these benefits carry. If the future of driving is self-driving cars, this will mean that driving as a profession will disappear in the foreseeable future. Many other industries will be affected and will cease to exist. It does not mean that progress should not happen, but it needs to happen in a controlled way.

Software is only as perfect as its creators, and flaws in new AI-powered products are inevitable. But if these new applications are the first level in the decision-making process about human lives and destinies, there has to be a way to ensure that we minimize potential harmful consequences. Therefore, deeply understanding our models is paramount. Part of that is reproducibility, which is one of the key factors in minimizing the negative consequences of AI.

Summary

In this chapter, we have discussed a number of important concepts that help define why reproducibility is important and why it should be a part of a successful data science process.

We've learned that data science models are used to analyze historical data as input with a target goal to calculate the most probable and most successful result. We've established that replication, the ability to reproduce the results of a scientific experiment, is one of the fundamental principles of good research and that it is one of the best ways to ensure that your team is doing everything to reduce bias in your models. Bias can creep into a calculation from misrepresentation in a training dataset. Often, this reflects historical and social realities and norms accepted in society. Another way to reduce bias in your training data is to have a diverse team that includes representatives of all genders, races, and backgrounds.

We've learned that data dredging, or fishing, is an unethical technique used by some data scientists to prove a predefined hypothesis by cherry-picking the results of an experiment and only selecting the results that prove the desired outcome and ignoring any inconvenient trends.

We've also learned about the MLOps methodology, a lifecycle of a machine learning application, similar in its principle to the DevOps software lifecycle technique. MLOps includes the following main phases: design, development, training, validation, deployment, and monitoring. All of the phases are continuously repeated, creating a feedback loop that ensures seamless experiment management from design through development and testing to production and post-production phases.

We've also reviewed some of the most important aspects of ethical AI, a discipline of data science that focuses on ethical aspects of artificial intelligence, robotics, and data science. A failure to implement an ethical AI process in your organization might lead to undesirable legal consequences if deployed production models are found to be discriminatory.

In the next chapter, we will learn about the main concepts of the Pachyderm version-control system, which can help you address many of the issues described in this chapter.
