Challenges in Big Data and Traditional AI
This introductory chapter explains in detail why federated learning (FL) is set to become a key technology in the 2020s. You will learn what big data is and why it has been problematic from the perspectives of data privacy, model bias, and drift. A solid understanding of these issues and their solutions will motivate you to embark on a challenging journey to acquire the relevant knowledge and skills, using the following chapters to chart your mastery of FL. After reading this chapter, it will be obvious that a massive paradigm shift in artificial intelligence (AI) and machine learning (ML) is underway, driven by public and business concerns over the current reliance on big data-oriented systems. Without further ado, let us depart!
In this chapter, we will cover the following topics:
- Understanding the nature of big data
- Data privacy as a bottleneck
- Impacts of training data and model bias
- Model drift and performance degradation
- FL as the main solution for data problems
Understanding the nature of big data
In the 2021 Enterprise Trends in Machine Learning survey of 403 business leaders conducted by Algorithmia, 76% of enterprises prioritized AI and ML over other IT initiatives. The COVID-19 pandemic compelled some of those companies to hasten their development of AI and ML, as their chief information officers (CIOs) recounted, and 83% of the surveyed organizations increased their budget for AI and ML year-over-year (YoY), with a quarter of them doing so by over 50%. Improving the customer experience and automating processes, whether through increased revenue or reduced costs, were the main drivers of the change. Other studies, including KPMG’s latest report, Thriving in an AI World, essentially tell the same story.
The ongoing spree of AI and ML development, epitomized by deep learning (DL), was made possible by the advent of big data in the last decade. Equipped with Apache’s open source frameworks Hadoop and Spark, as well as cloud computing services such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, organizations in both the private and public sectors can solve problems by handling massive amounts of data in ways previously unthinkable. Companies and bureaus no longer have to be overcautious in developing data analytic models and designing data warehouses upfront so that relevant data will be stored in appropriate formats. Instead, they can simply pour available raw data into their data lake, expecting that their data scientists will identify valuable variables down the line by checking their correlations with one another.
Big data might seem to be the ultimate solution to a wide range of problems, but as we will see in the following sections, it has several inherent issues. To clearly understand what the issues with big data could be, let’s examine what exactly big data is first.
Definition of big data
Big data refers to vast sets of information growing at an exponential rate. It has become so large, with humans now producing over two quintillion bytes of data daily, that processing it efficiently for ML purposes with traditional data management tools is increasingly difficult. Three Vs are commonly used to define the characteristics of big data, as presented here:
- Volume: Data from various sources such as business transactions, Internet of Things (IoT) devices, social media, industrial equipment, videos, and so on, contribute to the sheer amount of data.
- Velocity: Data speed is also an essential characteristic of big data. Often, data is needed in real time or near real time.
- Variety: Data comes in all formats, such as numeric data, text documents, images, videos, emails, audio, financial transactions, and so on.
The following diagram depicts big data as the intersection of the three Vs:
Figure 1.1 – Big data’s three Vs
In 1880, the United States (US) Census Bureau gathered a lot of data from the census and estimated that it would take 8 years to process that amount of data. The following year, a man named Herman Hollerith invented the Hollerith tabulating machine, which reduced the work needed to process the data. The first data center was built in 1965 to store fingerprint data and tax information.
Big data now
The introduction of data lakes as a concept played a key role in ushering in the massive scales we see when working with data today. Data lakes give companies total freedom to store arbitrary types of data observed during operation, removing a restriction that would otherwise have prevented them from collecting data that ends up being necessary in the future. While this freedom allows data lakes to preserve the maximum potential of the data a company generates, it can also lead to a key problem: complacency in understanding the collected data. The ease of storing different types of data in an unstructured manner encourages a store now, sort out later mentality. Yet the true difficulty of working with unstructured data stems from its processing; this deferred-processing mentality can produce data lakes that become highly cumbersome to sift through and work with due to unrestricted growth.
Raw data is only as valuable as the models and insights that can be derived from it. The central data lake approach leads to cases where derivation from the data is limited by a lack of structure, causing issues ranging from storage inefficiency to genuine intelligence inefficiency due to extraction difficulties. On the other hand, approaches preceding data lakes suffered from a simple lack of access to the amount of data potentially available. The fact that FL avoids both classes of problems is the key argument for FL as the vehicle that will advance big data into the collective intelligence era.
This claim is substantiated by the fact that FL flips the big data flow from collect → derive intelligence to derive intelligence → collect. For humans, intelligence can be thought of as the condensed form of large swaths of experience. In a similar way, deriving intelligence at the source of the generated data, by training a model on the data at the source location, succinctly summarizes the data in a format that maximizes accessibility for practical applications. The late collection step of FL leads to the creation of the desired global intelligence with maximal data access and data storage efficiency. Even cases that use only part of the generated data sources can still greatly benefit from the joint storage of intelligence and data, as this greatly reduces the number of data formats entering the residual data lake.
Triple-A mindset for big data
While many definitions have been proposed with emphasis on different aspects, Oxford professor Viktor Mayer-Schönberger and The Economist senior editor Kenneth Cukier brilliantly elucidated the nature of big data in their 2013 international bestseller, Big Data: A Revolution That Will Transform How We Live, Work, and Think. Big data is not about how big the data in a server is; it is about three major mindset shifts that are interlinked and hence reinforce one another. Their argument boils down to what we can summarize and call the Triple-A mindset for big data, which consists of an abundance of observations, acceptance of messiness, and ambivalence of causality. Let’s take a look at them one by one.
Abundance of observations
Big data doesn’t have to be big in terms of columns and rows or file size. Rather, big data has a number of observations, commonly denoted as n, close or equal to the size of the population of interest. In traditional statistics, collecting data from the entire population—for example, people interested in fitness in New York—was not feasible, and researchers would have to randomly select a sample from the population—for example, 1,000 people interested in fitness in New York. Random sampling is often difficult to perform, and so is justifying a narrow focus on particular subgroups: surveying people around gyms would miss those who run in parks and practice yoga at home, and why gym goers rather than runners and yoga fans? Thanks to the development and sophistication of Information and Communications Technology (ICT) systems, however, researchers today can access data covering approximately all of the population through multiple sources—for example, records of Google searches about fitness. This paradigm of abundance, or n = all, is advantageous since what the data says can be interpreted as a true statement about the population, whereas the old method could only infer such truth at a level of confidence expressed in a p-value, typically required to be under 0.05. Small data provides statistics; big data proves states.
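The contrast between sampling and n = all can be illustrated with a quick simulation. The population and its 30% share of fitness enthusiasts are invented for illustration:

```python
import random

random.seed(42)

# A hypothetical population of one million people: 1 means interested
# in fitness, 0 means not. The true share is 30% by construction.
population = [1] * 300_000 + [0] * 700_000
random.shuffle(population)

# n = all: the computed share is a true statement about the population.
true_share = sum(population) / len(population)

# Traditional sampling: n = 1,000 yields only an estimate, qualified by
# a margin of error (about 1.96 * sqrt(p * (1 - p) / n) at 95% confidence).
sample = random.sample(population, 1_000)
estimate = sum(sample) / len(sample)
margin = 1.96 * (estimate * (1 - estimate) / len(sample)) ** 0.5
```

With access to everyone, `true_share` is a fact; with a sample, `estimate` must always be reported alongside its margin of error.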
Acceptance of messiness
Big data tends to be messy. If we use Google search data as a proxy for someone’s interests, for example, we could mistakenly attribute to them some of the searches made by family or friends on their devices, and the estimated interest will be inaccurate to the degree of the ratio of such searches. On some devices, a significant share of searches may be made by multiple users, such as shared computers at an office or a smartphone belonging to a child whose younger siblings are yet to own one. People may also search for words that pop up in a conversation with someone else, which does not necessarily reflect their own interests. In studies using traditional methods, researchers would have to make sure that such devices are not included in their sample data, because with a small number of observations the mess can significantly degrade the quality of inference. This is not the case in big data studies. Researchers are willing to accept the mess, as its effect diminishes as the number of observations grows toward n = all. On most devices, Google searches are made by the owner themselves most of the time, and the impact of searches in other contexts does not matter much.
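A simulation sketch of why random messiness becomes tolerable at scale: with a 40% true interest rate and 10% of observations coming from other people, the large-n estimate settles near the blended rate (0.9 × 0.4 + 0.1 × 0.5 = 0.41), close to the true 40% because the mess is a small fraction. Note that only the random scatter shrinks with n; the small residual bias from the mess itself does not. All rates here are invented for illustration:

```python
import random

random.seed(0)

def observed_interest(true_rate=0.4, mess_rate=0.1):
    """One observation: with probability mess_rate, the search was made
    by someone other than the device owner, so it is pure noise."""
    if random.random() < mess_rate:
        return random.random() < 0.5    # a friend or family member
    return random.random() < true_rate  # the owner's genuine interest

def estimated_rate(n):
    return sum(observed_interest() for _ in range(n)) / n

# A small sample is noisy; a huge one lands close to the blended rate.
small_n = estimated_rate(100)
large_n = estimated_rate(1_000_000)
```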
Ambivalence of causality
Big data is often used to study correlation rather than causation—in other words, it usually does not tell why but only what. For many practical questions, correlation alone can provide the answer. Mayer-Schönberger and Cukier give several examples in their book, among which is Fair Isaac Corporation’s Medication Adherence Score, established in 2011. In an era where people’s behavioral patterns are datafied, collecting n = all observations for the variables of interest is possible, and the correlation found among them is powerful enough to direct our decision-making. There is no need to know people’s psychological scores of consistency or conformity that cause their adherence to medical prescriptions; by looking at how they behave in other aspects of life, we can predict whether they will follow the prescription or not.
By embracing the triple mindset of abundance, acceptance, and ambivalence, enterprises and governments have generated intelligence across tasks from pricing services to recommending products, optimizing transportation routes, and identifying crime suspects. Nevertheless, that mindset has been challenged in recent years, as shown in the following sections. First, let’s glimpse into how the abundance of observations often taken for granted is currently under pressure.
Data privacy as a bottleneck
FL is often said to be one of the most popular privacy-preserving AI technologies because private data does not have to be collected or shared with third-party entities to generate high-quality intelligence. In this section, we therefore discuss data privacy, the bottleneck that FL tries to resolve in order to create high-quality intelligence.
What is data privacy? In May 2021, HCA Healthcare announced that the company had struck a deal to share its patient records and real-time medical data with Google. Various media quickly responded by warning the public about the deal, as Google had already drawn criticism over its Project Nightingale, in which the tech giant allegedly exploited the sensitive data of millions of American patients. Given that over 80% of the public believes the potential risks of data collection by companies outweigh the benefits, according to a 2019 poll by Pew Research Center, data sharing projects of such a scale are naturally seen as a threat to people’s data privacy.
Data privacy, also known as information privacy, is the right of individuals to control how their personal information is used; it mandates that third parties handle, process, store, and use such information properly, in accordance with the law. It is often confused with data security, which ensures that data is accurate, reliable, and accessible only to authorized users. In the case of Google accounts, data privacy regulates how the company can use the account holders’ information, while data security requires measures such as password protection and 2-step verification. In explaining these two concepts, data privacy managers often use the analogy of a window for security and a curtain for privacy: data security is a prerequisite for data privacy. Put together, they comprise data protection, as shown in the following diagram:
Figure 1.2 – Data security versus data privacy
We can see from the preceding diagram that while data security limits who can access data, data privacy limits what can be done with the data. Understanding this distinction is very important because data privacy obligations can multiply the consequences of failures in data security. Let’s look into how.
Risks in handling private data
Failing at data protection is costly. According to IBM’s Cost of a Data Breach Report 2021, the global average cost of a data breach that year was US dollars (USD) $4.24 million, considerably higher than the $3.86 million of a year earlier and the highest in the 17-year history of the report; the increased number of people working remotely in the aftermath of the COVID-19 outbreak is considered a major reason for the spike. The top five industries by average total cost are healthcare, finance, pharmaceuticals, technology, and energy. Nearly half of the breaches in the year involved customer personally identifiable information (PII), which costs $180 per record on average. Once customer PII is breached, negative consequences such as system downtime during the response, loss of customers, the need to acquire new customers, reputation losses, and diminished goodwill ensue; hence, the hefty cost.
The IBM study also found that failing to comply with regulations for data protection was top among the factors that amplify data breach costs (https://www.ibm.com/downloads/cas/ojdvqgry).
Increased data protection regulations
As technology advances, the need to protect customer data has become more critical. Consumers require and expect privacy protection in every transaction, and many simple activities, whether banking online or using a phone app, can put personal data at risk.
Governments worldwide were initially slow to react by creating laws and regulations to protect personal data from identity theft, cybercrime, and data privacy violations. However, times are now changing as data protection laws are beginning to take shape globally.
There are several drivers for the increase in regulations, including the growth of enormous amounts of data and the need for stronger data security and privacy to protect users from nefarious activities such as identity theft.
Let’s look at some of the measures taken toward data privacy in the following sub-sections.
General Data Protection Regulation (GDPR)
The General Data Protection Regulation (GDPR) by the European Union is regarded as the first data protection regulation in the modern data economy and was emulated by many countries crafting their own. GDPR was proposed in 2012, adopted by the EU Council and Parliament in 2016, and enforced in May 2018. It superseded the Data Protection Directive that had been adopted in 1995.
What makes GDPR epoch-making is its stress on the protection of PII, including people’s names, locations, racial or ethnic origin, political or sexual orientation, religious beliefs, association memberships, and genetic/biometric/health information. Organizations and individuals both in and outside the EU have to follow the regulation when dealing with the personal data of EU residents. There are seven principles of GDPR, among which six were inherited from the Data Protection Directive; the new principle is accountability, which demands data users maintain documentation about the purpose and procedure of personal data usage.
GDPR has shown the public what the consequences of its violation can be. Depending on the severity of non-compliance, a GDPR fine can range from the higher of 2% of global annual turnover and €10 million up to the higher of 4% of global annual turnover and €20 million. In May 2018, thousands of Europeans filed a complaint against Amazon.com Inc. through the French organization La Quadrature du Net, also known as Squaring the Net in English, accusing the company of using its advertisement targeting system without customer consent. After 3 years of investigation, Luxembourg’s National Commission for Data Protection (CNDP) made headlines around the world: it issued Amazon a €746 million fine. Similarly, WhatsApp was fined by Ireland’s Data Protection Commission in September 2021 for GDPR infringement; again, the investigation had taken 3 years, and the fine amounted to €225 million.
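The two fine tiers can be expressed as a simple calculation. This is a sketch of the upper bounds described above, not legal advice; the turnover figures are invented:

```python
def gdpr_max_fine(global_annual_turnover_eur, severe=False):
    """Upper bound of a GDPR fine: the *higher* of a percentage of
    global annual turnover and a fixed amount, per tier."""
    if severe:
        return max(0.04 * global_annual_turnover_eur, 20_000_000)
    return max(0.02 * global_annual_turnover_eur, 10_000_000)

# For a small firm the fixed amount dominates; for a giant, the percentage.
small_firm = gdpr_max_fine(50_000_000, severe=True)       # max(2M, 20M)
tech_giant = gdpr_max_fine(10_000_000_000, severe=True)   # max(400M, 20M)
```

The "whichever is higher" wording means a small company cannot escape with a trivial percentage-based fine, while a large company cannot hide behind the fixed amount.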
Currently, in the US, a majority of states have privacy protections in place or soon will. Additionally, several states have strengthened existing regulations, such as California, Colorado, and Virginia. Let’s look at each to get an idea of these changes.
California Consumer Privacy Act (CCPA)
The state of California followed suit. The California Consumer Privacy Act (CCPA) became effective on January 1, 2020. As the name suggests, the aim of the regulation is to protect consumers’ PII just as GDPR does. Compared to GDPR, the scope of the CCPA is significantly limited. The CCPA is applicable only to for-profit organizations that collect data from over 50,000 points (residents, households, or devices in the state) in a year, generate annual revenue over $25 million, or make half of their annual revenue by selling such information. However, CCPA infringement can be much more costly than GDPR infringement since the former has no ceiling for its fine ($2,500 per record for each unintentional violation; $7,500 per record for each intentional violation).
Colorado Privacy Act (CPA)
Under the Colorado Privacy Act (CPA), starting July 1, 2024, data collectors and controllers will have to follow universal opt-outs that users have selected for generating targeted advertising and sales. This rule protects residents in Colorado from targeted sales and advertising as well as certain types of profiling.
Virginia Consumer Data Protection Act (CDPA)
Virginia’s Consumer Data Protection Act (CDPA) will make several changes to increase security and privacy from January 1, 2023. These changes will apply to organizations that do business in Virginia or with Virginia residents. Data collectors will need to obtain consumers’ approval to utilize their private data. The changes also call for assessing the adequacy of AI vendors’ privacy and security, which may require the removal of that data.
These are just a few simple examples of how data regulations will take shape in the US. What does this look like for the rest of the world? Some estimate that by 2024, 75% of the global population will have personal data covered by privacy regulations of one type or another.
Another example of major data protection regulation is Brazil’s Lei Geral de Proteção de Dados Pessoais (LGPD) which has been in force since September 2020. It replaced dozens of laws in the country related to data privacy. LGPD was modeled after GDPR, and the contents are almost identical. In Asia, Japan was the first country to introduce a data protection regulation: the Act on the Protection of Personal Information (APPI) was adopted in 2003 and amended in 2015. In April 2022, the latest version of APPI was put in force to address modern concerns over data privacy.
From privacy by design to data minimalism
Organizations have been acclimatizing to these regulations. TrustArc’s Global Privacy Benchmarks Survey 2021 found that the number of enterprises with a dedicated privacy office is increasing: 83% of respondents in the survey had a privacy office, whereas the rate was only 67% in 2020. 85% had a strategic and reportable privacy management program in place, yet 73% of them believed that they could do more to protect privacy. Their eagerness is hardly surprising, as 34% of the respondents claimed that they had faced a data breach in the previous 3 years, the costly consequences of which were mentioned previously in this chapter. A privacy office is typically led by a data protection officer (DPO), who is responsible for the company’s Data Protection Impact Assessment (DPIA) in order to comply with regulations such as GDPR that demand accountability and documentation of personal data handling. DPOs are also responsible for monitoring and ensuring that their organizations treat personal data in compliance with the law, and the top management and board are supposed to provide the necessary support and resources for DPOs to complete their task.
In the face of GDPR, the current trend in data protection is shifting toward data minimalism. Data minimalism in this context does not necessarily mean minimizing the size of data; it pertains more directly to minimizing PII factors in data so that individuals cannot be identified from its data points. Data minimalism therefore affects the AI sector’s ability to create high-performing AI applications, because a shortage of data variety in the ML process generates model biases and unsatisfying prediction performance.
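In practice, data minimalism can be as simple as filtering records down to the fields a task needs while stripping direct identifiers before storage. A minimal sketch with hypothetical field names:

```python
# Hypothetical set of direct identifiers to exclude from storage.
PII_FIELDS = {"name", "email", "street_address", "birth_date"}

def minimize(record, needed_fields):
    """Retain only the fields the task needs, never direct PII."""
    return {k: v for k, v in record.items()
            if k in needed_fields and k not in PII_FIELDS}

raw = {"name": "Jane Doe", "email": "jane@example.com",
       "age_band": "30-39", "purchases_last_30d": 4}
stored = minimize(raw, needed_fields={"email", "age_band",
                                      "purchases_last_30d"})
```

Even though `email` was requested, it is excluded as PII; only `age_band` and `purchases_last_30d` survive. Real systems layer further techniques (pseudonymization, aggregation) on top of simple field filtering.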
The abundance mindset for big data introduced at the beginning of the chapter has thus been disciplined by the public concern over data privacy. The risk of being fined for violating data protection regulations, coupled with the wasteful cost of having a data graveyard, calls for practicing data minimalism rather than data abundance.
That is why FL is becoming a must-have solution for many AI solution providers, such as those in the medical sector, that are struggling with public concerns over data privacy, which becomes an issue whenever a third-party entity needs to collect private data to improve the quality of ML models and their applications. As mentioned, FL is a promising framework for privacy-preserving AI because learning from the data can happen anywhere; even if the data is not available to the AI service providers, all we have to do is collect and aggregate the trained ML models in a consistent way.
Now, let’s consider another facet of the Triple-A mindset for big data being challenged: acceptance of messy data.
Impacts of training data and model bias
The sheer volume of big data annihilates the treacherous reality of garbage in, garbage out. Or does it? In fact, the messiness of data can only be accepted if enough data from a variety of sources and distributions can be fully learned without causing any biases in the outcomes of the learning. Training on big data in a centralized location takes a lot of time as well as huge computational and storage resources. We would also need methods to measure and reduce model bias without directly collecting and accessing sensitive and private data, which would conflict with some of the privacy regulations discussed previously. FL is also a form of distributed and collaborative learning, which becomes critical for eliminating data and model bias and absorbing the messiness of the data. With collaborative and distributed learning, we can significantly increase data accessibility and the efficiency of an entire learning process that is often very expensive and time-consuming. It gives us a chance to break through the limitations that big data training used to have, as discussed in the following sections.
Expensive training of big data
According to Flexera’s 2022 State of the Cloud report (https://www.flexera.com/blog/cloud/cloud-computing-trends-2022-state-of-the-cloud-report), 37% of enterprises spend more than $12 million and 80% spend over $1.2 million per year on public cloud. Training over the cloud is not cheap, and this cost can be expected to rise significantly along with the increasing demand for AI and ML. Sometimes, big data cannot be fully trained on for ML because of the following issues:
- Big data storage: Big data storage is an architecture for compute and storage that collects and manages large amounts of datasets for AI applications or real-time analytics. Worldwide enterprise companies are paying more than $100 billion just for cloud storage and data center costs (https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization/). While some of the datasets are critical for the applications they provide, what they really want is often business intelligence that can be extracted from the data, not just the data itself.
- Significant training time: Building and training an ML model that can be delivered as an authentic product basically takes a significant amount of time, not only for the training process but also for the preparation of the ML pipelines. Therefore, in many cases, the true value of the intelligence is going to be lost by the time the ML model is delivered.
- Huge computation: Training an ML model often consumes significant computational resources. For example, the ML task of manipulating a Rubik’s Cube with a robotic hand has sometimes required more than 1,000 computers, plus a dozen machines running specialized graphics chips, for several months.
- Communications latency: To form big data, especially in the cloud, a significant amount of data needs to be transferred to the server, which in itself causes communications latency. In most use cases, FL requires much less data to be transferred from local devices or learning environments to a server called an aggregator, which synthesizes the local ML models collected from those devices.
- Scalability: In traditional centralized systems, scalability becomes an issue because of the complexity of big data and its costly infrastructures such as huge storage and computing resources in the cloud server environment. In an FL server, only an aggregation is conducted to synthesize the multiple local models that have been trained to update the global model. Therefore, both the system and learning scalability increase significantly as ML training is conducted on edge devices in a distributed manner, not only in a single centralized learning server.
FL effectively utilizes distributed computational resources that can be used for light training of the ML models. Whether training happens on actual physical devices or virtual instances of the cloud system, parallelizing the model training process into distributed environments often accelerates the speed of learning itself.
In addition, once the trained models are collected, the FL system can quickly synthesize them to generate an updated ML model called a global model that absorbs enough learnings at the edge sides, and thus delivering the intelligence in near real time is possible.
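The synthesis step just described can be sketched as federated averaging: local model weights are combined, weighted by each device’s data size, and no raw data ever reaches the server. This is a minimal sketch of the aggregation idea, not a complete FL system; the weight values and sample counts are invented:

```python
def aggregate(local_models, sample_counts):
    """Synthesize a global model as the weighted average of local
    model parameters; weights reflect each client's data volume."""
    total = sum(sample_counts)
    dim = len(local_models[0])
    return [sum(w[i] * n for w, n in zip(local_models, sample_counts)) / total
            for i in range(dim)]

# Three edge devices report locally trained weights and sample counts.
local_models = [[0.9, 1.1],
                [1.2, 0.8],
                [1.0, 1.0]]
sample_counts = [100, 300, 600]
global_model = aggregate(local_models, sample_counts)
```

Clients holding more data pull the global model toward their parameters, which is why the third device (600 samples) dominates the result here.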
Model bias and training data
Yann LeCun, the 2018 Turing Award winner for his outstanding contribution to the development of DL, says “ML systems are biased when data is biased” (https://twitter.com/ylecun/status/1274782757907030016). The remark concerns a computer vision (CV) face upsampling system whose network was pre-trained on the Flickr-Faces-HQ dataset compiled by the Nvidia team; because the dataset mainly contains pictures of white people, many faces are reconstructed as white. The model’s architecture is not what mandates this output. Hence, the conclusion is that a racially skewed dataset led a neutral model to produce biased outcomes.
Productive conversations about AI and ML biases have been led by the former lead of AI Ethics at Google. The 2018 publication of the Gender Shades paper demonstrated race and gender bias in major facial recognition models, and lawmakers in Congress have sought to prohibit the use of the technology by the US federal government. Tech companies including Amazon, IBM, and Microsoft also agreed to suspend or terminate sales of facial recognition models to the police. Scientists and engineers are encouraged to take an interventionist approach to data collection: specify the objectives of model development, form a strict policy for data collection, and conduct a thorough appraisal of the collected data to avoid biases—details are available on the FATE/CV website (https://sites.google.com/view/fatecv-tutorial/home).
FL could be one of the most promising ML technologies for overcoming data-silo issues. Very often, the data is not even accessible or usable for training, causing significant bias in data and models. FL helps overcome bias by resolving the data privacy and silo issues that stand in the way of fundamentally avoiding data bias. In this context, FL is becoming a breakthrough in the implementation of big data services and applications, as thoroughly investigated in https://arxiv.org/pdf/2110.04160.pdf.
Also, there are several techniques that try to mitigate model bias in FL itself, such as Reweighing and Prejudice Remover, both detailed in https://arxiv.org/pdf/2012.02447.pdf.
Model drift and performance degradation
Model drift generally refers to the degradation of ML model performance because of changes in data and in the relationships between input and output (I/O) variables; it is also known as model decay. Model drift can be addressed by continuous learning that adapts to the latest changes in datasets or environments in near real time. One important aspect of FL is that it realizes a continuous learning framework by updating an ML model, in a consistent manner, whenever learning happens in the local distributed environment. That way, FL can resolve the situation often seen in enterprise AI applications where the intelligence is useless by the time it is delivered to production.
We will now touch on how models could get degraded or stop working, and then some of the current efforts of model operations (ModelOps) to continuously improve the performance of models and achieve sustainable AI operations.
How models can stop working
Any AI and ML model with fixed parameters, or weights, generated from the training data and adjusted to the test data can perform fairly well when deployed in an environment where the model receives data similar to the training and test data. If an autonomous driving model is well trained with data recorded during sunny daytime, the model can drive vehicles safely on sunny days because it is doing what it has been trained to do. On a rainy night, however, nobody should be in or near the vehicle if it is autonomously driven: the model is fed totally unfamiliar, dark, and blurry images, and its decisions will be far off track, hence the name model drift. Again, model drift is not likely to happen if the model is deployed in an environment similar to the training and testing environment and if the environment does not change significantly over time. But in many business situations, that assumption does not hold, and model drift becomes a serious issue.
There are two types of model drift: data drift and concept drift. Data drift happens when the input data to a deployed model is significantly different from the data the model was trained on; in other words, a change in the data distribution is the cause of data drift. The daytime-trained autonomous driving model performing poorly at night is an example of data drift. Another example would be an ice-cream sales prediction model trained in California being deployed in New Zealand: seasonality in the southern hemisphere is the opposite of that in the northern hemisphere, so the model would estimate low sales for summer and high sales for winter, contrary to the actual sales volume.
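A data drift check like the ones above can be sketched with a simple distribution comparison. This is a minimal sketch using only the standard library; the feature, the simulated values, and the three-standard-deviation rule are illustrative assumptions (production systems typically use proper statistical tests such as the Kolmogorov-Smirnov test):

```python
# Minimal sketch: flag data drift when a live feature's mean moves far
# from the training distribution. All values here are simulated.
import random
import statistics

random.seed(0)

# Reference data the model was trained on (e.g., daytime image brightness)
train_feature = [random.gauss(0.8, 0.1) for _ in range(1000)]
# Live data arriving in production (e.g., nighttime image brightness)
live_feature = [random.gauss(0.2, 0.1) for _ in range(1000)]

train_mean = statistics.mean(train_feature)
train_std = statistics.stdev(train_feature)

# Illustrative rule: flag drift when the live mean is more than
# three training standard deviations away from the training mean
z_score = abs(statistics.mean(live_feature) - train_mean) / train_std
drift_detected = z_score > 3.0
print(drift_detected)  # True: the input distribution has shifted
```

The same idea extends to per-feature tests over sliding windows of production data.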
Concept drift, on the other hand, is a result of changes in how variables correlate with each other. In the terminology of statistics, this implies that the data-generating process has been altered. And this is what Google Flu Trends (GFT) suffered from, as the author of The Undercover Economist put it in the following Financial Times article: https://www.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0#axzz30qfdzLCB.
Prior to that period, search queries were meaningfully correlated with the spread of flu, as it was mainly people who suspected they were infected who typed those words into the browser, and therefore the model worked successfully. This may no longer have been the case in 2013, when people in other categories, such as those taking precautions against a potential pandemic or those who were simply curious, were searching for those words, possibly led to do so by Google's recommendations. This concept drift likely made GFT overestimate the spread of flu vis-à-vis the medical reports provided by the Centers for Disease Control and Prevention (CDC).
Either through data or through concept, model drift causes model performance degradation, and it occurs because of our focus on correlation. The ground truth, in data science parlance, does not mean something like the universal truth of hard sciences such as physics and chemistry, that is, causation. It is merely a true statement about how the variables in given data correlate with each other in a particular environment, and it provides no guarantee that the correlation holds when the environment changes or differs. What we estimate as the ground truth can thus vary over time and across locations, just as the ground itself has been reshaped by seismic events throughout history and across geographies.
Continuous monitoring – the price of letting causation go
In a survey commissioned by Redis Labs (https://venturebeat.com/business/redis-survey-finds-ai-is-stressing-it-infrastructure-to-breaking-point/), about half of the respondents cited model reliability (48%), model performance (44%), accuracy over time (57%), and latency of running the model (51%) as the top challenges in getting models deployed. Given the risk and concern of model drift, AI and ML model stakeholders need to work on two additional tasks after deployment. First, model performance must be continuously monitored to detect model drift; both data drift and concept drift can take place gradually or suddenly. Second, once model drift is detected, the model needs to be retrained with new training data; when concept drift occurs, even a new model architecture may be necessary to upgrade the model.
In order to address these requirements, a new ML principle called Continuous Delivery for Machine Learning (CD4ML) has been proposed. In the CD4ML framework, a model is first coded and trained with training data. The model is then tested with a separate dataset and evaluated based on some metrics, and more often than not, the best model is selected from multiple candidates. Next, the selected model is productionized with a further test to make sure that it performs well after deployment, and once it passes the test, it is deployed. Here, the monitoring process starts. When model drift is observed, the model is retrained with new data or given a new architecture, depending on the severity of the drift. If you are familiar with software engineering, you might have noticed that CD4ML is the adoption of continuous integration/continuous delivery (CI/CD) in the field of ML. In a similar vein, ModelOps, an AI and ML operational framework stemming from the development-operations (DevOps) software engineering framework, is gaining popularity. ModelOps bridges ML operations (MLOps: the integration of data engineering and data science) and application engineering; it can be seen as the enabler of CD4ML.
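The monitor-and-retrain cycle at the heart of CD4ML can be sketched as a simple loop. This is a minimal illustration, not a CD4ML implementation; the window size, the accuracy threshold, and the simulated outcome stream are all assumptions made for the example:

```python
# Sketch of post-deployment monitoring: track accuracy over a sliding
# window of recent predictions and signal when retraining is needed.
from collections import deque

WINDOW = 100          # number of recent predictions to evaluate (assumed)
THRESHOLD = 0.90      # minimum acceptable accuracy (assumed)

recent_outcomes = deque(maxlen=WINDOW)  # 1 = correct, 0 = incorrect

def record_prediction(correct: bool) -> bool:
    """Record one prediction outcome; return True if retraining is needed."""
    recent_outcomes.append(1 if correct else 0)
    if len(recent_outcomes) < WINDOW:
        return False  # not enough evidence yet
    accuracy = sum(recent_outcomes) / WINDOW
    return accuracy < THRESHOLD

# Simulate gradual drift: the model starts accurate, then degrades
needs_retrain = False
for i in range(200):
    correct = i < 120  # first 120 predictions correct, then all wrong
    if record_prediction(correct):
        needs_retrain = True
        break
print(needs_retrain)  # True: monitoring has detected the degradation
```

In a real CD4ML pipeline, this signal would trigger automated retraining, re-evaluation, and redeployment.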
The third factor of the Triple-A mindset for big data, ambivalence of causality, lets us focus on correlation and has helped us build AI and ML models rapidly over the last decade; finding correlation is much easier than discovering causation. But for any AI or ML model that has been telling us what we need to know, from people's Google search patterns or any other data, we have to check whether it still works today, and we will have to check again tomorrow.
That is why FL is one of the important approaches to continuous learning. When creating and operating an FL system, it is also important to equip the system with ModelOps functionalities, as the critical role of FL is to keep improving models constantly, in a collaborative manner, across various learning environments. It is even possible to realize a crowdsourced learning framework with FL, in which participants on the platform take the desired ML model, adapt and train it locally, and return the updated model to the FL server for aggregation. With an advanced model aggregation framework that filters out poisonous ML models that could degrade the current model, FL can consistently integrate the learning of others and thus realize the sustainable continuous learning operation that is key for a platform with ModelOps functionalities.
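Filtering poisonous updates before aggregation can be illustrated with a simple outlier rule. This is only a sketch of the general idea, not a specific published defense; the median-distance rule, the cutoff value, and the toy one-parameter "models" are all assumptions made for the example:

```python
# Sketch: discard client updates whose weights lie far from the
# median of all submitted updates (toy one-parameter models).
import statistics

def filter_outlier_updates(updates, cutoff=5.0):
    """Keep only updates whose first parameter lies near the median."""
    values = [u[0] for u in updates]
    med = statistics.median(values)
    # Median absolute deviation as a robust spread estimate
    spread = statistics.median(abs(v - med) for v in values) or 1e-9
    return [u for u in updates if abs(u[0] - med) / spread <= cutoff]

# Four honest clients and one poisoned update with an extreme weight
updates = [[0.5], [0.52], [0.48], [0.51], [25.0]]
kept = filter_outlier_updates(updates)
print(len(kept))  # 4: the poisoned update is discarded
```

Real aggregation defenses operate on full weight vectors and use more principled robust statistics, but the principle of down-weighting or discarding anomalous contributions is the same.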
FL as the main solution for data problems
So far in this chapter, we have confirmed that big data has issues to be addressed. Data privacy must be preserved in order to protect not only individuals but also data users, who would face the risk of data breaches and subsequent fines. Biases in a set of big data can significantly affect inference through proxies, even when factors such as gender and race are omitted, and a focus on correlation rather than causation makes predictive models vulnerable to model drift.
Here, let us discuss the difference between a traditional big data ML system and an FL system in terms of their architectures, processes, issues, and benefits. The following diagram depicts a visual comparison between a traditional big data ML system and an FL system:
Figure 1.3 – Comparison between traditional big data ML system and FL system
In the traditional big data system, data is gathered to create large data stores. These large data stores are used to solve a specific problem using ML. The resulting model displays strong generalizability due to the volume of data it is trained on and is eventually deployed.
However, continuous data collection uses large amounts of communication bandwidth. In privacy-focused applications, the transmission of data may be banned entirely, making model creation impossible. Training large ML models on big data stores is computationally expensive, and traditional centralized training efficiency is limited by single-machine performance. Slow training processes cause long delays between incremental model updates, leading to a lack of flexibility in accommodating new data trends.
On the other hand, in an FL system, ML training is performed directly at the location of the data. The resulting trained models are collected at the central server. Aggregation algorithms are used to produce an aggregated model from the collected models. The aggregated model is sent back to the data locations for further training.
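The aggregation step described above is commonly realized with federated averaging (FedAvg), where local models are averaged with weights proportional to each client's data size. The following is a minimal sketch with toy values; the two-parameter models and sample counts are illustrative assumptions:

```python
# Sketch of federated averaging: the server combines locally trained
# model weights, weighted by how much data each client holds.
def federated_average(local_models, num_samples):
    """Aggregate local model weights, weighted by each client's data size."""
    total = sum(num_samples)
    num_params = len(local_models[0])
    aggregated = [0.0] * num_params
    for weights, n in zip(local_models, num_samples):
        for i, w in enumerate(weights):
            aggregated[i] += w * (n / total)
    return aggregated

# Three clients trained locally and send only their model weights
client_models = [[0.2, 1.0], [0.4, 2.0], [0.6, 3.0]]
client_samples = [100, 100, 200]  # client 3 holds twice as much data

global_model = federated_average(client_models, client_samples)
print(global_model)  # approximately [0.45, 2.25]
```

Note that only weights, never raw data, cross the network, and the client with more data pulls the global model toward its local solution.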
FL approaches often incur overhead in setting up and maintaining training performance in distributed-system settings. Even with this somewhat more complicated architecture, however, the benefits outweigh the added complexity. Training is performed at the data location, so raw data is never transmitted, preserving data privacy. Training can be performed asynchronously across a variable number of nodes, resulting in efficient and easily scalable distributed learning. Only model weights are transmitted between the server and the nodes, so FL is communication-efficient. Advanced aggregation algorithms can maintain training performance even in restricted scenarios and increase efficiency in standard ML scenarios, too.
The vast majority of AI projects do not seem to get delivered, or simply fall short altogether. To deliver an authentic AI application or product, all the issues discussed previously need to be taken seriously. It is clear that FL, together with other key technologies for handling local data processed by the ML pipeline and engine, is becoming a critical solution for resolving data-related problems in a continuous and collaborative manner.
How can we harness the power of AI and ML to optimize the technical system for society in its entirety, that is, bring about a more joyous, comfortable, convenient, and safe world while remaining data-minimalistic and ethical and delivering improvements continuously? We contend that the key is a collective intelligence, or intelligence-centric, platform, also discussed in Chapter 10, Future Trends and Developments. In subsequent chapters of the book, we introduce the concept, design, and implementation of an FL system as a promising technology for orchestrating collective intelligence with networks of AI and ML models to fulfill the requirements discussed so far.
Summary
This chapter provided an overview of how FL can solve many big data issues, starting from the definition and nature of big data: an abundance of observations, an acceptance of messiness, and an ambivalence of causality.
We have learned about privacy regulations in a variety of forms across many regions, and about the risk of data breaches and privacy violations that can lead to loss of profits and become a bottleneck in creating authentic AI applications. FL, by design, does not collect any raw data, so it can preserve data privacy and comply with those regulations.
In addition, with an FL framework, we can reduce the inherent bias that affects the performance of ML models and minimize model drift through continuous learning. A distributed and collaborative learning framework such as FL is thus required for a more cost-effective and efficient approach to ML.
This introductory chapter concluded with the potential of FL as a primary solution for the aforementioned big data problems based on the paradigm-shifting idea of collective intelligence that could potentially replace the current mainstream data-centric platforms.
In the next chapter, we will see where in the landscape of data science FL fits and how it can open a new era of ML.
Further reading
To learn more about the topics that were covered in this chapter, please take a look at the following references:
- Algorithmia. (2021). 2021 Enterprise Trends in Machine Learning. Seattle: Algorithmia.
- Mayer-Schönberger, V. and Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Boston/New York: Eamon Dolan/Houghton Mifflin Harcourt.
- The Economist. (2010, February 27). Data, data everywhere. The Economist.
- Data Privacy Manager. (2021, October 1). Data Privacy vs. Data Security [definitions and comparisons]. Data Privacy Manager.
- IBM. (2021). Cost of a Data Breach Report 2021. New York: IBM.
- Burgess, M. (2020, March 24). What is GDPR? The summary guide to GDPR compliance in the UK. Wired.
- TrustArc. (2021). Global Privacy Benchmarks Survey 2021. Walnut Creek: TrustArc.
- Auxier, B., Rainie, L., Anderson, M., Perrin, A., Kumar, M. and Turner, E. (2019, November 15). Americans and Privacy: Concerned, Confused and Feeling Lack of Control Over Their Personal Information. Pew Research Center.
- Hes, R. and Borking, J. (1995). Privacy-Enhancing Technologies: The Path to Anonymity. Hague: Information and Privacy Commissioner of Ontario.
- Goldsteen, A., Ezov, G., Shmelkin, R., Moffie, M. and Farkash, A. (2021). Data minimization for GDPR Compliance in machine learning models. AI and Ethics, 1-15.
- Knight, W. (2019, November 19). The Apple Card Didn’t ‘See’ Gender—and That’s the Problem. Wired.
- Gebru, T. and Denton, E. (2020). Tutorial on Fairness Accountability Transparency and Ethics in Computer Vision at CVPR 2020. Available online at https://sites.google.com/view/fatecv-tutorial/home.
- Ukanwa, K. (2021, May 3). Algorithmic bias isn’t just unfair — it’s bad for business. The Boston Globe.
- O’Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York: Crown.
- Blackman, R. (2020, October 15). A Practical Guide to Building Ethical AI. Harvard Business Review.
- Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M. S. and Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014.
- Anderson, C. (2008, June 23). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired.
- Butler, D. (2013). When Google got flu wrong. Nature 494, 155–156.
- Harford, T. (2014, March 28). Big data: are we making a big mistake?. Financial Times.
- Dral, E. and Samuylova, E. (2020, November 12). Machine Learning Monitoring, Part 5: Why You Should Care About Data and Concept Drift. Evidently AI Blog.
- Forrester Consulting. (2021). Deploy ML Models To In-Memory: Databases For Blazing Fast Performance. Retrieved from https://redis.com/wp-content/uploads/2021/06/forrester-ai-opportunity-snapshot.pdf.
- Sato, D., Wider, A. and Windheuser, C. (2019, September 19). Continuous Delivery for Machine Learning: Automating the end-to-end lifecycle of Machine Learning applications. Retrieved from martinFowler.com at https://martinfowler.com/articles/cd4ml.html.
- Verma, D. C. (2021). Federated AI for Real-World Business Scenarios. New York: CRC Press.
- Bostrom, R. P. and Heinen, J. S. (1977). MIS problems and failures: A socio-technical perspective. Part I: The causes. MIS Quarterly, 1(3), pp. 17.
- Weld, D. S., Lin, C. H. and Bragg, J. (2015). Artificial intelligence and collective intelligence. Handbook of Collective Intelligence, 89-114.
- Abay, A., Zhou, Y., Baracaldo, N., Rajamoni, S., Chuba, E. and Ludwig, H. Mitigating Bias in Federated Learning. Available at https://arxiv.org/pdf/2012.02447.pdf.