In this chapter, we will discuss some concepts and challenges associated with analytics on Internet of Things (IoT) data. We will cover the following topics:
- The analytics maturity model
- Defining the IoT
- What is different about IoT data?
- Data volume
- Problems with time
- Problems with space
- Data quality
- Analytics challenges
- Concerns with finding business value
The tense white-yellow of the fluorescent ceiling lights presses down on you while you sit in your cubicle and stare at the monitors on your desk. You sense it is now night outside but can't see over the fabric walls to know for sure. You stare at the long list of filenames on one screen and the plain text rows of opaque sensor data on the other screen.
Your boss has just left to brood angrily somewhere in the office, and you are not sure where. He had been glowering over your shoulder.
"We spent $20 million in telecommunication and consulting fees last year just to get this data! The hardware costs $20 per unit. We've been getting data, and it has been piling up costing us $10,000 a month. There are 20 TB of files - that's big data, isn't it? And we can't seem to do anything with it?" he had said.
"This is ridiculous!" he continued. "It was supposed to generate $100 million in new revenue. Where is our first dollar? Why can't you do anything with it? I have five consultants a week calling me to tell me they can handle it - they'll even automate it. Maybe we should just pick one and hope they aren't selling us snake oil."
You know he does not really blame you. You were a whiz with Excel and knew how to query databases. A lot of analytics requests went to you. When the CEO decided the company needed a big data guy, they hired a VP out of Silicon Valley. But the new VP ended up taking a position with a different Silicon Valley company the day before he was supposed to start at your company.
You were hastily moved into the new analytics group. A group of one - you. It was to be a temporary shift until another VP was found. That was six months ago. The company is freezing funds for outside training and revenues are looking tight. So, no training for you.
Although many know the terms, no one in the company actually understands what Hadoop is or how to even start using this thing called machine learning. But others more and more seem to expect you to not only know it but already be doing it.
Executives have been reading articles in HBR and Forbes about the huge potential of the IoT combined with Artificial Intelligence, or AI. They feel like the company will be left behind, and soon, if it does not have its own IoT big data solution incorporating AI. Your boss is feeling the pressure. Executives have several ideas for him where AI can be used. They seem to think that getting the idea is the hard part and that implementation should be easy. Your boss is worried about his job, and it rolls downhill to you.
Your screen on the left looks like this:
The list goes on and on for several pages. You have been able to combine several files and do some pivot tables and charting in Excel. But it takes a lot of your time, and you can only realistically handle a month or two worth of data. The questions are coming in faster than your ability to answer them. You and your boss have been talking about bringing in temps to do the work - they don't really need to understand it, just follow the steps that you outline for them.
Your screen on the right looks like this:
Your IT department has been consolidating lots of little files into several very large ones. The filesystem was being overloaded by the number of files, so the solution was to consolidate. Unfortunately, for you, many files are now too large to open in Excel, which limits what you can do with them. You end up doing more analytics on recent data simply because it is much easier (the files are still small).
Looking at the data rows, it is not obvious what you can do with them beyond sums and averages. The files are too big to do a VLOOKUP in Excel against something like your production records - which are stored in files often too big to even open in Excel.
At this point, you can't begin to think how you would apply machine learning to this data. You are not quite sure what it even means. You know the data is difficult to manipulate for anything beyond recent datasets. Surely, long periods of time would be needed to extract value out of it.
You hear a cough from behind you. Your boss is back.
He says quietly and stiffly, "I'm sorry. We're going to have to hire a consultant to take this over. I know how hard you've been working. You've done some amazing things considering the limitations, and nobody appreciates that enough. But I have to show results. It will probably take a month or two to fully bring someone on board. In the meantime, just keep at it - maybe we can make a breakthrough before then."
Your heart sinks. You are convinced there is huge value in the connected device data. You feel like you could make a career out of IoT analytics if you could just figure out how to get there. But you are not a quitter.
You decide you will not go down without a fight; you will find a way.
In order to understand IoT analytics, it is helpful to separate it out and define both analytics and the IoT. This will help frame the discussion for the rest of the book.
If you ask a hundred people to define analytics, you are likely to get a hundred different answers. Each person tends to have his or her own definition in mind that can range from static reports to advanced deep learning expert systems. All tend to call any effort in this wide-ranging territory analytics, without much further explanation.
We will take a fairly broad definition in this book as we are covering quite a bit of territory. In their best-selling book Competing on Analytics, Tom Davenport and Jeanne Harris created a scale, which they called Analytics Maturity. Companies progress to higher levels in the scale as their use of analytics matures, and they begin to compete with other companies by leveraging it.
When we use the word analytics, we will mean using techniques that fall in the range from query/drill down to optimization as shown in the following chart from Competing on Analytics:
We will also take a slightly different philosophy. Unlike the notion of a company progressing through each level to get to the peak of maturity at the upper right with optimization, we will strive to reach success at all levels in parallel.
The idea of a company not being analytically mature unless it is actively employing optimization models at every turn can be dangerous. This puts pressure on a company to focus time and resources where there may not be a return on investment (ROI) for them. Since resources are always limited, this could also cause them to under-invest in projects in other areas that have a higher ROI.
The reason for the lack of ROI is often that a company simply does not have the right data to take full advantage of the more advanced techniques. This could be no fault of their own as the signal in the noise may be just too weak to tease out. This could stem from the state of technology, not yet at the point where the key predictive data can even be monitored. Or even if this is possible, it may be far too expensive to justify capturing it. We will talk about the limitations of available data quite a bit in this book. The goal will always be to maximize ROI at all levels of the maturity model.
We will also take the view that analytics maturity is about having the capability and knowing how to enable the full scale. It is not about what you are doing. It is about what you are capable of doing in order to maximize your sum total ROI across the full scale. Each level can be exploited if an opportunity is spotted. And we want there to be fertile ground for opportunities across the full scale. More about this will be covered throughout the book.
Sensors have been tracking data for decades at manufacturing plants, retail stores, and remote oil and gas equipment. Why all of a sudden is there this IoT hype all over the media?
Dramatic decreases in sensor and bandwidth costs, the spread of cellular coverage, and the rise of cloud computing have combined to create fertile conditions for easily connecting devices over the internet. For example, as shown in the following graph, Goldman Sachs predicts an average sensor cost in 2020 of under $0.40 USD, 30% of what it was in 2004. Whether all these devices should be connected or not is hotly debated:
Data source: Goldman Sachs, BI estimates
The definitions of IoT seem to vary quite a bit; some include machine sensors only, others include RFID tags and smartphones.
We will use this definition from Forrest Stroud on Webopedia:
The IoT refers to the ever-growing network of physical objects that feature an IP address for internet connectivity, and the communication that occurs between these objects and other internet-enabled devices and systems.
Or to get even more basic: stuff that talks to other stuff over the internet without requiring you to do anything. This clears it up, right?
Even the number of things projected to be connected by 2020 varies widely. Some sources project 20.8 billion devices, while others project up to 50 billion - over twice the amount.
For our purposes, we are more concerned with how to analyze the data generated than we are about the scope of devices that should be considered part of the IoT. If something sends data remotely by way of the internet, it is fair game for us, especially if it is machine-generated on one end and machine-consumed on the other.
We are more concerned with how to extract value from the data and adapt to circumstances inherent to it. IoT is not really new, as elements of it have been developing for decades. Remote detection of oil well spills was happening in the 1970s. GPS-based vehicle telematics has been around for 20 years. IoT is also not a separate market; it blends into current products and processes. Although much of the media reports on it as if it is a different animal (perhaps even the author of this book - guilty as charged?), you should not think of it this way.
The term constrained is an important concept in understanding IoT devices, data, and impacts on analytics. It refers to the limited battery power, bandwidth, and hardware capability that have to be considered in the design of IoT devices. For many IoT use cases, one or more of these has to be balanced with the need to record useful data.
There are some special challenges that come along with IoT data. The data was created by devices operating remotely, sometimes in widely varying environmental conditions that can change from day to day. The devices are often distributed widely geographically.
The data is communicated over long distances, often across different networking technologies. It is very common for data to first transmit across a wireless network, then through a type of gateway device to be sent over the public internet - which itself includes multiple different types of networking technology working together.
A company can easily have thousands to millions of IoT devices with several sensors on each unit, each sensor reporting values on a regular basis. The inflow of data can grow quite large very quickly. Since IoT devices send data on an ongoing basis, the volume of data in total can increase much faster than many companies are used to.
To demonstrate how this can happen, imagine a company that manufactures small monitoring devices. It produces 12,000 devices a year, starting in 2010 when the product was launched. Each one is tested at the end of assembly and the values reported by the sensors on the device are kept for analysis for five years. The data growth looks like the following image:
A chart showing data storage needs for production snapshot of 200 KB and 1,000 units per month. Five years of production data is kept
Now, imagine the device also had internet connectivity to track sensor values, and each one remains connected for two years. Since the data inflow continues well after the devices are built, data growth accelerates until the inflow rate stabilizes when older devices stop reporting values. This looks more like the blue area in the following chart:
Chart shows the addition of IoT data at 0.5 KB per message, 10 messages per day. Devices are connected for two years from production
In order to illustrate how large this can get, consider the following example. If you capture 10 messages per day and the message size is half of a full production snapshot, by 2017, data storage requirements would be over 1,500 times higher than production-only data.
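To make the arithmetic behind this kind of chart concrete, here is a minimal sketch that uses the figures from the two chart captions (1,000 units per month, a 200 KB snapshot kept for five years, and 10 messages per day at 0.5 KB for two years). The 30-day month and the choice to keep every IoT message forever are simplifying assumptions of ours, not something stated above, so treat the output as illustrative only.

```python
# Assumptions taken from the chart captions: 1,000 units built per month,
# a 200 KB production snapshot kept for 5 years, and 10 IoT messages per day
# at 0.5 KB each for 2 years after production.
# Our own simplifications: 30-day months, and IoT messages are never deleted.
UNITS_PER_MONTH = 1_000
SNAPSHOT_KB = 200
RETENTION_MONTHS = 60
MESSAGE_KB = 0.5
MESSAGES_PER_DAY = 10
CONNECTED_MONTHS = 24
DAYS_PER_MONTH = 30


def simulate(months):
    """Return one tuple per month:
    (month, connected devices, production KB stored, IoT KB received so far)."""
    rows = []
    iot_total_kb = 0.0
    for month in range(1, months + 1):
        # Devices built in the last CONNECTED_MONTHS months are still reporting.
        connected = UNITS_PER_MONTH * min(month, CONNECTED_MONTHS)
        # Production snapshots older than the retention window are dropped.
        production_kb = UNITS_PER_MONTH * SNAPSHOT_KB * min(month, RETENTION_MONTHS)
        iot_total_kb += connected * MESSAGES_PER_DAY * MESSAGE_KB * DAYS_PER_MONTH
        rows.append((month, connected, production_kb, iot_total_kb))
    return rows


rows = simulate(84)  # seven years of monthly figures
for month, connected, production_kb, iot_kb in rows[11::12]:  # one line per year
    print(f"month {month:3d}: {connected:6d} devices connected, "
          f"production {production_kb / 1e6:6.1f} GB, "
          f"IoT cumulative {iot_kb / 1e6:7.1f} GB")
```

Under these assumptions, the number of connected devices levels off at 24,000 after two years, while production storage stops growing once the five-year retention window is full. Changing MESSAGE_KB or MESSAGES_PER_DAY shows how sensitive the total is to device design decisions.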
For many companies, this introduces some problems. The database software, storage infrastructure, and available computing horsepower are not typically intended to handle this kind of growth. The licensing agreements with software vendors tend to be tied to the number of servers and CPU cores. Storage is handled by standard backup planning and retention policies.
The data volume rapidly leads to computing and storage requirements well beyond what can be held by a single server. It gets cost prohibitive very quickly under traditional architectures to distribute it across hundreds or thousands of servers. To do the best analytics, you need lots of historical data, and since you are unlikely to know ahead of time which data is most predictive, you have to keep as much as you can on hand.
With large-scale data, computing horsepower requirements for analytics are not very predictable and change dramatically depending on the question being asked. Analytic needs are very elastic. Traditional server planning ratchets up on-premises resources, with the number of servers needed to meet anticipated peak demand determined in advance. Doubling compute power in a short amount of time, if even possible, is very expensive.
IoT data volumes and computing resource requirements can quickly outpace all the other company data needs combined.
The only reason for time is so that everything doesn't happen at once.
– Albert Einstein
Time is very tightly tied to geographical position and the date on the calendar. The international standard way of tracking a common time is using Coordinated Universal Time (UTC). UTC is geographically tied to 0° longitude, which passes through Greenwich, England, in the UK. Although it is tied to that location, it is not the same as Greenwich Mean Time (GMT). GMT is a time zone, while UTC is a time standard. UTC does not observe Daylight Savings Time (DST):
Standard time zones of the World. Source: CIA Factbook
When data used for analytics is recorded at headquarters or a manufacturing plant, everything happens at the same place and time zone. IoT devices are spread out across the globe. Events that happen at the absolute same time do not happen at the same local time. How time is recorded affects the integrity of the resulting analytics.
When IoT devices communicate sensor data, time may be captured using the local time. It can dramatically affect analytics results if it is not clear whether local time or UTC was recorded. For example, consider an analyst working at a company that makes parking spot occupancy detection sensors. She is tasked with creating predictive models to estimate future parking lot fill rates. The time of day is likely to be a very predictive data point, so how this time is recorded makes a big difference to her. If it is unclear which standard was used, even determining whether it is night or day at the sensor location will be difficult.
This may not be apparent to the engineer creating the device. His task is to design a device that determines if the spot is open or not. He may not appreciate the importance of writing code that captures a time value that can be aggregated across multiple time zones and locations.
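One way to avoid this problem, sketched here under our own assumptions, is to record event times in UTC and keep the device's time zone as separate metadata, so the local hour of day can be derived when the analyst needs it. The `local_hour` function and the example locations are hypothetical; `zoneinfo` is part of the Python standard library from version 3.9 and relies on time zone data being available on the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def local_hour(event_utc, device_zone):
    """Return the local hour of day at the device for a UTC event time."""
    if event_utc.tzinfo is None:
        raise ValueError("event time must be timezone-aware UTC")
    return event_utc.astimezone(ZoneInfo(device_zone)).hour


# The same instant is morning at one sensor and late afternoon at another.
event = datetime(2017, 7, 1, 15, 0, tzinfo=timezone.utc)
print(local_hour(event, "America/Phoenix"))  # 8
print(local_hour(event, "Europe/Berlin"))    # 17
```

Storing UTC plus the zone name keeps records comparable across locations while still letting a model use local time of day as a feature.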
There can also be issues with clock synchronization. Devices set their internal clock to be in sync with the time standard being used. If it is local time, it could be using the wrong time zone due to a configuration error. It could also get out of sync due to a communication problem with the time standard source.
If local time is being used, daylight savings time can cause problems. How will the events that happen between 1 a.m. and 2 a.m. on the day autumn daylight savings is adjusted be recorded since that hour happens twice? Laws that determine which days mark daylight savings time can change, as they did in Turkey when DST was scrapped in September 2016. If the device is locked into a set date range at the time of manufacture, the time would be incorrect for several days out of the year after the DST dates change.
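The repeated hour can be demonstrated with a short sketch. It uses Python's standard `datetime` and `zoneinfo` modules, where the `fold` attribute distinguishes the first and second occurrence of an ambiguous local time. The date and time zone are our own illustrative choices.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

zone = ZoneInfo("America/New_York")

# 01:30 local time occurred twice on 6 November 2016, when U.S. clocks fell back.
first = datetime(2016, 11, 6, 1, 30, tzinfo=zone)           # fold=0: before the change
second = datetime(2016, 11, 6, 1, 30, fold=1, tzinfo=zone)  # fold=1: after the change

first_utc = first.astimezone(timezone.utc)
second_utc = second.astimezone(timezone.utc)
print(first_utc.isoformat())   # 2016-11-06T05:30:00+00:00
print(second_utc.isoformat())  # 2016-11-06T06:30:00+00:00
print(second_utc - first_utc)  # 1:00:00
```

A device that records only "01:30" local time gives no way to tell these two events apart, whereas the UTC values are an hour apart and unambiguous.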
How daylight savings time changes is different from country to country. In the United States, daylight savings time is changed at 02:00 local time in each time zone. In the European Union, it is coordinated so that all EU countries change at 01:00 GMT for all time zones at once. This keeps time zones always an hour apart at the expense of it changing at different local times for each time zone.
In early 2008, Central Brazil was one, two, or three hours ahead of eastern U.S., depending on the date
Source: Wikipedia commons
When time is recorded for an event, such as a parking spot being vacated, it is essential for analytics that the time is as close to the actual occurrence as possible. In practice, though, the time available for analytics can be the time the event occurred, the time the IoT device sent the data, the time the data was received, or the time the data was added to your data warehouse.
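A minimal sketch of one way to handle this follows: keep every timestamp a message picks up along the way, and have the analytics code use the one closest to the actual occurrence while remembering which one it was. The `Reading` class and its field names are hypothetical, not part of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Reading:
    """Hypothetical record keeping every timestamp a message picks up."""
    value: float
    event_time: Optional[datetime] = None     # when the sensor detected the event
    sent_time: Optional[datetime] = None      # when the device transmitted it
    received_time: Optional[datetime] = None  # when the server received it
    loaded_time: Optional[datetime] = None    # when it reached the data warehouse

    def best_time(self):
        """Return (source, timestamp) for the timestamp closest to the occurrence."""
        for source in ("event_time", "sent_time", "received_time", "loaded_time"):
            stamp = getattr(self, source)
            if stamp is not None:
                return source, stamp
        raise ValueError("reading has no timestamp at all")


# This device did not report an event or send time, so the receipt time is used.
received = datetime(2017, 3, 1, 12, 0, 5, tzinfo=timezone.utc)
reading = Reading(value=1.0, received_time=received,
                  loaded_time=received + timedelta(hours=6))
source, stamp = reading.best_time()
print(source, stamp.isoformat())
```

Keeping the source name alongside the value lets an analyst filter or flag records whose time is only an approximation of when the event happened.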
IoT devices are located in multiple geographic locations. Different areas of the world have different environmental conditions. Temperature variations can affect sensor accuracy. You could have less accurate readings in Calgary, Canada than in Cancun, Mexico, if cold impacts your device.
Elevation can affect equipment such as diesel engines. If location and elevation are not taken into consideration, you may falsely conclude from IoT sensor readings that a Denver-based fleet of delivery trucks is poorly managing fuel economy compared to a fleet in Indiana. Lots of mountain roads can burn up some fuel!
US elevation profile from LA to NYC. Source: reddit.com
Remote locations may have weaker network access. The higher data loss could cause data values for those locations to be underrepresented in the resulting analytics.
Many IoT devices are solar powered. The available battery charge can affect the frequency of data reporting. A device in Portland, Oregon, where it is often cloudy and rainy, will be more impacted than the same device in Phoenix, Arizona, where it is mostly sunny.
There are also political considerations related to the location of the IoT device. Privacy laws in Europe affect how the data from devices can be stored and what type of analytics is acceptable. You may be required to anonymize the data from certain countries, which can affect what you can do with analytics.
Constrained devices mean lossy networks. For analytics, this often results in either missing or inconsistent data. The missing data is often not random. As mentioned previously, it can be impacted by the location. Devices run on software, called firmware, which may not be consistent across locations. This could mean differences in reporting frequency or formatting of values. It can result in lost or mangled data.
Data messages from IoT devices often require the destination to know how to interpret the message being sent. Software bugs can lead to garbled messages and data records.
Messages lost in translation or never sent due to dead batteries result in missing values. The conservation of power often means not all values available on the device are sent at the same time. The resulting datasets often have missing values, as the device sends some values consistently every time it reports and sends some other values less frequently.
Analytics often requires deciding whether to fill in or ignore the missing values. Either choice may lead to a dataset that is not representative of reality.
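The following sketch shows how the two choices can give different answers for the same data. The readings are made-up values, and forward filling is just one of several possible fill strategies, used here purely for illustration.

```python
from statistics import mean

# Hypothetical hourly temperature readings; None marks a message that never arrived.
readings = [20.0, None, None, 26.0]


def drop_missing(values):
    """Ignore missing values entirely."""
    return [v for v in values if v is not None]


def forward_fill(values):
    """Replace each missing value with the last value that was observed."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        if last is not None:  # skip leading gaps that have nothing to carry forward
            filled.append(last)
    return filled


print(mean(drop_missing(readings)))  # 23.0
print(mean(forward_fill(readings)))  # 21.5
```

Neither average is necessarily right. Which one is closer to reality depends on what actually happened during the gap, which is exactly the information that is missing.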
As an example of how this can affect results, consider the case of inaccurate political poll results in recent years. Many experts believe polling is now in near crisis due to the shift of much of the world to mobile numbers as their only phone number. For pollsters, it is cheaper and easier to reach people on landline numbers. This can lead to the overrepresentation of people with landlines. These people tend to be both older and wealthier than mobile-only respondents.
The response rate has also dropped from near 80% in the 1970s to about 8% (if you are lucky) today. This makes it more difficult (and expensive) to obtain a representative sample, leading to many embarrassingly wrong poll predictions.
There can also be outside influences, such as environmental conditions, that are not captured in the data. Winter storms can lead to power failures that affect which devices are able to report back data. You may end up drawing conclusions based on a non-representative sample of data without realizing it. This can affect the results of IoT analytics, and it will not be clear why.
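A simple first check for this kind of problem is to compare how many devices actually reported against how many are deployed in each group. The sketch below assumes you know the deployed counts. The regions, counts, and the 80% threshold are all hypothetical values chosen for illustration.

```python
# Hypothetical counts: devices deployed per region and devices that reported this week.
deployed = {"Calgary": 400, "Phoenix": 400, "Portland": 200}
reported = {"Calgary": 240, "Phoenix": 392, "Portland": 150}


def coverage_by_group(deployed, reported):
    """Return the share of deployed devices in each group that actually reported."""
    return {group: reported.get(group, 0) / count
            for group, count in deployed.items() if count > 0}


def underrepresented(coverage, threshold=0.8):
    """List groups whose coverage falls below the (arbitrary) threshold."""
    return sorted(group for group, share in coverage.items() if share < threshold)


coverage = coverage_by_group(deployed, reported)
print(coverage)
print(underrepresented(coverage))  # ['Calgary', 'Portland']
```

Uneven coverage does not say why data is missing, but it is a warning that results aggregated across groups may lean toward the locations that report most reliably.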
Since connectivity is a new thing for many devices, there is also often a lack of historical data to base predictive models on. This can limit the type of analytics that can be done with the data.
It can also lead to a recency bias in datasets, as newer products are overrepresented in the data simply because a higher percentage are now a part of the IoT.
This leads us to the author's number one rule in IoT analytics:
Never trust data you don't know.
Treat it like a stranger offering you candy.
Many companies are struggling to find value with IoT data. The costs to store, process, and analyze IoT data can grow quickly. With future financial returns uncertain, some companies are questioning if it is worth the investment.
According to McKinsey & Company, a consulting agency, most IoT data is not used. From their research, less than 1% of data generated by an oil platform was used for decision-making purposes.
Finding value with IoT analytics is often like finding a diamond in a mountain of rubble. We can accept that 1% of the data has value, but which 1% is it? This can vary depending on the question. One man's worthless granite is another man's priceless diamond.
The business value challenge is how to keep costs low while increasing the ability to create superior financial returns. Analytics is a great way to get there.
In this chapter, we defined, for the purposes of this book, what constitutes the IoT. We also defined what is meant by the term analytics when we use it here. We discussed special challenges that come with IoT data from volume of data to issues with time and space that are not normally a concern with internal company datasets. You should have a good idea of the scope of the book and the challenges that you will learn to overcome in the later chapters.