
You're reading from  Apache Hive Essentials

Product type: Book
Published in: Feb 2015
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783558575
Edition: 1st Edition
Author (1)
Dayong Du

Dayong Du has dedicated his career of more than 10 years to enterprise data and analytics, especially enterprise use cases built on open source big data technologies such as Hadoop, Hive, HBase, and Spark. Dayong is a big data practitioner as well as an author and coach. He has published the first and second editions of Apache Hive Essentials and has coached many people who are interested in learning and using big data technology. In addition, he is a seasoned blogger, contributor, and advisor for big data start-ups, and a co-founder of the Toronto big data professional association.

Introducing big data


Big data is not simply a large volume of data. Here, the word "Big" refers to the broad scope of data. A well-known way to describe big data in this domain uses three words starting with the letter V: volume, velocity, and variety. However, the analytics and data science world has seen data vary in other dimensions beyond these fundamental three Vs, such as veracity, variability, volatility, visualization, and value. The Vs mentioned so far are explained as follows:

  • Volume: This refers to the amount of data generated every second. An often-cited estimate is that 90 percent of the world's data has been created in the last two years, and that the data in the world doubles roughly every two years. Such big volumes of data are mainly generated by machines, networks, social media, and sensors, and include structured, semi-structured, and unstructured data.

  • Velocity: This refers to the speed at which data is generated, stored, analyzed, and moved around. With the availability of Internet-connected devices, wireless or wired, machines and sensors can pass on their data as soon as it is created. This leads to real-time streaming and helps businesses make valuable decisions quickly.

  • Variety: This refers to the different data formats. Data used to be stored as text, dat, and csv files coming from sources such as filesystems, spreadsheets, and databases. This type of data, which resides in a fixed field within a record or file, is called structured data. Nowadays, data is not always in the traditional format. The newer semi-structured or unstructured forms of data come from sources such as e-mails, photos, audio, video, PDFs, SMSes, or even something we have no idea about. This variety of data formats creates problems for storing and analyzing data, and it is one of the major challenges we need to overcome in the big data domain.

  • Veracity: This refers to the quality of data, such as its trustworthiness, biases, noise, and abnormality. Corrupt data is quite normal. It can originate for a number of reasons, such as typos, missing values or uncommon abbreviations, data reprocessing, system failures, and so on. However, ignoring this corrupt data could lead to inaccurate data analysis and eventually to a wrong decision. Therefore, making sure the data is correct, through data auditing and correction, is very important for big data analysis.

  • Variability: This refers to the changing nature of data. It means that the same data could have different meanings in different contexts. This is particularly important when carrying out sentiment analysis, where the analysis algorithms need to understand the context and discover the exact meaning and values of data in that context.

  • Volatility: This refers to how long the data remains valid and how long it is stored. This is particularly important for real-time analysis. It requires a target scope of data to be determined so that analysts can focus on particular questions and get good performance out of the analysis.

  • Visualization: This refers to the way of making data well understood. Visualization does not mean ordinary graphs or pie charts. It makes vast amounts of data comprehensible in a multidimensional view that is easy to understand. Visualization is an innovative way to show changes in data. It requires lots of interaction, conversations, and joint efforts between big data analysts and business domain experts to make the visualization meaningful.

  • Value: This refers to the knowledge gained from data analysis on big data. The value of big data lies in how organizations turn themselves into big-data-driven companies and use the insight from big data analysis in their decision making.
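
The data auditing mentioned under Veracity can be sketched in a few lines of Python. This is only an illustrative sketch, not a method from the book: the record layout, the field names (`sensor_id`, `temperature_c`), and the valid temperature range are all hypothetical assumptions chosen for the example.

```python
# Hypothetical sensor readings: the field names and the valid range below
# are assumptions made for this illustration, not part of the original text.
REQUIRED_FIELDS = ("sensor_id", "temperature_c")
VALID_TEMPERATURE_C = (-50.0, 60.0)


def audit_record(record):
    """Return a list of data-quality problems found in one record."""
    problems = []

    # Missing or empty required fields are a common source of corrupt data.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Flag values that are non-numeric or outside the plausible range.
    temperature = record.get("temperature_c")
    if isinstance(temperature, (int, float)):
        low, high = VALID_TEMPERATURE_C
        if not low <= temperature <= high:
            problems.append("temperature_c out of range")
    elif temperature not in (None, ""):
        problems.append("temperature_c not numeric")

    return problems


def split_clean_and_suspect(records):
    """Separate records that pass the audit from those needing correction."""
    clean, suspect = [], []
    for record in records:
        problems = audit_record(record)
        if problems:
            suspect.append((record, problems))
        else:
            clean.append(record)
    return clean, suspect
```

Calling `split_clean_and_suspect` on a batch of records returns the clean records alongside the suspect ones and the reasons they were flagged, so suspect data can be corrected or excluded before analysis rather than silently ignored.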

In summary, big data is not just about lots of data; it is a practice of discovering new insight from existing data and guiding the analysis of future data. A big-data-driven business will be more agile and competitive in overcoming challenges and beating its competition.
