Apache Hive Essentials
Apache Hive Essentials: Immerse yourself on a fantastic journey to discover the attributes of big data by using Hive

By Dayong Du
Book Feb 2015 208 pages 1st Edition
Product Details


Publication date : Feb 26, 2015
Length 208 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781783558575
Vendor : Apache

Chapter 1. Overview of Big Data and Hive

This chapter is an overview of big data and Hive, especially in the Hadoop ecosystem. It briefly introduces the evolution of big data so that readers know where they are in the big data journey and can find the areas they would like to explore in future learning. This chapter also covers how Hive has become one of the leading tools in big data warehousing and why Hive is still competitive.

In this chapter, we will cover the following topics:

  • A short history from database and data warehouse to big data

  • Introducing big data

  • Relational and NoSQL databases versus Hadoop

  • Batch, real-time, and stream processing

  • Hadoop ecosystem overview

  • Hive overview

A short history


In the 1960s, when computers became a more cost-effective option for businesses, people started to use databases to manage data. Later on, in the 1970s, relational databases became popular for business needs since they connected physical data to the logical business model easily and closely. In the next decade, around the 1980s, Structured Query Language (SQL) became the standard query language for databases. The effectiveness and simplicity of SQL motivated lots of people to use databases and brought databases closer to a wide range of users and developers. Databases soon became the usual way to build data applications and manage data, and they remained so for a long time.

Once plenty of data had been collected, people started to think about what to do with the old data. The term data warehousing emerged in the 1990s. From then on, people began to discuss how to evaluate current performance by reviewing historical data. Various data models and tools were created to help enterprises manage, transform, and analyze historical data effectively. Traditional relational databases also evolved to provide more advanced aggregation and analytic functions, as well as optimizations for data warehousing. The leading query language was still SQL, but it had become more intuitive and powerful than in previous versions. The data was still well structured and the model was normalized.

As we entered the 2000s, the Internet gradually became the leading source of data in terms of both variety and volume. Newer technologies, such as social media analytics, web mining, and data visualization, helped lots of businesses and companies deal with massive amounts of data to better understand their customers, products, competition, and markets. Data volumes grew and data formats changed faster than ever before, which forced people to search for new solutions, especially from the academic and open source communities. As a result, big data became a hot topic and a challenging field for many researchers and companies.

However, in every challenge there lies great opportunity. Hadoop was one of the open source projects that earned wide attention due to its open source license and active communities. This was one of the few times that an open source project led a change in technology trends ahead of any commercial software product. Soon after, NoSQL databases and real-time and stream computing followed and quickly became important components of the big data ecosystem. Armed with these big data technologies, companies were able to review the past, evaluate the present, and also predict the future.

Around the 2010s, time to market became the key factor in making a business competitive and successful. When it came to big data analysis, people could not wait long for reports or results. A short delay could make a great difference when making important business decisions. Decision makers wanted to see reports or results within a few hours, minutes, or in a few cases even seconds. Real-time analytical tools, such as Impala (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html), Presto (http://prestodb.io/), Storm (https://storm.apache.org/), and so on, make this possible in different ways.

Introducing big data


Big data is not simply a big volume of data. Here, the word "Big" refers to the big scope of data. A well-known way to describe big data in this domain is with three words starting with the letter V: volume, velocity, and variety. However, the analytics and data science world has seen data vary in other dimensions in addition to these three fundamental Vs, such as veracity, variability, volatility, visualization, and value. The different Vs mentioned so far are explained as follows:

  • Volume: This refers to the amount of data generated every second. At the time of writing, it is commonly estimated that 90 percent of the world's data has been created in the last two years and that the amount of data in the world doubles every two years. Such big volumes of data are mainly generated by machines, networks, social media, and sensors, and include structured, semi-structured, and unstructured data.

  • Velocity: This refers to the speed at which data is generated, stored, analyzed, and moved around. With the availability of Internet-connected devices, wireless or wired, machines and sensors can pass on their data as soon as it is created. This leads to real-time streaming and helps businesses make valuable and fast decisions.

  • Variety: This refers to the different data formats. Data used to be stored as text, dat, and csv files from sources such as filesystems, spreadsheets, and databases. This type of data, which resides in a fixed field within a record or file, is called structured data. Nowadays, data is not always in the traditional format. The newer semi-structured or unstructured forms of data come from sources such as e-mails, photos, audio, video, PDFs, SMSes, or even something we have no idea about yet. This variety of data formats creates problems for storing and analyzing data, and it is one of the major challenges we need to overcome in the big data domain.

  • Veracity: This refers to the quality of data, such as trustworthiness, biases, noise, and abnormality in data. Corrupt data is quite normal. It can originate for a number of reasons, such as typos, missing or uncommon abbreviations, data reprocessing, system failures, and so on. However, ignoring this bad data could lead to inaccurate data analysis and eventually a wrong decision. Therefore, making sure the data is correct through data auditing and correction is very important for big data analysis.

  • Variability: This refers to the changing of data. It means that the same data could have different meanings in different contexts. This is particularly important when carrying out sentiment analysis. The analysis algorithms must be able to understand the context and discover the exact meaning and value of data in that context.

  • Volatility: This refers to how long data remains valid and how long it is stored. This is particularly important for real-time analysis. It requires a target scope of data to be determined so that analysts can focus on particular questions and get good performance out of the analysis.

  • Visualization: This refers to the way of making data easy to understand. Visualization does not mean only ordinary graphs or pie charts. It makes vast amounts of data comprehensible in a multidimensional view. Visualization is an innovative way to show changes in data. It requires lots of interaction, conversation, and joint effort between big data analysts and business domain experts to make the visualization meaningful.

  • Value: This refers to the knowledge gained from data analysis on big data. The value of big data lies in how organizations turn themselves into big-data-driven companies and use the insight from big data analysis in their decision making.

In summary, big data is not just about lots of data. It is a practice of discovering new insight from existing data and guiding the analysis of future data. A big-data-driven business will be more agile and competitive in overcoming challenges and beating its competitors.

Relational and NoSQL databases versus Hadoop


Let's compare different data solutions with ways of traveling. You may be surprised to find that they have many similarities. When people travel, they take either a car or an airplane depending on the travel distance and cost. For example, when you travel to Vancouver from Toronto, an airplane is usually the first choice in terms of travel time versus cost. When you travel to Niagara Falls from Toronto, a car is usually a good choice. When you travel to Montreal from Toronto, some people may prefer taking a car to an airplane.

The distance and cost here are like the big data volume and investment. The traditional relational database is like the car in this example, and the Hadoop big data tool is like the airplane. When you deal with a small amount of data (short distance), a relational database (like the car) is usually the best choice since it is faster and more agile when dealing with a small or moderate amount of data. When you deal with a big amount of data (long distance), Hadoop (like the airplane) is the best choice since it scales more linearly and is fast and stable when dealing with large amounts of data. Conversely, you can drive from Toronto to Vancouver, but it takes too much time. You can also take an airplane from Toronto to Niagara Falls, but it could take more time and cost far more than traveling by car.

In addition, you may have the choice of taking a ship or a train. This is like a NoSQL database, which offers characteristics of both a relational database and Hadoop in terms of good performance and support for various data formats for big data.
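The trade-off behind the travel analogy can be sketched as a toy cost model. The following Python snippet is only an illustration under stated assumptions: each option has a fixed overhead (getting to the airport, or starting a distributed job) plus a cost that grows linearly with the amount of data. The class name `Option`, the helper functions, and all numbers are hypothetical and are not measurements of any real database or Hadoop cluster.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Option:
    """A data solution with a hypothetical linear cost model."""
    name: str
    fixed_overhead: float  # assumed startup cost, in arbitrary units
    cost_per_unit: float   # assumed marginal cost per unit of data

    def total_cost(self, size: float) -> float:
        # Total cost = fixed overhead + marginal cost * data size
        return self.fixed_overhead + self.cost_per_unit * size


def best_option(options: Sequence[Option], size: float) -> Option:
    """Return the option with the lowest total cost for the given data size."""
    return min(options, key=lambda o: o.total_cost(size))


def break_even(a: Option, b: Option) -> Optional[float]:
    """Data size at which a and b cost the same, or None if the lines are parallel."""
    if a.cost_per_unit == b.cost_per_unit:
        return None
    return (b.fixed_overhead - a.fixed_overhead) / (a.cost_per_unit - b.cost_per_unit)


# Hypothetical numbers chosen only to mirror the analogy:
# the "car" starts quickly but is slow per unit, the "airplane" is the opposite.
CAR = Option("relational database (car)", fixed_overhead=1.0, cost_per_unit=1.0)
PLANE = Option("Hadoop (airplane)", fixed_overhead=50.0, cost_per_unit=0.1)

if __name__ == "__main__":
    for size in (10, 100, 1000):
        choice = best_option([CAR, PLANE], size)
        print(f"data size {size}: {choice.name}")
    print("break-even size:", break_even(CAR, PLANE))
```

With these made-up numbers, the low-overhead option wins for small inputs and the high-overhead option wins once the data size passes the break-even point, which is the shape of the argument in the analogy. Real systems rarely follow a purely linear model, so this is a way to reason about the trade-off rather than a sizing tool.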

Table of Contents

17 Chapters
Apache Hive Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Overview of Big Data and Hive
2. Setting Up the Hive Environment
3. Data Definition and Description
4. Data Selection and Scope
5. Data Manipulation
6. Data Aggregation and Sampling
7. Performance Considerations
8. Extensibility Considerations
9. Security Considerations
10. Working with Other Tools
Index

