Learning Cloudera Impala
Perform interactive, real-time in-memory analytics on large amounts of data using the massive parallel processing engine Cloudera Impala with this book and ebook
In this article written by Avkash Chauhan, author of Learning Cloudera Impala, we will first learn about the various important components of Impala and then discuss the intricate details of Impala's inner workings.
Impala Core Components:
Here we will discuss the following three important components:
- Impala Daemon
- Impala Statestore
- Impala Metadata and Metastore
Putting the above components together with Hadoop and an application or command-line interface, we can conceptualize them as follows:
Impala Execution Architecture:
Essentially, Impala daemons receive queries from a variety of sources and distribute the query load to Impala daemons running on other nodes. While doing so, they interact with the Statestore for node-specific updates and access the Metastore, either stored in a centralized database or in a local cache. To complete the picture of Impala execution, we will now discuss how Impala interacts with other components, namely Hive, HDFS, and HBase.
Impala working with Apache Hive:
We have already discussed that Impala uses a centralized database as its Metastore; Hive uses the same MySQL or PostgreSQL database for the same kind of data. Impala provides the same SQL-like query interface used in Apache Hive. Because Impala and Hive share the same Metastore database, Impala can access Hive table definitions as long as the Hive tables use file formats, compression codecs, and column data types that Impala supports.
Apache Hive provides Impala with support for several file formats. When a format other than plain text is used, such as RCFile, Avro, or SequenceFile, the data must be loaded through Hive first; Impala can then query data in these file formats. Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. The ANALYZE TABLE statement in Hive generates useful table and column statistics, and Impala uses these statistics to optimize its queries.
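The Hive-first workflow described above can be sketched as a short sequence of statements. This is a minimal, hedged sketch: the table name, columns, and HDFS path below are hypothetical examples, and the statements follow standard HiveQL conventions.

```python
# A sketch of the "load through Hive first" workflow: create an
# Avro-backed table, load data into it, and compute the statistics
# that Impala's query planner can later reuse.
# Table name, columns, and HDFS path are hypothetical examples.

def hive_load_statements(table, columns, hdfs_path):
    """Return the HiveQL statements for an Avro-backed table:
    create it, load data, and generate table/column statistics."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return [
        f"CREATE TABLE {table} ({cols}) STORED AS AVRO",
        f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}",
        # Hive-generated statistics that Impala can use for planning:
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS",
    ]

stmts = hive_load_statements(
    "web_logs", [("ip", "STRING"), ("hits", "INT")], "/data/raw/logs")
for s in stmts:
    print(s + ";")
```

After the Hive side completes, Impala would typically need to be told about the new table, for example with `INVALIDATE METADATA web_logs`, before it can query the data.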
Impala working with HDFS:
Impala table data consists of regular data files stored in HDFS (Hadoop Distributed File System), and Impala uses HDFS as its primary data storage medium. As soon as a data file or a collection of files is available in the folder of a new table, Impala reads all of the files regardless of their names; new data is added in files with names controlled by Impala. HDFS provides data redundancy through its replication factor, and Impala relies on this redundancy to access data on other DataNodes in case it is not available on a specific DataNode. We have already learned that Impala also maintains information about the physical location of the blocks of data files in HDFS, which helps data access in case of node failure.
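The "drop files in the table's folder" behavior above can be sketched as a pair of steps: copy files into the table's HDFS directory under any name, then tell Impala to reload the file list for that one table. This is a sketch only, not tied to a live cluster; the warehouse path and table name are hypothetical.

```python
# A sketch of adding new data files to an existing Impala table's
# HDFS directory. REFRESH reloads the file and block metadata for a
# single table and is much cheaper than a full INVALIDATE METADATA.
# Paths and the table name are hypothetical examples.

def add_files_commands(table, local_files, table_dir):
    """Return the shell commands to copy files into a table's HDFS
    directory and make Impala pick them up."""
    cmds = [f"hdfs dfs -put {f} {table_dir}/" for f in local_files]
    cmds.append(f'impala-shell -q "REFRESH {table}"')
    return cmds

for cmd in add_files_commands(
        "sales", ["part-00.csv", "part-01.csv"],
        "/user/hive/warehouse/sales"):
    print(cmd)
```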
Impala working with HBase:
HBase is a distributed, scalable, big data storage system that provides random, real-time read and write access to data stored on HDFS. HBase is a database storage system that sits on top of HDFS; however, unlike traditional database storage systems, HBase does not provide built-in SQL support, though third-party applications can provide such functionality.
To use HBase, the user first defines tables in Impala and then maps them to the equivalent HBase tables. Once the table relationship is established, users can submit queries to the HBase tables through Impala. Moreover, join operations can be performed that combine Impala and HBase tables.
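The mapping step above is, in practice, usually expressed through Hive's HBase storage handler, which Impala can then query. The sketch below builds such a DDL statement; the table, column family, and qualifier names are hypothetical, and you should check your own cluster's conventions.

```python
# A hedged sketch of mapping Impala/Hive columns onto an existing
# HBase table via Hive's HBase storage handler. ':key' maps the first
# column to the HBase row key; the remaining columns map to
# 'family:qualifier' pairs. All names here are hypothetical.

def hbase_mapping_ddl(table, key_col, cols, hbase_table):
    """Build a CREATE EXTERNAL TABLE statement that maps SQL columns
    onto an existing HBase table."""
    col_defs = ", ".join([f"{key_col} STRING"] +
                         [f"{c} STRING" for c, _ in cols])
    mapping = ",".join([":key"] + [q for _, q in cols])
    return (
        f"CREATE EXTERNAL TABLE {table} ({col_defs}) "
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}') "
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )

print(hbase_mapping_ddl("users", "user_id",
                        [("name", "info:name"), ("city", "info:city")],
                        "hbase_users"))
```

Once such a mapped table exists, Impala queries against it, including joins with regular HDFS-backed tables, read through to HBase transparently.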
Impala Security:
Impala is designed and developed to run on top of Hadoop, so you must understand the Hadoop security model as well as the security provided by the OS on which Hadoop is running. If Hadoop is running on Linux, then Linux and Hadoop administrators can harden and tighten security, which complements the security provided by Impala. Impala 1.1 and later use the Sentry open source project to provide a detailed authorization framework for Hadoop. Impala 1.1.1 supports auditing capabilities in a cluster by creating audit data, which can be collected from all nodes and then processed for further analysis and insight.
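The Sentry-based authorization mentioned above is expressed through role and grant statements submitted via Impala. The sketch below shows the typical shape of such statements; the role, group, and table names are hypothetical examples.

```python
# A hedged sketch of Sentry-style authorization: create a role, bind
# it to an OS/LDAP group, and grant read-only access to one table.
# Role, group, and table names are hypothetical examples.

def sentry_grant_statements(role, group, table):
    """Return the statements an administrator could run through
    Impala (1.1+) to set up read-only access for a group."""
    return [
        f"CREATE ROLE {role}",
        f"GRANT ROLE {role} TO GROUP {group}",
        f"GRANT SELECT ON TABLE {table} TO ROLE {role}",
    ]

for stmt in sentry_grant_statements("analyst_role", "analysts",
                                    "web_logs"):
    print(stmt + ";")
```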
Data Visualization using Impala:
Visualizing data is as important as processing it. The human brain perceives pictures faster than it reads data in tables, which is why data visualization enables rapid understanding of large amounts of data. Reports, charts, interactive dashboards, and other forms of infographics are all part of data visualization and provide a deeper understanding of results.
To connect with third-party applications, Cloudera provides ODBC and JDBC connectors. These connectors are installed on the machines where the third-party applications run; by configuring the correct Impala server and port details in the connectors, the applications can connect to Impala, submit queries, and bring the results back. The results are then displayed in the third-party application, where they are rendered graphically for visualization, shown in table format, or processed further, depending on the application's requirements. In this section, we will cover a few notable third-party applications that can take advantage of Impala's fast query processing and display compelling graphical results.
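The connector configuration step described above usually boils down to supplying a driver name, host, and port. The sketch below assembles a DSN-less ODBC connection string; the driver name and default port follow common Cloudera ODBC driver conventions, but you should check your installed driver's documentation, and the hostname is hypothetical.

```python
# A minimal sketch of configuring an ODBC-based tool (Tableau, Excel,
# MicroStrategy, ...) to reach an Impala daemon: build a DSN-less
# connection string from driver, host, and port. The driver name and
# port are assumptions based on common Cloudera ODBC conventions.

def impala_odbc_conn_str(host, port=21050,
                         driver="Cloudera ODBC Driver for Impala"):
    """Assemble the key=value pairs an ODBC layer needs for Impala."""
    parts = {"Driver": driver, "Host": host, "Port": port}
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn = impala_odbc_conn_str("impala-node1.example.com")
print(conn)
# A tool would hand this string to its ODBC layer, e.g. with pyodbc:
#   pyodbc.connect(conn)
```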
Tableau and Impala:
Tableau Software supports Impala by providing access to Impala tables through an Impala ODBC connector. Tableau is one of the most prominent data visualization technologies today and is used daily by thousands of enterprises to extract intelligence from their data. Tableau software is available on Windows, and an ODBC connector is provided by Cloudera to make this connection possible. You can visit the link below to download the Impala connector for Tableau:
Once the Impala connector is installed and configured correctly on the machine where Tableau is running, Tableau is ready to work with Impala. In the image below, Tableau is connected to an Impala server at port 21000, with a table on Impala selected:
Once a table is selected, particular fields are chosen and the data is displayed graphically in a variety of visualizations. The screenshot below shows one example of such a visualization:
Microsoft Excel and Impala:
Microsoft Excel is one of the most widely adopted data processing applications, used by business professionals worldwide. You can connect Microsoft Excel with Impala using another ODBC connector, provided by Simba Technologies.
MicroStrategy and Impala:
MicroStrategy is another big player in data analysis and visualization software; it uses an ODBC driver to connect with Impala and render impressive visualizations. The connectivity model between MicroStrategy software and Cloudera Impala is shown below:
Zoomdata and Impala:
Zoomdata is considered a new generation of data user interface, addressing streams of data instead of sets of data. The Zoomdata processing engine performs continuous mathematical operations across data streams in real time to create visualizations on a multitude of devices. The visualization updates itself as new data arrives and is recomputed by Zoomdata.
As shown in the image below, the Zoomdata application uses Impala as a data source, configured underneath to use one of the available connectors to connect with Impala:
Once the connection is made, users can see data visualizations such as the one shown below:
Real-time Query with Impala on Hadoop:
Impala is marketed by its developer, Cloudera, as a product that can perform "real-time queries on Hadoop". Impala is an open source implementation based on Google's Dremel technology, available free for anyone to use. Impala is available as a packaged product, free to use, or it can be compiled from source. It runs queries in memory to make them real-time, and in some cases, depending on the type of data, using the Parquet file format as the input data source can speed up query processing many times over.
Real-time query subscription with Impala:
Cloudera provides a Real-time Query (RTQ) subscription as an add-on to the Cloudera Enterprise subscription. You can still use Impala as a free open source product; however, the RTQ subscription lets you take advantage of Cloudera's paid service to extend Impala's usability and resilience. With an RTQ subscription, you not only gain access to Cloudera technical support but can also work with the Impala development team, providing feedback to shape the product's design and implementation.
This concludes our discussion of the key components of Impala and their inner workings.
About the Author:
Avkash Chauhan is a software technology veteran with more than 12 years of industry experience in various disciplines such as embedded engineering, cloud computing, big data analytics, data processing, and data visualization. He has extensive global work experience with Fortune 100 companies worldwide. He spent eight years at Microsoft before moving on to Silicon Valley to work with a big data and analytics start-up. He started his career as an embedded engineer, and during his eight-year stint at Microsoft, he worked on Windows CE, Windows Phone, Windows Azure, and HDInsight. He spent several years working with the Windows Azure team to develop world-class cloud technology, and his last project was Apache Hadoop on Windows Azure, also known as HDInsight. He worked on the HDInsight project from its incubation at Microsoft, helping with its early development and subsequent deployment on the cloud. For the past three years, he has been working on big data and Hadoop-related technologies, developing applications to make Hadoop easy to use for large and mid-market companies. He is a prolific blogger and very active on social networking sites. You can contact him directly through the following: