Besides partition, the bucket is another technique to cluster datasets into more manageable parts to optimize query performance. Different from a partition, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partition. If there is a further request to use employee_id as the third level of partition, it creates many partition directories. For instance, we can bucket the employee_partitioned table using employee_id as a bucket column. The value of this column will be hashed by a user-defined number of buckets. The records with the same employee_id will always be stored in the same bucket (segment of files). The bucket columns are defined by CLUSTERED BY keywords. It is quite different from partition columns since partition columns refer to the directory, while bucket...
- Tech Categories
- Best Sellers
- New Releases
- Books
- Videos
- Audiobooks
Tech Categories Popular Audiobooks
- Articles
- Newsletters
- Free Learning
You're reading from Apache Hive Essentials. - Second Edition
Dayong Du has all his career dedicated to enterprise data and analytics for more than 10 years, especially on enterprise use case with open source big data technology, such as Hadoop, Hive, HBase, Spark, etc. Dayong is a big data practitioner as well as author and coach. He has published the 1st and 2nd edition of Apache Hive Essential and coached lots of people who are interested to learn and use big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, co-founder of Toronto big data professional association.
Read more about Dayong Du
Unlock this book and the full library FREE for 7 days
Author (1)
Dayong Du has all his career dedicated to enterprise data and analytics for more than 10 years, especially on enterprise use case with open source big data technology, such as Hadoop, Hive, HBase, Spark, etc. Dayong is a big data practitioner as well as author and coach. He has published the 1st and 2nd edition of Apache Hive Essential and coached lots of people who are interested to learn and use big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, co-founder of Toronto big data professional association.
Read more about Dayong Du