You're reading from HBase Administration Cookbook
There are several ways to move data into HBase:
Using the HBase Put API
Using the HBase bulk load tool
Using a customized MapReduce job
The HBase Put API is the most straightforward method, and its usage is easy to learn. For most situations, however, it is not the most efficient method. This is especially true when a large amount of data needs to be transferred into HBase within a limited time period. The volume of data to handle is usually huge, which is probably why you are using HBase rather than another database. You have to think carefully about how to move all that data into HBase at the beginning of your HBase project; otherwise, you might run into serious performance problems.
HBase has the bulk load feature to support loading of huge volumes of data efficiently into HBase. The bulk load feature uses a MapReduce job to load data into a specific HBase table by generating HBase's internal HFile data format files and then loading the data files...
The most common case of data migration is probably importing data from an existing RDBMS into HBase. For this kind of task, the simplest and most straightforward way is to fetch the data from a single client and then put it into HBase, using the HBase Put API. It works well as long as there is not too much data to transfer.
This recipe describes importing data from MySQL into HBase using its Put API. All the operations will be executed on a single client. MapReduce is not included in this recipe. This recipe leads you through creating an HBase table via HBase Shell, connecting to the cluster from Java, and then putting data into HBase.
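The single-client import loop can be sketched in Java as follows. The table name (hly_temp), the column family and qualifier (n:v01), and the comma-separated input layout are assumptions for illustration; adapt them to your own schema. The code uses the classic hbase-client API of this book's era (HTable, Put) and requires a running cluster, so treat it as a sketch rather than a drop-in program:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutImporter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "hly_temp");      // hypothetical table name
    table.setAutoFlush(false);                        // buffer puts client-side for better throughput
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split(",");              // assumed layout: rowkey,value
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.add(Bytes.toBytes("n"), Bytes.toBytes("v01"), Bytes.toBytes(fields[1]));
      table.put(put);
    }
    table.flushCommits();                             // flush any buffered puts
    table.close();
    in.close();
  }
}
```

Disabling auto-flush and flushing at the end batches many puts into fewer RPCs, which matters most in exactly this single-client scenario.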
Public data sets are an ideal data source to practice HBase data migration. There are many public data sets available on the internet. We will use NOAA's 1981-2010 Climate Normals public data set in this book. You can access it at http://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/.
This is climate statistics...
HBase has an importtsv tool to support importing data from TSV files into HBase. Using this tool to load text data into HBase is very efficient, because it runs a MapReduce job to perform the import. Even if you are going to load data from an existing RDBMS, you can dump the data into a text file somehow and then use importtsv to import the dumped data into HBase. This approach works well when importing a huge amount of data, as dumping data is much faster than executing SQL against the RDBMS.
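A typical importtsv invocation looks like the following. The table name (hly_temp), column mapping, and HDFS input path are placeholders for illustration; the -Dimporttsv.columns property maps TSV fields to the row key and to column family:qualifier pairs. The command assumes HBASE_HOME is set and the cluster is running:

```sh
$ hadoop jar ${HBASE_HOME}/hbase-*.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,n:v01 \
  hly_temp \
  /user/hac/input/tsv-files
```

By default the tool expects tab-separated fields; if your dump uses a different delimiter, you can override it with the -Dimporttsv.separator property.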
The importtsv tool not only loads data directly into an HBase table, but also supports generating HBase internal format (HFile) files, so that you can use the HBase bulk load tool to load the generated files directly into a running HBase cluster. This way, you reduce both the network traffic generated by the data transfers and the load on your HBase cluster during the migration.
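The two-step bulk load variant can be sketched as follows; the table name, column mapping, and HDFS paths are again placeholders. Adding -Dimporttsv.bulk.output makes importtsv write HFiles to the given directory instead of putting data into the table, and completebulkload then moves those files into the running cluster:

```sh
# Step 1: generate HFiles rather than writing to the table directly
$ hadoop jar ${HBASE_HOME}/hbase-*.jar importtsv \
  -Dimporttsv.bulk.output=/user/hac/output/hfiles \
  -Dimporttsv.columns=HBASE_ROW_KEY,n:v01 \
  hly_temp \
  /user/hac/input/tsv-files

# Step 2: load the generated HFiles into the target table
$ hadoop jar ${HBASE_HOME}/hbase-*.jar completebulkload \
  /user/hac/output/hfiles \
  hly_temp
```

The target table must already exist before both steps, and step 2 is fast because HFiles are moved into place rather than re-written through the normal write path.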
This recipe describes usage of the importtsv and bulk load tools. We first demonstrate loading...
Although the importtsv tool is very useful for loading text files into HBase, in many cases you may want to write your own MapReduce job to import data into HBase, for full control of the loading process. For example, the importtsv tool does not work if you are going to load files in other formats.
HBase provides TableOutputFormat for writing data into an HBase table from a MapReduce job. You can also generate HBase's internal HFile format files in your MapReduce job by using the HFileOutputFormat class, and then load the generated files into a running HBase cluster, using the completebulkload tool we described in the previous recipe.
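A map-only job writing to HBase via TableOutputFormat can be sketched as follows. The table name (hly_temp), family/qualifier (n:v01), and the comma-separated input layout are assumptions for illustration; the mapper emits Put objects, which TableOutputFormat writes to the configured table. This requires the HBase and Hadoop libraries and a running cluster:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ImportJob {
  static class ImportMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");   // assumed layout: rowkey,value
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.add(Bytes.toBytes("n"), Bytes.toBytes("v01"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(Bytes.toBytes(fields[0])), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "hly_temp"); // hypothetical target table
    Job job = Job.getInstance(conf, "import-to-hbase");
    job.setJarByClass(ImportJob.class);
    job.setMapperClass(ImportMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setNumReduceTasks(0);                             // map-only: no reduce phase needed
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Setting the reduce task count to zero is the key design choice here: each mapper writes its puts directly, so no shuffle or sort is paid for.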
In this recipe, we will explain the steps for loading data using your own MapReduce job. We will first describe how to use TableOutputFormat
. In the There's more... section, we will explain how to generate HFile format files in a MapReduce job.
Each HBase row belongs to a particular region. A region holds a range of sorted HBase rows. Regions are deployed and managed by a region server.
When we create a table in HBase, the table starts with a single region, and all data inserted into the table goes to that single region at first. As data keeps being inserted and the region reaches a threshold size, the region is split into two halves. This is called region splitting. Split regions are distributed to other region servers, so that the load can be balanced across the cluster.
As you can imagine, if we initialize the table with precreated regions, using an appropriate splitting algorithm, the load of the data migration will be balanced over the entire cluster, which increases the data load speed significantly.
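One common approach, similar in spirit to the HexStringSplit algorithm of HBase's RegionSplitter utility (this sketch is a simplified stand-in, not HBase's exact implementation), is to divide a fixed hexadecimal keyspace evenly. The following plain-Java sketch computes the split keys you would pass to table creation:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class SplitKeys {
    // Compute (numRegions - 1) evenly spaced 8-character hex split keys
    // over the keyspace 00000000..ffffffff, so each precreated region
    // covers an equal slice of the row-key space.
    public static List<String> hexSplits(int numRegions) {
        BigInteger range = BigInteger.ONE.shiftLeft(32); // 2^32 possible keys
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger key = range.multiply(BigInteger.valueOf(i))
                                  .divide(BigInteger.valueOf(numRegions));
            splits.add(String.format("%08x", key)); // zero-padded 8-char hex
        }
        return splits;
    }

    public static void main(String[] args) {
        // For 4 regions we need 3 boundaries
        System.out.println(hexSplits(4)); // → [40000000, 80000000, c0000000]
    }
}
```

The resulting boundaries can then be supplied when creating the table, for example via the SPLITS option in HBase Shell or the createTable overload that accepts split keys; this works well only if your row keys are in fact spread over that hex keyspace, for instance because they are hashed.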
We will describe how to create a table with precreated regions in this recipe.