Packt+ | Advance your knowledge in tech

All Products

Best Sellers

New Releases

Books

Videos

Audiobooks

Free Learning

Data Lake for Enterprises

You're reading from Data Lake for Enterprises

Product type Book

Published in May 2017

Publisher Packt

ISBN-13 9781787281349

Pages 596 pages

Edition 1st Edition

Languages

Concepts

Data Processing

Authors (3):

Vivek Mishra

Tomcy John

Pankaj Misra

View More author details

Table of Contents (23) Chapters

Title Page

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Part 1 - Overview

Part 2 - Technical Building blocks of Data Lake

Part 3 - Bringing It All Together

Introduction to Data

Comprehensive Concepts of a Data Lake

Lambda Architecture as a Pattern for Data Lake

Applied Lambda for Data Lake

Data Acquisition of Batch Data using Apache Sqoop

Data Acquisition of Stream Data using Apache Flume

Messaging Layer using Apache Kafka

Data Processing using Apache Flink

Data Store Using Apache Hadoop

Indexed Data Store using Elasticsearch

Data Lake Components Working Together

Data Lake Use Case Suggestions

Indexing Documents

Now that we have the index created, we can indexed a document into the index. The only requirement for a document to be indexed is that it should be a JSON document as Elasticsearch is schema-less and derives the storage schema based on the document structure indexed.

As shown in the following figure, the command to index a document is PUT {index-name}/{type}/{id}:

PUT datalake/contacts/101
{
  "id":101,
  "cell":"(478) 531-2026",
  "work":”1-906-774-1226",
  "email":"vincenzo.hickle@yahoo.com"
}

Figure 24: Document Creation Query

As the document is indexed, Elasticsearch internally creates document mapping based on the data in the initial document. This mapping can be accessed as shown here:

GET datalake/_mapping/contacts

Figure 25: Retrieving Document Mapping via Query

A few observations from the preceding screenshot:

Keyword analyzer is the default analyzer used for schema-less indexing
Based on the input type, the type information in the schema has been derived
Certain default...

The rest of the chapter is locked

Register for a free Packt account to unlock a world of extra content!

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime}

Authors (3)

Vivek Mishra

Vivek Mishra is an IT professional with more than nine years of experience in various technologies like Java, J2ee, Hibernate, SCA4J, Mule, Spring, Cassandra, HBase, MongoDB, REDIS, Hive, Hadoop. He has been a contributor for open source like Apache Cassandra and lead committer for Kundera(JPA 2.0 compliant Object-Datastore Mapping Library for NoSQL Datastores like Cassandra, HBase, MongoDB and REDIS). Mr Mishra in his previous experience has enjoyed long lasting partnership with most recognizable names in SCM, Banking and finance industries, employing industry standard full software life cycle methodologies Agile and SCRUM. He is currently employed with Impetus infotech pvt. ltd. He has undertaken speaking engagements in cloud camp and Nasscom Big data seminar and is an active blogger and can be followed at mevivs.wordpress.com

Read more

See other products by Vivek Mishra

Tomcy John

Tomcy John lives in Dubai (United Arab Emirates), hailing from Kerala (India), and is an enterprise Java specialist with a degree in Engineering (B Tech) and over 14 years of experience in several industries. He's currently working as principal architect at Emirates Group IT, in their core architecture team. Prior to this, he worked with Oracle Corporation and Ernst & Young. His main specialization is in building enterprise-grade applications and he acts as chief mentor and evangelist to facilitate incorporating new technologies as corporate standards in the organization. Outside of his work, Tomcy works very closely with young developers and engineers as mentors and speaks at various forums as a technical evangelist on many topics ranging from web and middleware all the way to various persistence stores.

Read more

See other products by Tomcy John

Pankaj Misra

Pankaj Misra has been a technology evangelist, holding a bachelor's degree in engineering, with over 16 years of experience across multiple business domains and technologies. He has been working with Emirates Group IT since 2015, and has worked with various other organizations in the past. He specializes in architecting and building multi-stack solutions and implementations. He has also been a speaker at technology forums in India and has built products with scale-out architecture that support high-volume, near-real-time data processing and near-real-time analytics.

Read more

See other products by Pankaj Misra

Other recommended products

Related to this chapter

Apache Kafka 1.0 Cookbook

Apache Kafka 1.0 Cookbook

Apache Kafka is an open source stream processing platform to handle real-time data feeds. This book is a highly practical guide to help you understand the fundamentals as well as the advanced applications of Apache Kafka as an enterprise messaging service. It begins with configuring the basic Kafka APIs, and then shows you how to set up Kafka clusters and basic Kafka operations. It covers the recently released Kafka version 1.0, the Confluent Platform and Kafka Streams. By the end of this book, you will have all the knowledge you need to take your understanding of Apache Kafka to the next level and tackle any problem you might encounter while working with it.

Dec 2017 8 hours 20 minutes

Apache Kafka Quick Start Guide

Apache Kafka Quick Start Guide

Learn how to use Apache Kafka for efficient processing of distributed applications. This book focuses on programming rather than configuration management of Kafka clusters or Dev Ops. Each chapter focuses on a practical aspect and tries to avoid the tedious theoretical sections. By the end of this book, you will be familiar with solving everyday problems in fast data and processing pipelines.

Dec 2018 6 hours 12 minutes

Learning Apache Flink

Learning Apache Flink

Feb 2017 9 hours 20 minutes

Modern Big Data Processing with Hadoop

Modern Big Data Processing with Hadoop

This book presents unique techniques to conquer different Big Data processing and analytics challenges using Hadoop. Practical examples are provided to boost your understanding of Big Data concepts and their implementation. By the end of the book, you will have all the knowledge and skills you need to become a true Big Data expert.

Mar 2018 13 hours 8 minutes

Apache Hadoop 3 Quick Start Guide

Apache Hadoop 3 Quick Start Guide

Apache Hadoop is a widely used distributed data platform. It enables large datasets to be efficiently processed instead of using one large computer to store and process the data. This book will get you started with the Hadoop ecosystem, and introduce you to the main technical topics such as MapReduce, YARN and HDFS.

Oct 2018 7 hours 20 minutes

Building Data Streaming Applications with Apache Kafka

Building Data Streaming Applications with Apache Kafka

Apache Kafka is a popular distributed streaming platform which acts as a messaging queue or an enterprise messaging system. This book is a comprehensive guide on designing and architecting enterprise-grade streaming applications using Apache Kafka and other Big Data tools. Once you grasp the basics, we will take you through the more advance concepts in Apache Kafka such as capacity planning and security.

Aug 2017 9 hours 16 minutes

Hadoop 2.x Administration Cookbook

Hadoop 2.x Administration Cookbook

A practical and use case driven approach to Hadoop administration with coverage on a vast array of topics including Hadoop cluster installation, performance tuning, cluster planning, security, and much more. This book covers Hadoop from the perspective of running clusters in critical and large environments with complex data and at scale.

May 2017 11 hours 36 minutes

Practical Real-time Data Processing and Analytics

Practical Real-time Data Processing and Analytics

Real-time data processing involves continuous input, processing and output of data, with the condition that the time required for processing is as short as possible. This book covers the majority of the existing and evolving open source technology stack for real-time processing and analytics. You will get to know about all the real-time solution aspects, from the source to the presentation to persistence. Through this practical book, you’ll be equipped with a clear understanding of how to solve challenges on your own.

Sep 2017 12 hours 0 minutes

Apache Hive Essentials

Apache Hive Essentials

Apache Hive helps you deal with data summarization, queries, and analysis for huge amounts of data. This book will give you a background in big data, and familiarize you with your Hive working environment. Next you will cover advanced topics like performance and security in Hive and how to work efficiently to find solutions to big data problems.

Jun 2018 7 hours 0 minutes

Mastering Hadoop 3

Mastering Hadoop 3

This is a comprehensive guide to understand advanced concepts of Hadoop ecosystem. You will learn how Hadoop works internally, and build solutions to some of real world use cases. Finally, you will have a solid understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable Big Data pipeline

Feb 2019 18 hours 8 minutes

MySQL 8 for Big Data

MySQL 8 for Big Data

MySQL is one of the most popular relational databases in the world today, and has become a popular choice of tool to handle vast amounts of structured data - that is, structured Big Data. This book will demonstrate how you can dabble with large amounts of data using MySQL 8. It also highlights topics such as integrating MySQL 8 and a Big Data solution like Apache Hadoop using different tools like Apache Sqoop and MySQL Applier. With practical examples and use-cases, you will get a better clarity on how you can leverage the offerings of MySQL 8 to build a robust Big Data solution.

Oct 2017 9 hours 52 minutes

HBase High Performance Cookbook

HBase High Performance Cookbook

Jan 2017 11 hours 40 minutes

Personalised recommendations for you

Based on your interests and search pattern

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

Aug 2023 7 hours 40 minutes

Generative AI with LangChain

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Dec 2023 12 hours 0 minutes

Generative AI with LangChain

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Dec 2023 12 hours 0 minutes

Generative AI with LangChain

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Dec 2023 12 hours 0 minutes

Generative AI with LangChain

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Dec 2023 12 hours 0 minutes

Mastering Tableau 2023

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

Aug 2023 22 hours 48 minutes

Building AI Applications with ChatGPT APIs

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

Sep 2023 8 hours 36 minutes

Building AI Applications with ChatGPT APIs

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

Sep 2023 8 hours 36 minutes

Data Engineering with AWS

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

Oct 2023 21 hours 12 minutes

Modern Data Architecture on AWS

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

Aug 2023 14 hours 0 minutes

Practical Guide to Applied Conformal Prediction in Python

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

Dec 2023 8 hours 0 minutes

TinyML Cookbook

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

Nov 2023 22 hours 8 minutes