Data Lake Development with Big Data

Explore architectural approaches to building Data Lakes that ingest, index, manage, and analyze massive amounts of data using Big Data technologies

Product type: Book
Published: Nov 2015
ISBN-13: 9781785888083
Pages: 164
Edition: 1st Edition

Index

A

  • Apache Atlas
    • about / Apache Atlas
    • working / Understanding how Atlas works
    • use case scenarios / Use case scenarios for Atlas
    • reference / Use case scenarios for Atlas
  • Apache Falcon
    • about / Apache Falcon
    • working / Understanding how Falcon works
    • use case scenarios / Use case scenarios for Falcon
    • reference / Use case scenarios for Falcon
  • Apache Flume
    • about / Apache Flume
    • use case scenarios / Use case scenarios for Flume
    • reference / Use case scenarios for Flume
  • architectural guidance
    • about / Architectural guidance
    • Data Discovery / Data Discovery
    • Data Provisioning / Data Provisioning
  • architectural guidance, Intake tier
    • about / Architectural guidance
    • structured data use cases / Structured data use cases
    • semi-structured use cases / Semi-structured and unstructured data use cases
    • unstructured data use cases / Semi-structured and unstructured data use cases
    • Big Data tools / Big Data tools and technologies
  • architecture, Data Lake
    • considerations / Architectural considerations
    • composition / Architectural composition
    • layers / Architectural composition, Architectural details
    • tiers / Architectural composition, Understanding Data Lake tiers

B

  • Big Data tools
    • about / Big Data tools and technologies
    • Syncsort / Syncsort
    • Talend / Talend
    • Pentaho / Pentaho
  • Big Data tools, for ingesting streaming data
    • about / Ingestion of streaming data
    • Apache Flume / Apache Flume
    • Fluentd / Fluentd
    • Kafka / Kafka
    • Amazon Kinesis / Amazon Kinesis
    • Apache Storm / Apache Storm
  • Big Data tools, for ingesting structured data
    • about / Ingestion of structured data
    • Sqoop / Sqoop
    • WebHDFS / WebHDFS

C

  • Customer Relationship Management (CRM) / Practical Data Integration scenarios

D

  • Data as a Service (DaaS) / When to go for a Data Lake implementation
  • data classification
    • about / Data classification
    • unstructured data, classifying / Classifying unstructured data
    • applications / Applications of data classification
  • data cleansing, Data Integration process
    • about / Data cleansing
    • missing values, deleting / Deletion of missing, null, or invalid values
    • null values, deleting / Deletion of missing, null, or invalid values
    • invalid values, deleting / Deletion of missing, null, or invalid values
    • invalid values, imputation / Imputation of missing, null, or invalid values
    • missing values, imputation / Imputation of missing, null, or invalid values
    • null values, imputation / Imputation of missing, null, or invalid values
  • Data Consumption
    • traditional, versus Data Lake / Data Consumption – Traditional versus Data Lake
    • about / An introduction to Data Consumption
    • data owner / An introduction to Data Consumption
    • data steward / An introduction to Data Consumption
    • data consumer / An introduction to Data Consumption
    • processes, applying / An introduction to Data Consumption
    • practical scenarios / Practical Data Consumption scenarios
  • Data Consumption tier
    • about / The Data Consumption tier, Understanding the Data Consumption tier
    • Data Discovery Zone / The Data Discovery Zone
    • Data Provisioning Zone / The Data Provisioning Zone
  • Data Discovery
    • about / Data Discovery and metadata
    • enabling / Enabling Data Discovery
    • data classification / Data classification
    • relation extraction / Relation extraction
    • indexing / Indexing data
    • performing / Performing Data Discovery
    • semantic search / Semantic search
    • faceted search / Faceted search
    • fuzzy search / Fuzzy search
    • Big Data tools / Big Data tools and technologies
  • Data Dispatch
    • about / Data Dispatch
    • use case scenarios / Use case scenarios for Data Dispatch
    • reference / Use case scenarios for Data Dispatch
  • data governance
    • about / An introduction to Data Consumption
  • Data Governance
    • about / Understanding Data Governance
    • features / Introduction to Data Governance
    • need for / The need for Data Governance
    • Big Data, governing / Governing Big Data in the Data Lake
    • traditional, versus Data Lake / Data Governance – Traditional versus Data Lake
    • practical scenarios / Practical Data Governance scenarios
    • components / Data Governance components
    • metadata management / Metadata management and lineage tracking
    • data security / Data security and privacy
    • data privacy / Data security and privacy
    • Information Lifecycle Management / Information Lifecycle Management
    • architectural guidance / Architectural guidance
    • Big Data tools / Big Data tools and technologies
  • Data Governance and Security layer
    • about / The Data Governance and Security Layer
  • data ingest parallelism
    • about / Data ingest parallelism
    • limitations, addressing with Data Lake / Addressing the limitations using Data Lake
  • Data Intake tier
    • about / The Data Intake tier
    • Source System Zone / The Source System Zone
    • Transient Zone / The Transient Zone
    • Raw Zone / The Raw Zone
    • Batch Raw Storage / Batch Raw Storage
    • real-time Raw Storage / The real-time Raw Storage
  • Data Integration
    • about / Understanding Data Integration
    • architecture / Introduction to Data Integration
    • features / Prominent features of Data Integration
    • practical scenarios / Practical Data Integration scenarios
    • working / The workings of Data Integration
  • Data Integration process
    • steps / The workings of Data Integration
    • Raw data discovery / Raw data discovery
    • data quality assessment / Data quality assessment
    • data cleansing / Data cleansing
    • data transformations / Data transformations
    • data enrichment / Data enrichment
    • metadata, collecting / Collect metadata and track data lineage
    • data lineage, tracking / Collect metadata and track data lineage
  • Data Integrity checks, Transient Landing Zone
    • about / Data Integrity checks
    • record counts, checking / Checking record counts
    • column counts, checking for / Checking for column counts
    • schema validation checks / Schema validation checks
  • Data Lake
    • evolution / Before the Data Lake
    • about / Before the Data Lake
    • need for / Need for Data Lake
    • defining / Defining Data Lake
    • key benefits / Key benefits of Data Lake
    • challenges / Challenges in implementing a Data Lake
    • recommendations / When to go for a Data Lake implementation
    • architecture / Data Lake architecture
    • current and future trends / The current and future trends
    • future enterprise trajectories / Data Lake and future enterprise trajectories
    • future technologies / Future Data Lake technologies
  • Data Lake layers
    • about / Architectural composition, Understanding Data Lake layers
    • Data Governance and Security layer / The Data Governance and Security Layer
    • Information Lifecycle Management layer / The Information Lifecycle Management layer
    • Metadata layer / The Metadata Layer
  • Data Lake tiers
    • about / Architectural composition, Understanding Data Lake tiers
    • Data Intake tier / The Data Intake tier
    • Data Management tier / The Data Management tier
    • Data Consumption tier / The Data Consumption tier
  • data lineage processes, Raw Storage Zone
    • about / Data lineage processes
    • Watermarking process / Watermarking process
    • metadata capture / Metadata capture
  • Data Management tier
    • about / The Data Management tier
    • Integration Zone / The Integration Zone
    • Enrichment Zone / The Enrichment Zone
    • Data Hub Zone / The Data Hub Zone
  • Data Management Tier
    • about / Introduction to the Data Management Tier
  • data partitioning
    • about / Data partitioning
    • limitations, addressing with Data Lake / Addressing the limitations using Data Lake
  • data pipelines
    • about / Data pipelines
    • limitations, addressing with Data Lake / Addressing the limitations using Data Lake
  • Data Provisioning
    • about / An introduction to Data Consumption, Data Provisioning and metadata
    • metadata / Data Provisioning and metadata
    • data publication / Data publication
    • data subscription / Data subscription
    • functionalities / Data Provisioning functionalities
    • data formatting / Data formatting
    • data selection / Data selection
    • approaches / Data Provisioning approaches
    • post-provisioning processes / Post-provisioning processes
    • Big Data tools / Big Data tools and technologies
  • data quality assessment, Data Integration process
    • about / Data quality assessment
    • data profiling / Profiling the data
  • data security and privacy
    • about / Data security and privacy
    • Big Data implications / Big Data implications for security and privacy
  • data transformations, Data Integration process
    • about / Data transformations
    • unstructured text transformation techniques / Unstructured text transformation techniques
    • structured data transformations / Structured data transformations
  • deduplication process, file duplication checks
    • file names, comparing / File duplication checks
    • schema, comparing / File duplication checks
    • content, comparing / File duplication checks
  • deep Integrity checks, Raw Storage Zone
    • about / Deep Integrity checks
    • bit level Integrity checks / Bit Level Integrity checks
    • periodic checksum checks / Periodic checksum checks
  • Domain Specific Languages (DSL)
    • inference / Applications of relation extraction

E

  • Elasticsearch
    • about / Elasticsearch
    • use case scenarios / Use case scenarios for Elasticsearch
    • reference / Use case scenarios for Elasticsearch
  • Electronic Health Records (EHR)
    • implementing / Understanding how semantic technologies work
  • Enterprise Resource Planning (ERP) / Practical Data Integration scenarios
  • extensibility
    • about / Extensibility
    • limitations, addressing / Addressing the limitations using Data Lake
  • External Data Sources / Understanding Intake tier zones
  • extract transform load (ETL) / Batch Raw Storage

F

  • faceted search
    • about / Faceted search
  • feature-based methods
    • about / Feature-based methods
    • working / Understanding how feature-based methods work
    • implementing / Implementation
  • features, Data Integration
    • loosely coupled integration / Loosely coupled Integration
    • user friendly / Ease of use
    • secure access / Secure access
    • high-quality data / High-quality data
    • lineage tracking / Lineage tracking
  • file validation checks, Transient Landing Zone
    • about / File validation checks
    • file duplication checks / File duplication checks
    • deduplication process / File duplication checks
    • file integrity checks / File integrity checks
    • file size checks / File size checks
    • file periodicity checks / File periodicity checks
  • Fluentd
    • about / Fluentd
    • use case scenarios / Use case scenarios for Fluentd
    • reference / Use case scenarios for Fluentd
  • fuzzy search
    • about / Fuzzy search
    • edit distance / Edit distance
    • wildcard / Wildcard and regular expressions
    • regular expressions / Wildcard and regular expressions

H

  • high-performance tier / The Management Tier

I

  • IBM Big Data platform
    • about / IBM Big Data platform
    • working / Understanding how governance is provided in IBM Big Data platform
    • use case scenarios / Use case scenarios for IBM Big Data platform
    • reference / Use case scenarios for IBM Big Data platform
  • IBM InfoSphere Data Explorer
    • about / IBM InfoSphere Data Explorer
    • use case scenarios / Use case scenarios for IBM InfoSphere Data Explorer
    • reference / Use case scenarios for IBM InfoSphere Data Explorer
  • IBM InfoSphere Data Replication / Understanding how governance is provided in IBM Big Data platform
  • IBM InfoSphere Federation Server / Understanding how governance is provided in IBM Big Data platform
  • IBM InfoSphere Guardium / Understanding how governance is provided in IBM Big Data platform
  • IBM InfoSphere Information Server / Understanding how governance is provided in IBM Big Data platform
  • IBM InfoSphere Master Data Management / Understanding how governance is provided in IBM Big Data platform
  • IBM InfoSphere Optim / Understanding how governance is provided in IBM Big Data platform
  • imputation techniques
    • about / Imputation of missing, null, or invalid values
    • mean value imputation / Imputation of missing, null, or invalid values
    • median value imputation / Imputation of missing, null, or invalid values
    • mode value imputation / Imputation of missing, null, or invalid values
    • prediction model imputation / Imputation of missing, null, or invalid values
  • Incremental Data Load, structured data
    • time stamps / Structured data loading approaches
    • partitioning / Structured data loading approaches
    • change tables / Structured data loading approaches
    • triggers / Structured data loading approaches
  • indexing
    • about / Indexing data
    • inverted index / Inverted index
    • applications / Applications of Indexing
  • Information Lifecycle Management (ILM) layer / The Information Lifecycle Management layer
  • Information Lifecycle Management
    • about / Information Lifecycle Management
    • Big Data implications / Big Data implications for ILM
    • implementing / Implementing ILM using Data Lake
    • Intake Tier / The Intake Tier
    • Management Tier / The Management Tier
    • Consumption Tier / The Consumption Tier
  • Information Lifecycle Management layer
    • about / The Information Lifecycle Management layer
  • Intake Processing / Understanding connectivity processing
  • Intake Processing, for data variety
    • about / Understanding Intake Processing for data variety
    • structured data / Structured data
    • semi-structured data / Semi-structured data
    • unstructured data / Unstructured data
  • Intake tier
    • practical data ingestion scenarios / Practical Data Ingestion scenarios
    • architectural guidance / Architectural guidance
  • Intake tier zones
    • about / Understanding Intake tier zones
    • Source System Zone / Understanding Intake tier zones, Source System Zone functionalities
    • Transient Landing Zone / Understanding Intake tier zones, Transient Landing Zone functionalities
    • Raw Storage Zone / Understanding Intake tier zones, Raw Storage Zone functionalities
  • inverted index
    • about / Inverted index
    • working / Understanding how inverted index works
    • implementing / Implementation

K

  • Kafka
    • about / Kafka
    • use case scenarios / Use case scenarios for Kafka
    • reference / Use case scenarios for Kafka
  • Kinesis
    • about / Amazon Kinesis
    • use case scenarios / Use case scenarios for Kinesis
    • reference / Use case scenarios for Kinesis

L

  • Lambda Architecture (LA) / Data Lake and future enterprise trajectories
  • Latent Semantic Analysis (LSA) / Applications of Indexing
  • Lineage Tracking / Data lineage processes
  • Longest Common Subsequence (LCS) / Edit distance
  • low-cost storage tier / The Management Tier

M

  • Massively Parallel Processing (MPP) system / Scale on demand
  • Master Data Management (MDM) / Defining Data Lake, Introduction to Data Governance
  • Metadata layer
    • about / The Metadata Layer
  • Mirroring / Periodic checksum checks

O

  • offline archive tier / The Management Tier
  • Online Analytical Processing (OLAP) systems / Before the Data Lake
  • online archive tier / The Management Tier
  • Online Transaction Processing (OLTP) systems / Before the Data Lake

P

  • partition / Addressing the limitations using Data Lake
  • Pentaho
    • about / Pentaho
    • use case scenarios / Pentaho, Use case scenarios for Pentaho
  • Principal Component Analysis (PCA) / Structured data transformations

R

  • Raw Storage Zone
    • about / Raw Storage Zone functionalities
    • functionalities / Raw Storage Zone functionalities
    • data lineage processes / Data lineage processes
    • deep integrity checks / Deep Integrity checks
    • security and governance / Security and governance
    • Information Lifecycle Management / Information Lifecycle Management
  • relation extraction
    • about / Relation extraction
    • relationships, extracting from unstructured data / Extracting relationships from unstructured data
    • relationships, extracting from structured data / Extracting Relationships from structured data
    • applications / Applications of relation extraction
  • relationships, extracting from unstructured data
    • about / Extracting relationships from unstructured data
    • feature-based methods / Feature-based methods
    • semantic technologies / Semantic technologies
  • Research Data Management (RDM) / Defining Data Lake

S

  • Scale on demand
    • about / Scale on demand
    • limitations, addressing with Data Lake / Addressing the limitations using Data Lake
  • security issues, in Data Lake tiers
    • about / Security issues in the Data Lake tiers
    • Intake Tier / The Intake Tier
    • Management Tier / The Management Tier
    • Consumption Tier / The Consumption Tier
  • semantic search
    • about / Semantic search
    • word sense disambiguation / Word sense disambiguation
    • Latent Semantic Analysis (LSA) / Latent Semantic Analysis
  • semantic technologies / Understanding how semantic technologies work
    • about / Semantic technologies
    • working / Understanding how semantic technologies work
    • Web Ontology Language (OWL) / Understanding how semantic technologies work
    • SPARQL Protocol And RDF Query Language (SPARQL) / Understanding how semantic technologies work
    • inference / Understanding how semantic technologies work
    • implementing / Implementation
  • semi-structured data
    • about / Semi-structured data
    • need for integrating, in Data Lake / The need for integrating semi-structured data in the Data Lake
    • loading approaches / Semi-structured data loading approaches
  • Source System Zone
    • about / Understanding Intake tier zones
    • functionalities / Source System Zone functionalities
    • connectivity processing / Understanding connectivity processing
    • data variety / Understanding Intake Processing for data variety
  • Splunk
    • about / Splunk
    • use case scenarios / Use case scenarios for Splunk
    • reference / Use case scenarios for Splunk
  • Sqoop
    • about / Sqoop
    • use case scenarios / Use case scenarios for Sqoop
    • reference / Use case scenarios for Sqoop
  • Storm
    • about / Apache Storm
    • use case scenarios / Use case scenarios for Storm
    • reference / Use case scenarios for Storm
  • structured data
    • about / Structured data
    • examples / Structured data
    • need for integrating, in Data Lake / The need for integrating Structured Data in the Data Lake
    • loading approaches / Structured data loading approaches
    • Full Data Load / Structured data loading approaches
    • Incremental Data Load / Structured data loading approaches
  • structured data transformations
    • about / Structured data transformations
    • attribute-level structured data transformations / Structured data transformations
    • table-level structured data transformations / Structured data transformations
  • Supply Chain Management (SCM) / Practical Data Integration scenarios
  • Support Vector Machine (SVM) / Structured data transformations
  • Syncsort
    • use case scenarios / Use case scenarios for Syncsort

T

  • Tableau
    • about / Tableau
    • use case scenarios / Use case scenarios for Tableau
    • reference / Use case scenarios for Tableau
  • Talend
    • about / Talend
    • use case scenarios / Use case scenarios for Talend
  • traditional data integration, versus Data Lake
    • about / Traditional Data Integration versus Data Lake
    • data pipelines / Data pipelines
    • data partitioning / Data partitioning
    • scale on demand / Scale on demand
    • data ingest parallelism / Data ingest parallelism
    • extensibility / Extensibility
  • traditional data warehouse (DW) systems / Need for Data Lake
  • Transient Landing Zone
    • about / Transient Landing Zone functionalities
    • functionalities / Transient Landing Zone functionalities
    • file validation checks / File validation checks
    • Data Integrity checks / Data Integrity checks

U

  • unstructured data
    • about / Unstructured data
    • examples / Unstructured data
    • need for integrating, in Data Lake / The need for integrating Unstructured data in the Data Lake
    • loading approaches / Unstructured data loading approaches
  • unstructured data, classifying
    • about / Classifying unstructured data
    • named entity recognition / Named entity recognition
    • Conditional Random Fields (CRF) / Named entity recognition
    • Maximum Entropy Markov Models (MEMM) / Named entity recognition
    • Hidden Markov Models (HMM) / Named entity recognition
    • topic models / Topic modeling
    • Latent Dirichlet Allocation (LDA) / Topic modeling
    • Hierarchical Dirichlet Process (HDP) / Topic modeling
    • text clustering / Text clustering

W

  • WebHDFS
    • use case scenarios / Use case scenarios for WebHDFS
    • reference / Use case scenarios for WebHDFS