Index
A
- Apache Atlas
- about / Apache Atlas
- working / Understanding how Atlas works
- use case scenarios / Use case scenarios for Atlas
- reference / Use case scenarios for Atlas
- Apache Falcon
- about / Apache Falcon
- working / Understanding how Falcon works
- use case scenarios / Use case scenarios for Falcon
- reference / Use case scenarios for Falcon
- Apache Flume
- about / Apache Flume
- use case scenarios / Use case scenarios for Flume
- reference / Use case scenarios for Flume
- architectural guidance
- about / Architectural guidance
- Data Discovery / Data Discovery
- Data Provisioning / Data Provisioning
- architectural guidance, Intake tier
- about / Architectural guidance
- structured data use cases / Structured data use cases
- semi-structured use cases / Semi-structured and unstructured data use cases
- unstructured data use cases / Semi-structured and unstructured data use cases
- Big Data tools / Big Data tools and technologies
- architecture, Data Lake
- considerations / Architectural considerations
- composition / Architectural composition
- layers / Architectural composition, Architectural details
- tiers / Architectural composition, Understanding Data Lake tiers
B
- Big Data tools
- about / Big Data tools and technologies
- Syncsort / Syncsort
- Talend / Talend
- Pentaho / Pentaho
- Big Data tools, for ingesting streaming data
- about / Ingestion of streaming data
- Apache Flume / Apache Flume
- Fluentd / Fluentd
- Kafka / Kafka
- Amazon Kinesis / Amazon Kinesis
- Apache Storm / Apache Storm
- Big Data tools, for ingesting structured data
- about / Ingestion of structured data
- Sqoop / Sqoop
- WebHDFS / WebHDFS
C
- Customer Relationship Management (CRM) / Practical Data Integration scenarios
D
- Data as a Service (DaaS) / When to go for a Data Lake implementation
- data classification
- about / Data classification
- unstructured data, classifying / Classifying unstructured data
- applications / Applications of data classification
- data cleansing, Data Integration process
- about / Data cleansing
- missing values, deleting / Deletion of missing, null, or invalid values
- null values, deleting / Deletion of missing, null, or invalid values
- invalid values, deleting / Deletion of missing, null, or invalid values
- invalid values, imputation / Imputation of missing, null, or invalid values
- missing values, imputation / Imputation of missing, null, or invalid values
- null values, imputation / Imputation of missing, null, or invalid values
- Data Consumption
- traditional, versus Data Lake / Data Consumption – Traditional versus Data Lake
- about / An introduction to Data Consumption
- data owner / An introduction to Data Consumption
- data steward / An introduction to Data Consumption
- data consumer / An introduction to Data Consumption
- processes, applying / An introduction to Data Consumption
- practical scenarios / Practical Data Consumption scenarios
- Data Consumption tier
- about / The Data Consumption tier, Understanding the Data Consumption tier
- Data Discovery Zone / The Data Discovery Zone
- Data Provisioning Zone / The Data Provisioning Zone
- Data Discovery
- about / Data Discovery and metadata
- enabling / Enabling Data Discovery
- data classification / Data classification
- relation extraction / Relation extraction
- indexing / Indexing data
- performing / Performing Data Discovery
- semantic search / Semantic search
- faceted search / Faceted search
- fuzzy search / Fuzzy search
- Big Data tools / Big Data tools and technologies
- Data Dispatch
- about / Data Dispatch
- use case scenarios / Use case scenarios for Data Dispatch
- reference / Use case scenarios for Data Dispatch
- data governance
- about / An introduction to Data Consumption
- Data Governance
- about / Understanding Data Governance
- features / Introduction to Data Governance
- need for / The need for Data Governance
- Big Data, governing / Governing Big Data in the Data Lake
- traditional, versus Data Lake / Data Governance – Traditional versus Data Lake
- practical scenarios / Practical Data Governance scenarios
- components / Data Governance components
- metadata management / Metadata management and lineage tracking
- data security / Data security and privacy
- data privacy / Data security and privacy
- Information Lifecycle Management / Information Lifecycle Management
- architectural guidance / Architectural guidance
- Big Data tools / Big Data tools and technologies
- Data Governance and Security layer
- about / The Data Governance and Security Layer
- data ingest parallelism
- about / Data ingest parallelism
- limitations, addressing with Data Lake / Addressing the limitations using Data Lake
- Data Intake tier
- about / The Data Intake tier
- Source System Zone / The Source System Zone
- Transient Zone / The Transient Zone
- Raw Zone / The Raw Zone
- Batch Raw Storage / Batch Raw Storage
- real-time Raw Storage / The real-time Raw Storage
- Data Integration
- about / Understanding Data Integration
- architecture / Introduction to Data Integration
- features / Prominent features of Data Integration
- practical scenarios / Practical Data Integration scenarios
- working / The workings of Data Integration
- Data Integration process
- steps / The workings of Data Integration
- Raw data discovery / Raw data discovery
- data quality assessment / Data quality assessment
- data cleansing / Data cleansing
- data transformations / Data transformations
- data enrichment / Data enrichment
- metadata, collecting / Collect metadata and track data lineage
- data lineage, tracking / Collect metadata and track data lineage
- Data Integrity checks, Transient Landing Zone
- about / Data Integrity checks
- record counts, checking / Checking record counts
- column counts, checking for / Checking for column counts
- schema validation checks / Schema validation checks
- Data Lake
- evolution / Before the Data Lake
- about / Before the Data Lake
- need for / Need for Data Lake
- defining / Defining Data Lake
- key benefits / Key benefits of Data Lake
- challenges / Challenges in implementing a Data Lake
- recommendations / When to go for a Data Lake implementation
- architecture / Data Lake architecture
- current and future trends / The current and future trends
- future enterprise trajectories / Data Lake and future enterprise trajectories
- future technologies / Future Data Lake technologies
- Data Lake layers
- about / Architectural composition, Understanding Data Lake layers
- Data Governance and Security layer / The Data Governance and Security Layer
- Information Lifecycle Management layer / The Information Lifecycle Management layer
- Metadata layer / The Metadata Layer
- Data Lake tiers
- about / Architectural composition, Understanding Data Lake tiers
- Data Intake tier / The Data Intake tier
- Data Management tier / The Data Management tier
- Data Consumption tier / The Data Consumption tier
- data lineage processes, Raw Storage Zone
- about / Data lineage processes
- Watermarking process / Watermarking process
- metadata capture / Metadata capture
- Data Management tier
- about / The Data Management tier
- Integration Zone / The Integration Zone
- Enrichment Zone / The Enrichment Zone
- Data Hub Zone / The Data Hub Zone
- Data Management Tier
- about / Introduction to the Data Management Tier
- data partitioning
- about / Data partitioning
- limitations, addressing with Data Lake / Addressing the limitations using Data Lake
- data pipelines
- about / Data pipelines
- limitations, addressing with Data Lake / Addressing the limitations using Data Lake
- Data Provisioning
- about / An introduction to Data Consumption, Data Provisioning and metadata
- metadata / Data Provisioning and metadata
- data publication / Data publication
- data subscription / Data subscription
- functionalities / Data Provisioning functionalities
- data formatting / Data formatting
- data selection / Data selection
- approaches / Data Provisioning approaches
- post-provisioning processes / Post-provisioning processes
- Big Data tools / Big Data tools and technologies
- data quality assessment, Data Integration process
- about / Data quality assessment
- data profiling / Profiling the data
- data security and privacy
- about / Data security and privacy
- Big Data implications / Big Data implications for security and privacy
- data transformations, Data Integration process
- about / Data transformations
- unstructured text transformation techniques / Unstructured text transformation techniques
- structured data transformations / Structured data transformations
- deduplication process, file duplication checks
- file names, comparing / File duplication checks
- schema, comparing / File duplication checks
- content, comparing / File duplication checks
- deep Integrity checks, Raw Storage Zone
- about / Deep Integrity checks
- bit level Integrity checks / Bit Level Integrity checks
- periodic checksum checks / Periodic checksum checks
- Domain Specific Languages (DSL)
- inference / Applications of relation extraction
E
- Elasticsearch
- about / Elasticsearch
- use case scenarios / Use case scenarios for Elasticsearch
- reference / Use case scenarios for Elasticsearch
- Electronic Health Records (EHR)
- implementing / Understanding how semantic technologies work
- Enterprise Resource Planning (ERP) / Practical Data Integration scenarios
- extensibility
- about / Extensibility
- limitations, addressing / Addressing the limitations using Data Lake
- External Data Sources / Understanding Intake tier zones
- extract, transform, load (ETL) / Batch Raw Storage
F
- faceted search
- about / Faceted search
- feature-based methods
- about / Feature-based methods
- working / Understanding how feature-based methods work
- implementing / Implementation
- features, Data Integration
- loosely coupled integration / Loosely coupled Integration
- user-friendly / Ease of use
- secure access / Secure access
- high-quality data / High-quality data
- lineage tracking / Lineage tracking
- file validation checks, Transient Landing Zone
- about / File validation checks
- file duplication checks / File duplication checks
- deduplication process / File duplication checks
- file integrity checks / File integrity checks
- file size checks / File size checks
- file periodicity checks / File periodicity checks
- Fluentd
- about / Fluentd
- use case scenarios / Use case scenarios for Fluentd
- reference / Use case scenarios for Fluentd
- fuzzy search
- about / Fuzzy search
- edit distance / Edit distance
- wildcard / Wildcard and regular expressions
- regular expressions / Wildcard and regular expressions
H
- high-performance tier / The Management Tier
I
- IBM Big Data platform
- about / IBM Big Data platform
- working / Understanding how governance is provided in IBM Big Data platform
- use case scenarios / Use case scenarios for IBM Big Data platform
- reference / Use case scenarios for IBM Big Data platform
- IBM InfoSphere Data Explorer
- about / IBM InfoSphere Data Explorer
- use case scenarios / Use case scenarios for IBM InfoSphere Data Explorer
- reference / Use case scenarios for IBM InfoSphere Data Explorer
- IBM InfoSphere Data Replication / Understanding how governance is provided in IBM Big Data platform
- IBM InfoSphere Federation Server / Understanding how governance is provided in IBM Big Data platform
- IBM InfoSphere Guardium / Understanding how governance is provided in IBM Big Data platform
- IBM InfoSphere Information Server / Understanding how governance is provided in IBM Big Data platform
- IBM InfoSphere Master Data Management / Understanding how governance is provided in IBM Big Data platform
- IBM InfoSphere Optim / Understanding how governance is provided in IBM Big Data platform
- imputation techniques
- about / Imputation of missing, null, or invalid values
- mean value imputation / Imputation of missing, null, or invalid values
- median value imputation / Imputation of missing, null, or invalid values
- mode value imputation / Imputation of missing, null, or invalid values
- prediction model imputation / Imputation of missing, null, or invalid values
- Incremental Data Load, structured data
- time stamps / Structured data loading approaches
- partitioning / Structured data loading approaches
- change tables / Structured data loading approaches
- triggers / Structured data loading approaches
- indexing
- about / Indexing data
- inverted index / Inverted index
- applications / Applications of Indexing
- Information Lifecycle Management (ILM) layer / The Information Lifecycle Management layer
- Information Lifecycle Management
- about / Information Lifecycle Management
- Big Data implications / Big Data implications for ILM
- implementing / Implementing ILM using Data Lake
- Intake Tier / The Intake Tier
- Management Tier / The Management Tier
- Consumption Tier / The Consumption Tier
- Information Lifecycle Management layer
- about / The Information Lifecycle Management layer
- Intake Processing / Understanding connectivity processing
- Intake Processing, for data variety
- about / Understanding Intake Processing for data variety
- structured data / Structured data
- semi-structured data / Semi-structured data
- unstructured data / Unstructured data
- Intake tier
- practical data ingestion scenarios / Practical Data Ingestion scenarios
- architectural guidance / Architectural guidance
- Intake tier zones
- about / Understanding Intake tier zones
- Source System Zone / Understanding Intake tier zones, Source System Zone functionalities
- Transient Landing Zone / Understanding Intake tier zones, Transient Landing Zone functionalities
- Raw Storage Zone / Understanding Intake tier zones, Raw Storage Zone functionalities
- inverted index
- about / Inverted index
- working / Understanding how inverted index works
- implementing / Implementation
K
- Kafka
- about / Kafka
- use case scenarios / Use case scenarios for Kafka
- reference / Use case scenarios for Kafka
- Kinesis
- about / Amazon Kinesis
- use case scenarios / Use case scenarios for Kinesis
- reference / Use case scenarios for Kinesis
L
- Lambda Architecture (LA) / Data Lake and future enterprise trajectories
- Latent Semantic Analysis (LSA) / Applications of Indexing
- Lineage Tracking / Data lineage processes
- Longest Common Subsequence (LCS) / Edit distance
- low-cost storage tier / The Management Tier
M
- Massively Parallel Processing (MPP) system / Scale on demand
- Master Data Management (MDM) / Defining Data Lake, Introduction to Data Governance
- Metadata layer
- about / The Metadata Layer
- Mirroring / Periodic checksum checks
O
- offline archive tier / The Management Tier
- Online Analytical Processing (OLAP) systems / Before the Data Lake
- online archive tier / The Management Tier
- Online Transaction Processing (OLTP) systems / Before the Data Lake
P
- partition / Addressing the limitations using Data Lake
- Pentaho
- about / Pentaho
- use case scenarios / Pentaho, Use case scenarios for Pentaho
- Principal Component Analysis (PCA) / Structured data transformations
R
- Raw Storage Zone
- about / Raw Storage Zone functionalities
- functionalities / Raw Storage Zone functionalities
- data lineage processes / Data lineage processes
- deep integrity checks / Deep Integrity checks
- security and governance / Security and governance
- Information Lifecycle Management / Information Lifecycle Management
- relation extraction
- about / Relation extraction
- relationships, extracting from unstructured data / Extracting relationships from unstructured data
- relationships, extracting from structured data / Extracting Relationships from structured data
- applications / Applications of relation extraction
- relationships, extracting from unstructured data
- about / Extracting relationships from unstructured data
- feature-based methods / Feature-based methods
- semantic technologies / Semantic technologies
- Research Data Management (RDM) / Defining Data Lake
S
- Scale on demand
- about / Scale on demand
- limitations, addressing with Data Lake / Addressing the limitations using Data Lake
- security issues, in Data Lake tiers
- about / Security issues in the Data Lake tiers
- Intake Tier / The Intake Tier
- Management Tier / The Management Tier
- Consumption Tier / The Consumption Tier
- semantic search
- about / Semantic search
- word sense disambiguation / Word sense disambiguation
- Latent Semantic Analysis (LSA) / Latent Semantic Analysis
- semantic technologies
- about / Semantic technologies
- working / Understanding how semantic technologies work
- Web Ontology Language (OWL) / Understanding how semantic technologies work
- SPARQL Protocol And RDF Query Language (SPARQL) / Understanding how semantic technologies work
- inference / Understanding how semantic technologies work
- implementing / Implementation
- semi-structured data
- about / Semi-structured data
- need for integrating, in Data Lake / The need for integrating semi-structured data in the Data Lake
- loading approaches / Semi-structured data loading approaches
- Source System Zone
- about / Understanding Intake tier zones
- functionalities / Source System Zone functionalities
- connectivity processing / Understanding connectivity processing
- data variety / Understanding Intake Processing for data variety
- Splunk
- about / Splunk
- use case scenarios / Use case scenarios for Splunk
- reference / Use case scenarios for Splunk
- Sqoop
- about / Sqoop
- use case scenarios / Use case scenarios for Sqoop
- reference / Use case scenarios for Sqoop
- Storm
- about / Apache Storm
- use case scenarios / Use case scenarios for Storm
- reference / Use case scenarios for Storm
- structured data
- about / Structured data
- examples / Structured data
- need for integrating, in Data Lake / The need for integrating Structured Data in the Data Lake
- loading approaches / Structured data loading approaches
- Full Data Load / Structured data loading approaches
- Incremental Data Load / Structured data loading approaches
- structured data transformations
- about / Structured data transformations
- attribute-level structured data transformations / Structured data transformations
- table-level structured data transformations / Structured data transformations
- Supply Chain Management (SCM) / Practical Data Integration scenarios
- Support Vector Machine (SVM) / Structured data transformations
- Syncsort
- use case scenarios / Use case scenarios for Syncsort
T
- Tableau
- about / Tableau
- use case scenarios / Use case scenarios for Tableau
- reference / Use case scenarios for Tableau
- Talend
- about / Talend
- use case scenarios / Use case scenarios for Talend
- traditional data integration, versus Data Lake
- about / Traditional Data Integration versus Data Lake
- data pipelines / Data pipelines
- data partitioning / Data partitioning
- scale on demand / Scale on demand
- data ingest parallelism / Data ingest parallelism
- extensibility / Extensibility
- traditional data warehouse (DW) systems / Need for Data Lake
- Transient Landing Zone
- about / Transient Landing Zone functionalities
- functionalities / Transient Landing Zone functionalities
- file validation checks / File validation checks
- Data Integrity checks / Data Integrity checks
U
- unstructured data
- about / Unstructured data
- examples / Unstructured data
- need for integrating, in Data Lake / The need for integrating Unstructured data in the Data Lake
- loading approaches / Unstructured data loading approaches
- unstructured data, classifying
- about / Classifying unstructured data
- named entity recognition / Named entity recognition
- Conditional Random Fields (CRF) / Named entity recognition
- Maximum Entropy Markov Models (MEMM) / Named entity recognition
- Hidden Markov Models (HMM) / Named entity recognition
- topic models / Topic modeling
- Latent Dirichlet Allocation (LDA) / Topic modeling
- Hierarchical Dirichlet Process (HDP) / Topic modeling
- text clustering / Text clustering
W
- WebHDFS
- use case scenarios / Use case scenarios for WebHDFS
- reference / Use case scenarios for WebHDFS