Understanding big data


Big data is a term that refers to the challenges we face due to the exponential growth of data, commonly described as the "V" problems: volume, velocity, and variety. These challenges can be subdivided into the following phases:

  • Capture

  • Storage

  • Search

  • Sharing

  • Analytics

  • Visualization

Big data systems refer to technologies that can process and analyze data at this scale, addressing the volume, velocity, and variety problems discussed earlier. Technologies that solve big data problems generally follow one or more of these architectural strategies:

  • Distributed computing system

  • Massively parallel processing (MPP)

  • NoSQL (Not only SQL)

  • Analytical database


Big data systems use distributed computing and massively parallel processing to handle big data problems. Apart from these, other architectures tackle big data problems within the database environment itself: NoSQL and analytical (advanced SQL) databases.
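To make the distributed computing and parallel processing strategy concrete, the following is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API, assuming the input and output HDFS paths are passed as command-line arguments:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: runs in parallel on splits of the input across the cluster
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reducer: aggregates the counts for each word after the shuffle phase
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework splits the input across the cluster, runs mappers in parallel, shuffles the intermediate (word, 1) pairs, and aggregates them in the reducers, which is exactly the distributed computing pattern described above.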

NoSQL

A NoSQL database is a widely adopted technology because of its schemaless design and because scaling it vertically and horizontally is fairly simple and takes much less effort. SQL and the RDBMS have ruled for more than three decades and perform well within the limits of their processing environment; beyond those limits, RDBMS performance degrades, cost increases, and manageability decreases. NoSQL provides an edge over the RDBMS in these scenarios.

Note

One important thing to mention is that NoSQL databases do not support all of the ACID properties; instead, they are highly scalable, provide availability, and are fault tolerant. In line with the CAP theorem, a NoSQL database usually guarantees either consistency or availability (the availability of nodes for processing), depending upon its architecture and design.

Types of NoSQL databases

As NoSQL databases are nonrelational, they admit a different set of possible architectures and designs. Broadly, there are four general types of NoSQL databases, based on how the data is stored:

  1. Key-value store: These databases are designed for storing data as key-value pairs. The key can be custom, synthetic, or autogenerated, and the value can be a complex object such as XML, JSON, or a BLOB. The key is indexed for faster access, which improves the retrieval of values (a minimal sketch follows this list). Some popular key-value databases are DynamoDB, Azure Table Storage (ATS), Riak, and BerkeleyDB.

  2. Column store: These databases are designed for storing data as groups of column families. Read/write operations are done using columns rather than rows. One advantage is the scope for compression, which saves space efficiently and avoids scanning irrelevant columns: because of the columnar design, not all files need to be read, and each column file can be compressed, especially when a column has many nulls and repeating values. Column store databases are highly scalable and have a very high-performance architecture (see the HBase example after this list). Some popular column store databases are HBase, BigTable, Cassandra, Vertica, and Hypertable.

  3. Document database: These databases are designed for storing, retrieving, and managing document-oriented information. A document database expands on the idea of a key-value store: the values (documents) have some internal structure and are encoded in formats such as XML, YAML, or JSON, or in binary forms such as BSON, PDF, Microsoft Office documents (MS Word, Excel), and so on. The advantage of storing data in an encoded format such as XML or JSON is that we can search on keys within the document itself, which is quite useful for ad hoc querying of semi-structured data (see the MongoDB sketch after this list). Some popular document databases are MongoDB and CouchDB.

  4. Graph database: These databases are designed for data whose relations are well represented as a tree or a graph of interconnected elements, usually nodes and edges. Relational databases are not well suited to graph-based queries, as they require many complex joins, and managing the interconnections becomes messy. Graph-theoretic algorithms are useful for prediction, user tracking, clickstream analysis, calculating the shortest path, and so on; graph databases process these much more efficiently, as the algorithms themselves are complex. Some popular graph databases are Neo4J and Polyglot.
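To make the key-value model concrete, here is a minimal, hypothetical in-memory sketch in Java. Real stores such as Riak or DynamoDB add partitioning, replication, and persistence on top of this basic put/get/delete interface; the class and key names here are made up for illustration.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal in-memory key-value store illustrating the access pattern:
// the key is hash-indexed, the value is an opaque blob (here a byte[]).
public class KeyValueStore {
    private final Map<String, byte[]> index = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) { index.put(key, value); }

    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(index.get(key));
    }

    public void delete(String key) { index.remove(key); }

    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        // The value can be any serialized object: JSON, XML, or a binary BLOB
        store.put("user:42", "{\"name\":\"Alice\",\"city\":\"Pune\"}".getBytes());
        store.get("user:42").ifPresent(v -> System.out.println(new String(v)));
    }
}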
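For the column store model, the following is a hedged sketch against the HBase Java client API (HBase 1.0 and later). It assumes an HBase cluster is reachable via the default configuration and that a users table with a profile column family already exists; both names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseColumnStoreExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write two cells into the "profile" column family of row "user42"
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"),
                          Bytes.toBytes("Pune"));
            table.put(put);

            // Read back only the column we need; other column families
            // are not touched, which is the columnar advantage noted above
            Get get = new Get(Bytes.toBytes("user42"));
            get.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"))));
        }
    }
}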
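For the document model, here is a hedged sketch using the MongoDB Java driver (the com.mongodb.client API, driver 3.7 and later). It assumes a mongod instance is running on localhost:27017; the demo database and users collection names are illustrative.

import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class DocumentStoreExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                client.getDatabase("demo").getCollection("users");

            // Store a semi-structured, JSON-like document; no fixed schema required
            users.insertOne(new Document("name", "Alice")
                    .append("city", "Pune")
                    .append("skills", Arrays.asList("Hadoop", "Hive")));

            // Ad hoc query on a field inside the document, as described above
            Document found = users.find(eq("city", "Pune")).first();
            if (found != null) {
                System.out.println(found.toJson());
            }
        }
    }
}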

Analytical database

An analytical database is a type of database built to store, manage, and consume big data. Analytical databases are vendor-managed DBMSs optimized for advanced analytics: highly complex queries over terabytes of data, complex statistical processing, data mining, and natural language processing (NLP). Examples of analytical databases are Vertica (acquired by HP), Aster Data (acquired by Teradata), Greenplum (acquired by EMC), and so on.
