Practical Big Data Analytics


Product type Book
Published in Jan 2018
Publisher Packt
ISBN-13 9781783554393
Pages 412 pages
Edition 1st Edition
Author (1):
Nataraj Dasgupta


Types of Big Data


Data can be broadly classified as being structured, unstructured, or semi-structured. Although these distinctions have always existed, the classification of data into these categories has become more prominent with the advent of big data.

Structured

Structured data, as the name implies, refers to datasets that have a defined organizational structure, such as those stored in Microsoft Excel or CSV files. In pure database terms, the data should be representable using a schema. As an example, the following table, which lists the top five happiest countries in the world as published by the United Nations in its 2017 World Happiness Report, would be a typical representation of structured data.

We can clearly define the data types of the columns: Rank, Score, GDP per capita, Social support, Healthy life expectancy, Trust, Generosity, and Dystopia are numerical columns, whereas Country is represented using letters, or more specifically, strings.

Refer to the following table for a little more clarity:

Rank  Country      Score  GDP per capita  Social support  Healthy life expectancy  Generosity  Trust  Dystopia
1     Norway       7.537  1.616           1.534           0.797                    0.362       0.316  2.277
2     Denmark      7.522  1.482           1.551           0.793                    0.355       0.401  2.314
3     Iceland      7.504  1.481           1.611           0.834                    0.476       0.154  2.323
4     Switzerland  7.494  1.565           1.517           0.858                    0.291       0.367  2.277
5     Finland      7.469  1.444           1.54            0.809                    0.245       0.383  2.43

World Happiness Report, 2017 [Source: https://en.wikipedia.org/wiki/World_Happiness_Report#cite_note-4]
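To make the idea of a schema concrete, here is a minimal sketch that declares the columns of the happiness table with explicit types and then queries them. It uses Python's built-in sqlite3 module purely for illustration; the table name, column names, and helper functions are assumptions made for this example and are not taken from the book.

```python
import sqlite3

# Rows copied from the World Happiness Report table shown earlier.
ROWS = [
    (1, "Norway", 7.537, 1.616, 1.534, 0.797, 0.362, 0.316, 2.277),
    (2, "Denmark", 7.522, 1.482, 1.551, 0.793, 0.355, 0.401, 2.314),
    (3, "Iceland", 7.504, 1.481, 1.611, 0.834, 0.476, 0.154, 2.323),
    (4, "Switzerland", 7.494, 1.565, 1.517, 0.858, 0.291, 0.367, 2.277),
    (5, "Finland", 7.469, 1.444, 1.54, 0.809, 0.245, 0.383, 2.43),
]


def build_table(rows):
    """Create an in-memory table whose schema names and types every column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE happiness (
            ranking                 INTEGER PRIMARY KEY,
            country                 TEXT NOT NULL,
            score                   REAL NOT NULL,
            gdp_per_capita          REAL NOT NULL,
            social_support          REAL NOT NULL,
            healthy_life_expectancy REAL NOT NULL,
            generosity              REAL NOT NULL,
            trust                   REAL NOT NULL,
            dystopia                REAL NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO happiness VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def countries_with_gdp_above(conn, threshold):
    """Return country names whose GDP per capita exceeds the threshold, by rank."""
    cursor = conn.execute(
        "SELECT country FROM happiness WHERE gdp_per_capita > ? ORDER BY ranking",
        (threshold,),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    connection = build_table(ROWS)
    print(countries_with_gdp_above(connection, 1.5))
```

Because every row has the same named, typed columns, a question such as "which countries have a GDP per capita score above 1.5?" can be answered with a single declarative query.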

Commercial databases such as Teradata and Greenplum, as well as Redis, Cassandra, and Hive in the open source domain, are examples of technologies that provide the ability to manage and query structured data.

Unstructured

Unstructured data consists of any dataset that does not have a predefined organizational schema of the kind shown in the table in the prior section. Spoken words, music, videos, and even books, including this one, would be considered unstructured. This by no means implies that the content doesn't have organization. Indeed, a book has a table of contents, chapters, subchapters, and an index; in that sense, it follows a definite organization.

However, it would be futile to represent every word and sentence as being part of a strict set of rules. A sentence can consist of words, numbers, punctuation marks, and so on and does not have a predefined data type as spreadsheets do. To be structured, the book would need to have an exact set of characteristics in every sentence, which would be both unreasonable and impractical.

Note

Data from social media, such as posts on Twitter, messages from friends on Facebook, and photos on Instagram, are all examples of unstructured data.

Unstructured data can be stored in various formats. It can be held as BLOBs (binary large objects) or, in the case of textual data, as freeform text in a data storage medium. For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used to index and query the content and to perform other operations on it.
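Text search technologies typically make freeform text queryable by building an index from words to the documents that contain them. The following is a toy sketch of that idea, an inverted index, written in plain Python. It is not how Lucene, Solr, or Elasticsearch are implemented, and the sample documents and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical freeform text documents, keyed by an arbitrary id.
DOCS = {
    "post-1": "Big data is changing how we analyze social media posts.",
    "post-2": "Photos and videos are unstructured data.",
    "post-3": "A book has chapters, an index, and a table of contents.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return sorted(matches)


if __name__ == "__main__":
    idx = build_inverted_index(DOCS)
    print(search(idx, "unstructured data"))
```

Note that the index imposes structure on the text after the fact; the documents themselves remain freeform, which is what distinguishes this from querying a table with a fixed schema.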

Semi-structured

Semi-structured data refers to data that has both the elements of an organizational schema as well as aspects that are arbitrary. A personal phone diary (increasingly rare these days!) with columns for name, address, phone number, and notes could be considered a semi-structured dataset. The user might not be aware of the addresses of all individuals and hence some of the entries may have just a phone number and vice versa.

Similarly, the column for notes may contain additional descriptive information (such as a facsimile number, name of a relative associated with the individual, and so on). It is an arbitrary field that allows the user to add complementary information. The columns for name, address, and phone number can thus be considered structured in the sense that they can be presented in a tabular format, whereas the notes section is unstructured in the sense that it may contain an arbitrary set of descriptive information that cannot be represented in the other columns in the diary.

In computing, semi-structured data is usually represented by formats such as JSON that can encapsulate both structured and schemaless or arbitrary associations, generally using key-value pairs. A more common example is email: every message has a structured part, such as the name of the sender and the time when the message was received, and an unstructured part, namely the body or content of the email.
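The email example can be sketched with JSON and Python's standard json module. In this illustration each message carries the same two structured fields, a freeform body, and optionally some extra keys that only some messages have. The field names, sample messages, and helper function are assumptions made for this example.

```python
import json

# Hypothetical messages: "sender" and "received" appear in every record,
# "body" is freeform text, and any other key is optional and arbitrary.
RAW_MESSAGES = """
[
  {"sender": "alice@example.com",
   "received": "2018-01-15T09:30:00",
   "body": "Lunch at noon? Bring the report."},
  {"sender": "bob@example.com",
   "received": "2018-01-15T10:05:00",
   "body": "See attached.",
   "attachments": ["report.pdf"]}
]
"""

COMMON_FIELDS = ("sender", "received")


def split_message(message):
    """Split a message into (structured fields, freeform body, extra keys)."""
    missing = [field for field in COMMON_FIELDS if field not in message]
    if missing:
        raise ValueError(f"missing common fields: {missing}")
    structured = {field: message[field] for field in COMMON_FIELDS}
    body = message.get("body", "")
    extras = {
        key: value
        for key, value in message.items()
        if key not in COMMON_FIELDS and key != "body"
    }
    return structured, body, extras


if __name__ == "__main__":
    for msg in json.loads(RAW_MESSAGES):
        structured, body, extras = split_message(msg)
        print(structured["sender"], "|", body, "|", extras)
```

The common fields could be loaded into a table with fixed columns, while the body and the optional keys vary from message to message; it is this mix that makes the data semi-structured.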

Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
