
You're reading from Mastering Spark for Data Science

Product type: Book
Published in: Mar 2017
Publisher: Packt
ISBN-13: 9781785882142
Edition: 1st
Authors (4):

Andrew Morgan

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has designed systems for some of its most prestigious players and their global clients, often on large, complex, and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist, and the inventor of the TrendCalculus algorithm, which was developed as part of his ongoing research project investigating long-range predictions based on applying machine learning to the patterns found in drifting cultural, geopolitical, and economic trends. He sits on the Hadoop Summit EU data science selection committee, has spoken at many conferences on a variety of data topics, and enjoys participating in the Data Science and Big Data communities in London, where he lives.

Antoine Amend

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data in the early days of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of over 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

Matthew Hallett

Matthew Hallett is a Software Engineer and Computer Scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures, in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

David George

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity. Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.


Introducing the Big Data ecosystem


Data management is of particular importance when the data is in flux: either constantly changing or being routinely produced and updated. What is needed in these cases is a way of storing, structuring, and auditing data that allows for the continuous processing and refinement of models and results.

Here, we describe how to best hold and organize your data to integrate with Apache Spark and related tools, within the context of a data architecture that is broad enough to fit everyday requirements.

Data management

Even if, in the medium term, you only intend to play around with a bit of data at home, without proper data management your efforts will, more often than not, escalate to the point where it is easy to lose track of where you are, and mistakes will happen. Taking the time to think about the organization of your data, and in particular its ingestion, is crucial. There's nothing worse than waiting for a long-running analytic to complete, collating the results, and producing a report, only to discover that you used the wrong version of the data, that the data is incomplete or has missing fields, or, even worse, that you deleted your results!

The bad news is that, despite its importance, data management is an area that is consistently overlooked in both commercial and non-commercial ventures, with precious few off-the-shelf solutions available. The good news is that it is much easier to do great data science using the fundamental building blocks that this chapter describes.

Data management responsibilities

When we think about data, it is easy to underestimate the true scope of what we need to consider. Indeed, most data "newbies" think about the scope in this way:

  1. Obtain data

  2. Place the data somewhere (anywhere)

  3. Use the data

  4. Throw the data away

In reality, there are a large number of other considerations; it is our combined responsibility to determine which ones apply to a given piece of work. The following data management building blocks assist in answering or tracking some important questions about the data:

  • File integrity

    • Is the data file complete?

    • How do you know?

    • Was it part of a set?

    • Is the data file correct?

    • Was it tampered with in transit? (see the checksum sketch following this list)

  • Data integrity

    • Is the data as expected?

    • Are all of the fields present?

    • Is there sufficient metadata?

    • Is the data quality sufficient?

    • Has there been any data drift?

  • Scheduling

    • Is the data routinely transmitted?

    • How often does the data arrive?

    • Was the data received on time?

    • Can you prove when the data was received?

    • Does it require acknowledgement?

  • Schema management

    • Is the data structured or unstructured?

    • How should the data be interpreted?

    • Can the schema be inferred?

    • Has the data changed over time?

    • Can the schema be evolved from the previous version? (see the schema sketch following this list)

  • Version Management

    • What is the version of the data?

    • Is the version correct?

    • How do you handle different versions of the data?

    • How do you know which version you're using?

  • Security

    • Is the data sensitive?

    • Does it contain personally identifiable information (PII)?

    • Does it contain personal health information (PHI)?

    • Does it contain payment card information (PCI)?

    • How should I protect the data?

    • Who is entitled to read/write the data?

    • Does it require anonymization/sanitization/obfuscation/encryption?

  • Disposal

    • How do we dispose of the data?

    • When do we dispose of the data?
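
To make the file-integrity questions concrete, the following is a minimal sketch, in plain Scala, of verifying a file against a supplied checksum at ingestion time. The choice of SHA-256 and the ".sha256" companion-file convention are illustrative assumptions, not a requirement of any particular tool:

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    object FileIntegrity {

      // Compute the SHA-256 digest of a file as a lowercase hex string.
      // (Reads the whole file into memory; fine for a sketch, but very
      // large files would be digested in a streaming fashion.)
      def sha256(path: String): String = {
        val bytes  = Files.readAllBytes(Paths.get(path))
        val digest = MessageDigest.getInstance("SHA-256").digest(bytes)
        digest.map("%02x".format(_)).mkString
      }

      // Compare the computed digest with one supplied alongside the file.
      def verify(dataPath: String, expectedDigest: String): Boolean =
        sha256(dataPath) == expectedDigest.trim.toLowerCase

      def main(args: Array[String]): Unit = {
        val path = args(0)
        // Assumes the sender ships a "<file>.sha256" companion file.
        val expected = new String(Files.readAllBytes(Paths.get(path + ".sha256")))
        if (verify(path, expected)) println(s"$path: checksum OK")
        else sys.error(s"$path: checksum mismatch - possible corruption or tampering")
      }
    }

A mismatch answers "How do you know?" directly: the file is incomplete or was altered in transit, and should be quarantined rather than ingested.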

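The schema-management questions can likewise be explored directly in Spark. The sketch below contrasts inferring a schema with declaring one explicitly, and shows Parquet's mergeSchema option as one way of reading across evolved versions of a dataset; the paths and field names are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object SchemaManagement {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("schema-management-sketch")
          .master("local[*]")
          .getOrCreate()

        // Option 1: let Spark infer the schema. Convenient, but it costs
        // an extra pass over the data and can guess types incorrectly.
        val inferred = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/incoming/events.csv")   // hypothetical path
        inferred.printSchema()

        // Option 2: declare the schema explicitly, so unexpected input
        // fails fast instead of silently changing types between deliveries.
        val schema = StructType(Seq(
          StructField("id",        LongType,      nullable = false),
          StructField("timestamp", TimestampType, nullable = true),
          StructField("payload",   StringType,    nullable = true)
        ))
        val declared = spark.read
          .option("header", "true")
          .schema(schema)
          .csv("/data/incoming/events.csv")
        declared.printSchema()

        // Schema evolution: Parquet files written with different but
        // compatible schemas can be read together by merging them.
        val merged = spark.read
          .option("mergeSchema", "true")
          .parquet("/data/warehouse/events")  // hypothetical path
        merged.printSchema()

        spark.stop()
      }
    }

Declaring the schema up front also gives you a natural place to detect drift: a delivery that no longer matches the declared StructType fails loudly at read time.
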
If, after all that, you are still not convinced, then before you go ahead and write that bash script using the gawk and crontab commands, keep reading: you will soon see that there is a far quicker, more flexible, and safer method that allows you to start small and incrementally create commercial-grade ingestion pipelines!

The right tool for the job

Apache Spark is the emerging de facto standard for scalable data processing. At the time of writing this book, it is the most active Apache Software Foundation (ASF) project, and has a rich variety of companion tools available. New projects appear every day, many of which overlap in functionality, so it takes time to learn what they do and to decide whether they are appropriate to use. Unfortunately, there's no quick way around this: specific trade-offs must usually be made on a case-by-case basis, and there is rarely a one-size-fits-all solution. Therefore, the reader is encouraged to explore the available tools and choose wisely!

Various technologies are introduced throughout this book, and the hope is that they will provide the reader with a taster of some of the more useful and practical ones, to a level where they may start utilizing them in their own projects. Further, we hope to show that if the code is written carefully, technologies may be interchanged through clever use of Application Programming Interfaces (APIs), or higher-order functions in Spark Scala, even when a technology decision later proves to be incorrect.
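
As a sketch of that idea, the pipeline below is parameterized by its output function, so the storage technology can be swapped without touching the transformation logic; the transformation and the two sinks are illustrative placeholders:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object SwappablePipeline {

      // The pipeline knows nothing about storage: the sink is passed in
      // as a higher-order function, keeping technologies interchangeable.
      def runPipeline(transform: DataFrame => DataFrame,
                      sink: DataFrame => Unit)(input: DataFrame): Unit =
        sink(transform(input))

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("swappable-pipeline-sketch")
          .master("local[*]")
          .getOrCreate()

        val input = spark.range(100).toDF("id")   // stand-in input data

        val transform: DataFrame => DataFrame =
          df => df.filter(df("id") % 2 === 0)     // illustrative step

        // Two interchangeable sinks; switching storage is a one-line change.
        val parquetSink: DataFrame => Unit =
          df => df.write.mode("overwrite").parquet("/tmp/out-parquet")
        val jsonSink: DataFrame => Unit =
          df => df.write.mode("overwrite").json("/tmp/out-json")

        runPipeline(transform, parquetSink)(input)
        // If Parquet later proves the wrong choice:
        // runPipeline(transform, jsonSink)(input)

        spark.stop()
      }
    }

Should the initial storage decision prove incorrect, only the sink definition changes; the transformation, and everything built on top of it, is untouched.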
