You're reading from Mastering Spark for Data Science
Published in Mar 2017 by Packt
ISBN-13: 9781785882142, 1st Edition
Authors (4):
Andrew Morgan

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has designed systems for some of its most prestigious players and their global clients, often on large, complex, and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist and the inventor of the TrendCalculus algorithm, which was developed as part of his ongoing research project investigating long-range predictions based on machine learning of the patterns found in drifting cultural, geopolitical, and economic trends. He also sits on the Hadoop Summit EU data science selection committee and has spoken at many conferences on a variety of data topics. He enjoys participating in the Data Science and Big Data communities where he lives in London.

Antoine Amend

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data in the early days of Hadoop. He has embraced big data technologies ever since and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of more than 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

Matthew Hallett

Matthew Hallett is a software engineer and computer scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

David George

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity. Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.


Chapter 13. Secure Data

Throughout this book, we have visited many areas of data science, often straying into those that are not traditionally associated with a data scientist's core working knowledge. In particular, we dedicated an entire chapter, Chapter 2, Data Acquisition, to data ingestion, explaining how to solve an issue that is always present but rarely acknowledged or addressed adequately. In this chapter, we will visit another of those often overlooked fields: secure data. More specifically, we will look at how to protect your data and analytic results at all stages of the data life cycle, from ingestion right through to presentation, while at all times considering the important architectural and scalability requirements that are natural to the Spark paradigm.

In this chapter, we will cover the following topics:

  • How to implement coarse-grained data access controls using HDFS ACLs

  • A guide to fine-grained security, with explanations using the Hadoop ecosystem

  • How to ensure data is always...

Data security


The final piece of our data architecture is security, and in this chapter we will discover why data security is always important. Given the huge increase in the volume and variety of data in recent times, caused by many factors, but in no small part by the popularity of the Internet and related technologies, there is a growing need for fully scalable and secure solutions. We are going to explore those solutions along with the confidentiality, privacy, and legal concerns associated with storing, processing, and handling data, and we will relate these to the tools and techniques introduced in previous chapters.

We will continue by explaining the technical issues involved in securing data at scale and introduce ideas and techniques that tackle these concerns using a variety of access, classification, and obfuscation strategies. As in previous chapters, ideas are demonstrated with examples using the Hadoop ecosystem and public cloud infrastructure...

Authentication and authorization


Authentication concerns the mechanisms used to ensure that a user is who they say they are, and it operates at two key levels, namely local and remote.

Authentication can take various forms: the most common is the user login, but other examples include fingerprint reading, iris scanning, and PIN entry. User logins can be managed on a local basis, as you would on your personal computer, for example, or on a remote basis using a tool such as the Lightweight Directory Access Protocol (LDAP). Managing users remotely provides roaming user profiles that are independent of any particular hardware and can be managed independently of the user. All of these methods execute at the operating system level. There are other mechanisms that sit at the application layer and provide authentication for services, such as Google OAuth.

Alternative authentication methods have their own pros and cons; a particular implementation should be understood thoroughly before declaring...

Access


We have thus far concentrated only on the specific ideas of ensuring that a user is who they say they are and that only the correct users can view and use data. However, once we have taken the appropriate steps and confirmed these details, we still need to ensure that this data is secure when the user is actually using it; there are a number of areas to consider:

  • Is the user allowed to see all of the information in the data? Perhaps they are to be limited to certain rows, or even certain parts of certain rows (a sketch of this kind of row- and column-level restriction in Spark follows this list).

  • Is the data secure when the user runs analytics across it? We need to ensure that the data isn't transmitted as plain text and therefore open to man-in-the-middle attacks.

  • Is the data secure once the user has completed their task? There's no point in ensuring that the data is super secure at all stages, only to write plain text results to an insecure area.

  • Can conclusions be made from the aggregation of data? Even if the user only has access to certain rows of a dataset, let's say...
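
To make the first point concrete, the snippet below is a minimal sketch of row- and column-level restriction in Spark SQL before any analytics are run. The dataset path, column names, and the notion of a numeric clearance level are purely illustrative and not taken from this book; in practice, these rules would be driven by a policy store or a tool such as Apache Sentry, discussed later in this chapter.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("restricted-view").getOrCreate()

// Hypothetical dataset and column names
val records = spark.read.parquet("/data/customers")

// Row-level restriction: keep only rows at or below the user's clearance level.
// Column-level restriction: drop sensitive columns entirely.
val userClearance = 2
val restricted = records
  .filter(records("classification") <= userClearance)
  .drop("nationalId", "dateOfBirth")

// Downstream analytics only ever see the restricted view
restricted.createOrReplaceTempView("customers_restricted")
spark.sql("SELECT region, count(*) FROM customers_restricted GROUP BY region").show()

Because only the restricted DataFrame is registered as a temporary view, downstream queries never see the dropped columns or the filtered rows.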

Encryption


Arguably the most obvious and well-known method of protecting data is encryption. We use it whether our data is in transit or at rest, which means virtually all of the time, apart from when the data is actually being processed in memory. The mechanics of encryption differ depending upon the state of the data.
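
It is worth noting that Spark itself ships with configuration switches covering both states. The property names below are a sketch based on Spark 2.x; availability varies by version (the AES-based RPC encryption, for instance, only arrived in Spark 2.2), so verify them against the security documentation of your release.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Require components to authenticate with a shared secret before communicating
  .set("spark.authenticate", "true")
  // Encrypt RPC and shuffle traffic between driver and executors (data in transit)
  .set("spark.network.crypto.enabled", "true")
  // Encrypt shuffle spill files and blocks written to local disk (temporary data at rest)
  .set("spark.io.encryption.enabled", "true")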

Data at rest

Our data will always need to be stored somewhere, whether that is HDFS, S3, or local disk. Even if we have taken all of the precautions to ensure that users are authorized and authenticated, there is still the issue of plain text actually existing on the disk. With direct access to the disk, either physically or through a lower level of the stack, it is fairly trivial to stream the entire contents and glean the plain-text data.

If we encrypt data, then we are protected from this type of attack. The encryption can also exist at different levels, either by encrypting the data at the application layer using software, or by encrypting it at...
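
As an illustration of the application-layer option, the following sketch encrypts a record with AES using the JDK's javax.crypto API before it is ever written to disk, and decrypts it again as an authorized reader would. This is not the codec built later in the book; key management is deliberately glossed over, and the generated key and hard-coded record are illustrative only. A real system would obtain its keys from a KMS or keystore.

import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.IvParameterSpec
import java.security.SecureRandom

// Illustrative only: a real system would obtain this key from a KMS or keystore
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)
val key = keyGen.generateKey()

// Random initialization vector, required for CBC mode
val iv = new Array[Byte](16)
new SecureRandom().nextBytes(iv)

val plainText = "account=12345,balance=100.00".getBytes("UTF-8")

// Encrypt before the bytes ever touch the disk
val encryptor = Cipher.getInstance("AES/CBC/PKCS5Padding")
encryptor.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
val cipherText = encryptor.doFinal(plainText)

// Decrypt, as an authorized reader holding the same key would
val decryptor = Cipher.getInstance("AES/CBC/PKCS5Padding")
decryptor.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv))
println(new String(decryptor.doFinal(cipherText), "UTF-8"))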

Data disposal


Secure data should have an agreed life cycle. This will be set by a data authority when working in a commercial context, and it will dictate what state the data should be in at any given point during that life cycle. For example, a particular dataset may be labeled as sensitive - requires encryption for the first year of its life, followed by private - no encryption, and finally, disposal. The lengths of time and the rules applied will entirely depend upon the organization and the data itself - some data expires after just a few days, some after fifty years. The life cycle ensures that everyone knows exactly how the data should be treated, and it also ensures that older data is not needlessly taking up valuable disk space or breaching any data protection laws.

The correct disposal of data from secure systems is perhaps one of the most misunderstood areas of data security. Interestingly, it doesn't always involve a complete and/or destructive removal process. Examples where...

Kerberos authentication


Many installations of Apache Spark use Kerberos to provide security and authentication to services such as HDFS and Kafka. It's also especially common when integrating with third-party databases and legacy systems. As a commercial data scientist, at some point, you'll probably find yourself in a situation where you'll have to work with data in a Kerberized environment, so, in this part of the chapter, we'll cover the basics of Kerberos - what it is, how it works, and how to use it.

Kerberos is a third-party authentication technique that's particularly useful where the primary form of communication is over a network, which makes it ideal for Apache Spark. It's used in preference to alternative methods of authentication, for example, username and password, because it provides the following benefits:

  • No passwords are stored in plain text in application configuration files

  • Facilitates centralized management of services, identities, and permissions

  • Establishes a mutual trust...
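
In practice, a headless Spark application in a Kerberized cluster usually authenticates from a keytab rather than an interactive kinit. The sketch below uses Hadoop's UserGroupInformation API; the principal and keytab path are placeholders for whatever your KDC administrator issues. When submitting to YARN, the --principal and --keytab options of spark-submit achieve a similar effect for the application as a whole.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)

// Obtain a ticket from the KDC using the service keytab (placeholder values)
UserGroupInformation.loginUserFromKeytab(
  "analytics/host.example.com@EXAMPLE.COM",
  "/etc/security/keytabs/analytics.keytab")

println(s"Logged in as: ${UserGroupInformation.getCurrentUser}")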

Security ecosystem


We will conclude with a brief rundown of some of the popular security tools we may encounter while developing with Apache Spark - and some advice about when to use them.

Apache Sentry

As the Hadoop ecosystem grows ever larger, products such as Hive, HBase, HDFS, Sqoop, and Spark all have different security implementations. This means that duplicate policies are often required across the product stack in order to provide the user with a seamless experience, as well as to enforce the overarching security manifest. This can quickly become complicated and time-consuming to manage, which often leads to mistakes and even security breaches (whether intentional or otherwise). Apache Sentry pulls many of the mainstream Hadoop products together, particularly with Hive/HS2, to provide fine-grained (up to column-level) controls.

Using ACLs is simple but high-maintenance. Setting permissions for a large number of new files and amending umasks is cumbersome and time-consuming...
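
For completeness, here is what a coarse-grained HDFS ACL of the kind described above looks like when applied programmatically through the Hadoop FileSystem API; the hdfs dfs -setfacl and -getfacl shell commands achieve the same result. This sketch assumes ACLs are enabled on the NameNode (dfs.namenode.acls.enabled), and the path, group, and user names are placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.{AclEntry, AclEntryScope, AclEntryType, FsAction}
import scala.collection.JavaConverters._

val fs = FileSystem.get(new Configuration())
val secureDir = new Path("/data/secure")   // placeholder path

val entries = List(
  // Give the analytics group read and list access to the directory
  new AclEntry.Builder()
    .setScope(AclEntryScope.ACCESS)
    .setType(AclEntryType.GROUP)
    .setName("analytics")
    .setPermission(FsAction.READ_EXECUTE)
    .build(),
  // Give a single named user the same read-only access
  new AclEntry.Builder()
    .setScope(AclEntryScope.ACCESS)
    .setType(AclEntryType.USER)
    .setName("alice")
    .setPermission(FsAction.READ_EXECUTE)
    .build()
).asJava

fs.modifyAclEntries(secureDir, entries)
println(fs.getAclStatus(secureDir))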

Your security responsibility


Now that we've covered the common security use cases and discussed some of the tools that a data scientist needs to be aware of in their everyday activities, there's one last important item to note. While data is in their custody, the responsibility for it, including its security and integrity, lies with the data scientist. This is usually true whether or not you are explicitly told so. Therefore, it is crucial that you take this responsibility seriously and take all the necessary precautions when handling and processing data. If needed, also be ready to communicate their responsibilities to others. We all need to ensure that we are not held responsible for a breach off-site; this can be achieved by highlighting the issue or, indeed, by having a written contract with the off-site service provider outlining their security arrangements. To see a real-world example of what can go wrong when you don't pay proper attention to due diligence, have a look at some security notes...

Summary


In this chapter, we have explored the topic of data security and explained some of the surrounding issues. We have discovered that not only is there technical knowledge to master, but a data security mindset is just as important. Data security is often overlooked; taking a systematic approach and educating others is therefore a key part of mastering data science.

We have explained the data security life cycle and outlined the most important areas of responsibility, including authorization, authentication and access, along with related examples and use cases. We have also explored the Hadoop security ecosystem and described the important open source solutions currently available.

A significant part of this chapter was dedicated to building a Hadoop InputFormat compressor that operates as a data encryption utility for use with Spark. Appropriate configuration allows the codec to be used in a variety of key areas, crucially when spilling shuffled records...
