You're reading from Mastering Spark for Data Science
Published in Mar 2017 by Packt
ISBN-13: 9781785882142, 1st Edition
Authors (4):
Andrew Morgan

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has designed systems for some of its most prestigious players and their global clients, often on large, complex, and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist and the inventor of the TrendCalculus algorithm, which was developed as part of his ongoing research project investigating long-range predictions based on machine learning of the patterns found in drifting cultural, geopolitical, and economic trends. He also sits on the Hadoop Summit EU data science selection committee and has spoken at many conferences on a variety of data topics. He enjoys participating in the Data Science and Big Data communities where he lives in London.

Antoine Amend

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data in the early days of Hadoop. He has embraced big data technologies ever since and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of more than 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

Matthew Hallett

Matthew Hallett is a software engineer and computer scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

David George

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity. Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.


Chapter 13. Secure Data

Throughout this book, we have visited many areas of data science, often straying into those that are not traditionally associated with a data scientist's core working knowledge. In particular, we dedicated an entire chapter, Chapter 2, Data Acquisition, to data ingestion, explaining how to solve an issue that is always present but rarely acknowledged or addressed adequately. In this chapter, we will visit another of those often overlooked fields: secure data. More specifically, we will look at how to protect your data and analytic results at all stages of the data life cycle, from ingestion right through to presentation, while at all times considering the important architectural and scalability requirements that are natural to the Spark paradigm.

In this chapter, we will cover the following topics:

  • How to implement coarse-grained data access controls using HDFS ACLs

  • A guide to fine-grained security, with explanations using the Hadoop ecosystem

  • How to ensure data is always...

Data security


The final piece of our data architecture is security, and in this chapter we will discover why data security is always important. Given the huge increase in the volume and variety of data in recent times, caused by many factors, but in no small part by the popularity of the Internet and related technologies, there is a growing need for fully scalable and secure solutions. We are going to explore those solutions along with the confidentiality, privacy, and legal concerns associated with storing, processing, and handling data, and we will relate these to the tools and techniques introduced in previous chapters.

We will continue by explaining the technical issues involved in securing data at scale and introduce ideas and techniques that tackle these concerns using a variety of access, classification, and obfuscation strategies. As in previous chapters, ideas are demonstrated with examples using the Hadoop ecosystem and public cloud infrastructure...

Authentication and authorization


Authentication concerns the mechanisms used to ensure that a user is who they say they are, and it operates at two key levels, namely local and remote.

Authentication can take various forms: the most common is the user login, but other examples include fingerprint reading, iris scanning, and PIN entry. User logins can be managed on a local basis, as you would on your personal computer, for example, or on a remote basis using a tool such as the Lightweight Directory Access Protocol (LDAP). Managing users remotely provides roaming user profiles that are independent of any particular hardware and can be managed independently of the user. All of these methods execute at the operating system level. There are other mechanisms that sit at the application layer and provide authentication for services, such as Google OAuth.

Alternative authentication methods have their own pros and cons; a particular implementation should be understood thoroughly before declaring...

Access


We have thus far concentrated only on the specific ideas of ensuring that a user is who they say they are and that only the correct users can view and use data. However, once we have taken the appropriate steps and confirmed these details, we still need to ensure that this data is secure when the user is actually using it; there are a number of areas to consider:

  • Is the user allowed to see all of the information in the data? Perhaps they are to be limited to certain rows, or even certain parts of certain rows (a sketch of this kind of row- and column-level restriction in Spark follows this list).

  • Is the data secure when the user runs analytics across it? We need to ensure that the data isn't transmitted as plain text and therefore open to man-in-the-middle attacks.

  • Is the data secure once the user has completed their task? There's no point in ensuring that the data is super secure at all stages, only to write plain text results to an insecure area.

  • Can conclusions be made from the aggregation of data? Even if the user only has access to certain rows of a dataset, let's say...
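
To make the first point concrete, the snippet below is a minimal sketch of row- and column-level restriction in Spark SQL before any analytics are run. The dataset path, column names, and the notion of a numeric clearance level are purely illustrative and not taken from this book; in practice, these rules would be driven by a policy store or a tool such as Apache Sentry, discussed later in this chapter.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("restricted-view").getOrCreate()

// Hypothetical dataset and column names
val records = spark.read.parquet("/data/customers")

// Row-level restriction: keep only rows at or below the user's clearance level.
// Column-level restriction: drop sensitive columns entirely.
val userClearance = 2
val restricted = records
  .filter(records("classification") <= userClearance)
  .drop("nationalId", "dateOfBirth")

// Downstream analytics only ever see the restricted view
restricted.createOrReplaceTempView("customers_restricted")
spark.sql("SELECT region, count(*) FROM customers_restricted GROUP BY region").show()

Because only the restricted DataFrame is registered as a temporary view, downstream queries never see the dropped columns or the filtered rows.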

Encryption


Arguably the most obvious and well-known method of protecting data is encryption. We use it whether our data is in transit or at rest, which means virtually all of the time, apart from when the data is actually being processed in memory. The mechanics of encryption differ depending upon the state of the data.
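
It is worth noting that Spark itself ships with configuration switches covering both states. The property names below are a sketch based on Spark 2.x; availability varies by version (the AES-based RPC encryption, for instance, only arrived in Spark 2.2), so verify them against the security documentation of your release.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Require components to authenticate with a shared secret before communicating
  .set("spark.authenticate", "true")
  // Encrypt RPC and shuffle traffic between driver and executors (data in transit)
  .set("spark.network.crypto.enabled", "true")
  // Encrypt shuffle spill files and blocks written to local disk (temporary data at rest)
  .set("spark.io.encryption.enabled", "true")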

Data at rest

Our data will always need to be stored somewhere, whether that is HDFS, S3, or local disk. Even if we have taken all of the precautions to ensure that users are authorized and authenticated, there is still the issue of plain text actually existing on the disk. With direct access to the disk, either physically or through a lower level of the stack, it is fairly trivial to stream the entire contents and glean the plain-text data.

If we encrypt data, then we are protected from this type of attack. The encryption can also exist at different levels, either by encrypting the data at the application layer using software, or by encrypting it at...
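
As an illustration of the application-layer option, the following sketch encrypts a record with AES using the JDK's javax.crypto API before it is ever written to disk, and decrypts it again as an authorized reader would. This is not the codec built later in the book; key management is deliberately glossed over, and the generated key and hard-coded record are illustrative only. A real system would obtain its keys from a KMS or keystore.

import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.IvParameterSpec
import java.security.SecureRandom

// Illustrative only: a real system would obtain this key from a KMS or keystore
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)
val key = keyGen.generateKey()

// Random initialization vector, required for CBC mode
val iv = new Array[Byte](16)
new SecureRandom().nextBytes(iv)

val plainText = "account=12345,balance=100.00".getBytes("UTF-8")

// Encrypt before the bytes ever touch the disk
val encryptor = Cipher.getInstance("AES/CBC/PKCS5Padding")
encryptor.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
val cipherText = encryptor.doFinal(plainText)

// Decrypt, as an authorized reader holding the same key would
val decryptor = Cipher.getInstance("AES/CBC/PKCS5Padding")
decryptor.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv))
println(new String(decryptor.doFinal(cipherText), "UTF-8"))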

Data disposal


Secure data should have an agreed life cycle. This will be set by a data authority when working in a commercial context, and it will dictate what state the data should be in at any given point during that life cycle. For example, a particular dataset may be labeled as sensitive - requires encryption for the first year of its life, followed by private - no encryption, and finally, disposal. The lengths of time and the rules applied will entirely depend upon the organization and the data itself - some data expires after just a few days, some after fifty years. The life cycle ensures that everyone knows exactly how the data should be treated, and it also ensures that older data is not needlessly taking up valuable disk space or breaching any data protection laws.

The correct disposal of data from secure systems is perhaps one of the most misunderstood areas of data security. Interestingly, it doesn't always involve a complete and/or destructive removal process. Examples where...

Kerberos authentication


Many installations of Apache Spark use Kerberos to provide security and authentication to services such as HDFS and Kafka. It's also especially common when integrating with third-party databases and legacy systems. As a commercial data scientist, at some point, you'll probably find yourself in a situation where you'll have to work with data in a Kerberized environment, so, in this part of the chapter, we'll cover the basics of Kerberos - what it is, how it works, and how to use it.

Kerberos is a third-party authentication technique that's particularly useful where the primary form of communication is over a network, which makes it ideal for Apache Spark. It's used in preference to alternative methods of authentication, for example, username and password, because it provides the following benefits:

  • No passwords are stored in plain text in application configuration files

  • Facilitates centralized management of services, identities, and permissions

  • Establishes a mutual trust...
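
In practice, a headless Spark application in a Kerberized cluster usually authenticates from a keytab rather than an interactive kinit. The sketch below uses Hadoop's UserGroupInformation API; the principal and keytab path are placeholders for whatever your KDC administrator issues. When submitting to YARN, the --principal and --keytab options of spark-submit achieve a similar effect for the application as a whole.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)

// Obtain a ticket from the KDC using the service keytab (placeholder values)
UserGroupInformation.loginUserFromKeytab(
  "analytics/host.example.com@EXAMPLE.COM",
  "/etc/security/keytabs/analytics.keytab")

println(s"Logged in as: ${UserGroupInformation.getCurrentUser}")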

Security ecosystem


We will conclude with a brief rundown of some of the popular security tools we may encounter while developing with Apache Spark - and some advice about when to use them.

Apache Sentry

As the Hadoop ecosystem grows ever larger, products such as Hive, HBase, HDFS, Sqoop, and Spark all have different security implementations. This means that duplicate policies are often required across the product stack in order to provide the user with a seamless experience, as well as to enforce the overarching security manifest. This can quickly become complicated and time-consuming to manage, which often leads to mistakes and even security breaches (whether intentional or otherwise). Apache Sentry pulls many of the mainstream Hadoop products together, particularly with Hive/HS2, to provide fine-grained (up to column-level) controls.

Using ACLs is simple but high-maintenance. Setting permissions for a large number of new files and amending umasks is cumbersome and time-consuming...
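
For completeness, here is what a coarse-grained HDFS ACL of the kind described above looks like when applied programmatically through the Hadoop FileSystem API; the hdfs dfs -setfacl and -getfacl shell commands achieve the same result. This sketch assumes ACLs are enabled on the NameNode (dfs.namenode.acls.enabled), and the path, group, and user names are placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.{AclEntry, AclEntryScope, AclEntryType, FsAction}
import scala.collection.JavaConverters._

val fs = FileSystem.get(new Configuration())
val secureDir = new Path("/data/secure")   // placeholder path

val entries = List(
  // Give the analytics group read and list access to the directory
  new AclEntry.Builder()
    .setScope(AclEntryScope.ACCESS)
    .setType(AclEntryType.GROUP)
    .setName("analytics")
    .setPermission(FsAction.READ_EXECUTE)
    .build(),
  // Give a single named user the same read-only access
  new AclEntry.Builder()
    .setScope(AclEntryScope.ACCESS)
    .setType(AclEntryType.USER)
    .setName("alice")
    .setPermission(FsAction.READ_EXECUTE)
    .build()
).asJava

fs.modifyAclEntries(secureDir, entries)
println(fs.getAclStatus(secureDir))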

Your security responsibility


Now that we've covered the common security use cases and discussed some of the tools that a data scientist needs to be aware of in their everyday activities, there's one last important item to note. While data is in their custody, the responsibility for it, including its security and integrity, lies with the data scientist. This is usually true whether or not you are explicitly told so. Therefore, it is crucial that you take this responsibility seriously and take all the necessary precautions when handling and processing data. If needed, also be ready to communicate their responsibilities to others. We all need to ensure that we are not held responsible for a breach off-site; this can be achieved by highlighting the issue or, indeed, by having a written contract with the off-site service provider outlining their security arrangements. To see a real-world example of what can go wrong when you don't pay proper attention to due diligence, have a look at some security notes...

Summary


In this chapter, we have explored the topic of data security and explained some of the surrounding issues. We have discovered that not only is there technical knowledge to master, but a data security mindset is just as important. Data security is often overlooked; taking a systematic approach and educating others is therefore a key part of mastering data science.

We have explained the data security life cycle and outlined the most important areas of responsibility, including authorization, authentication and access, along with related examples and use cases. We have also explored the Hadoop security ecosystem and described the important open source solutions currently available.

A significant part of this chapter was dedicated to building a Hadoop InputFormat compressor that operates as a data encryption utility for use with Spark. Appropriate configuration allows the codec to be used in a variety of key areas, crucially when spilling shuffled records...
