In this chapter, we will understand the following topics:
- Data wrangling
- Data masking
- Data security
If you have some experience working with data of any sort, you will recall that, most of the time, data needs to be preprocessed before it can be used as part of a larger analysis. This process is called data wrangling.
Let's see what the typical flow in this process looks like:
Let's try to understand these in detail.
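Before going phase by phase, here is a minimal sketch of the overall flow, using only the Python standard library. The field names and cleaning rules are illustrative assumptions, not part of any particular dataset:

```python
# A minimal data-wrangling sketch: acquire raw records, drop incomplete
# ones, and normalize the remaining fields. All names here are made up.
import csv
import io

RAW = """name,age,city
Alice,34,london
bob,,NEW YORK
Carol,29,Paris
"""

def wrangle(raw_csv: str) -> list:
    """Clean and normalize raw CSV records."""
    rows = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for row in rows:
        if not row["age"]:                     # drop records missing age
            continue
        cleaned.append({
            "name": row["name"].title(),       # normalize capitalization
            "age": int(row["age"]),            # cast to the proper type
            "city": row["city"].title(),
        })
    return cleaned

print(wrangle(RAW))
# The row with a missing age is dropped; the other two come out clean.
```

Real pipelines would typically use a library such as pandas or Spark, but the shape of the work is the same: acquire, filter, normalize, and type-cast.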
Even though it is not strictly a part of data wrangling, this phase deals with the process of acquiring data from somewhere. Typically, all data is generated and stored in a central location, or is available in files located on some shared storage...
Businesses that deal with customer data have to make sure that the PII (personally identifiable information) of these customers does not move freely around the entire data pipeline. This requirement applies not only to customer data but also to any other type of data that is considered classified, as per regulations and standards such as GDPR, SOX, and so on. In order to protect the privacy of customers, employees, contractors, and vendors, we need to take the necessary precautions to ensure that, as the data passes through several pipelines, users of the data see only anonymized data. The level of anonymization depends upon the standards the company adheres to, as well as the prevailing regulations of the country it operates in.
So, data masking can be defined as the process of hiding or transforming portions of the original data with other data, without losing its meaning or context.
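To make this concrete, here is a minimal masking sketch for two common PII fields. The formats and rules are illustrative assumptions; production systems would use vetted masking libraries or database-level masking features:

```python
# Simple format-preserving masks: the output keeps enough shape
# (domain, last four digits) to stay meaningful, but hides the rest.

def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, domain = email.split("@")
    return local[0] + "*" * (len(local) - 1) + "@" + domain

def mask_card(number: str) -> str:
    """Show only the last four digits of a card number."""
    digits = number.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_email("alice@example.com"))   # a****@example.com
print(mask_card("4111 1111 1111 1234"))  # ************1234
```

Note that both masks preserve context (an analyst can still group by email domain, or match the last four card digits) while the original values are no longer recoverable from the masked output.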
In this...
Data has become a very important asset for businesses when making critical decisions. As the complexity of the infrastructure that generates and uses this data has increased, it has become very important to have some control over the access patterns of this data. In the Hadoop ecosystem, we have Apache Ranger, an open source project that helps in managing the security of big data.
Apache Ranger is an application that enables data architects to implement security policies on a big data ecosystem. The goal of this project is to provide a unified way for all Hadoop applications to adhere to the security guidelines that are defined.
Here are some of the features of Apache Ranger:
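Ranger's core idea is resource-based policies: each policy ties a resource (a Hive table, an HDFS path, and so on) to the users and permissions allowed on it. The sketch below illustrates that idea in plain Python; the data structures and the `is_allowed` function are illustrative assumptions, not Ranger's actual API:

```python
# A simplified model of resource-based authorization, in the spirit of
# Ranger policies. All names here are hypothetical, for illustration only.
from dataclasses import dataclass, field

@dataclass
class Policy:
    resource: str                                   # e.g. a Hive table name
    users: set = field(default_factory=set)         # who the policy covers
    permissions: set = field(default_factory=set)   # e.g. {"read", "write"}

def is_allowed(policies, user, resource, action):
    """Grant access only if some policy explicitly permits it (default deny)."""
    return any(
        p.resource == resource and user in p.users and action in p.permissions
        for p in policies
    )

policies = [Policy("sales_db.orders", {"alice"}, {"read"})]
print(is_allowed(policies, "alice", "sales_db.orders", "read"))   # True
print(is_allowed(policies, "alice", "sales_db.orders", "write"))  # False
```

The important design point, which Ranger shares, is default deny: access is granted only when a policy explicitly allows it, so forgetting a policy fails safe.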
In this chapter, we learned about the different data life cycle stages, including when data is created, shared, maintained, archived, retained, and deleted.
This chapter gave you a detailed understanding of how big data is managed, given that it is often unstructured or semi-structured, arrives at a fast rate, and comes in large volumes.
As the complexity of the infrastructure that generates and uses data in business organizations has increased drastically, it has become imperative to secure your data properly. This chapter further covered data security tools, such as Apache Ranger, and patterns to help us learn how to have control over the access patterns of data.
In the next chapter, we will take a look at Hadoop installation, its architecture, and its key components.