Home

Data

Big Data Forensics: Learning Hadoop Investigations

Book

eBook $39.99 $27.98

Print $48.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook $39.99 $27.98

Print $48.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Publication date:: August 2015
Publisher: Packt
Pages: 264
ISBN: 9781785288104

Chapter 1. Starting Out with Forensic Investigations and Big Data

Big Data forensics is a new type of forensics, just as Big Data is a new way of solving the challenges presented by large, complex data. Thanks to the growth in data and the increased value of storing more data and analyzing it faster—Big Data solutions have become more common and more prominently positioned within organizations. As such, the value of Big Data systems has grown, often storing data used to drive organizational strategy, identify sales, and many different modes of electronic communication. The forensic value of such data is obvious: if the data is useful to an organization, then the data is valuable to an investigation of that organization. The information in a Big Data system is not only inherently valuable, but the data is most likely organized and analyzed in such a way to identify how the organization treated the data.

Big Data forensics is the forensic collection and analysis of Big Data systems. Traditional computer forensics typically focuses on more common sources of data, such as mobile devices and laptops. Big Data forensics is not a replacement for traditional forensics. Instead, Big Data forensics augments the existing forensics body of knowledge to handle the massive, distributed systems that require different forensic tools and techniques.

Traditional forensic tools and methods are not always well-suited for Big Data. The tools and techniques used in traditional forensics are most commonly designed for the collection and analysis of unstructured data (for example, e-mail and document files). Forensics of such data typically hinges on metadata and involves the calculation of an MD5 or SHA-1 checksum. With Big Data systems, the large volume of data and how the data is stored do not lend themselves well to traditional forensics. As such, alternative methods for collecting and analyzing such data are required.

This chapter covers the basics of forensic investigations, Big Data, and how Big Data forensics is unique. Some of the topics that are discussed include the following:

Goals of a forensic investigation
Forensic investigation methodology
Big Data – defined and described
Key differences between traditional forensics and Big Data forensics

An overview of computer forensics

Computer forensics is a field that involves the identification, collection, analysis, and presentation of digital evidence. The goals of a forensic investigation include:

Properly locating all relevant data
Collecting the data in a sound manner
Producing analysis that accurately describes the events
Clearly presenting the findings

Forensics is a technical field. As such, much of the process requires a deep technical understanding and the use of technical tools and techniques. Depending on the nature of an investigation, forensics may also involve legal considerations, such as spoliation and how to present evidence in court.

Note

Unless otherwise stated, all references to forensics, investigations, and evidence in this book is in the context of Big Data forensics.

Computer forensics centers on evidence. Evidence is a proof of fact. Evidence may be presented in court to prove or disprove a claim or issue by logically establishing a fact. Many types of legal evidence exist, such as material objects, documents, and sworn testimony. Forensic evidence falls firmly in that legal set of categories and can be presented in court. In the broader sense, forensic evidence is the informational content of and about the data.

Forensic evidence comes in many forms, such as e-mails, databases, entire filesystems, and smartphone data. Evidence can be the information contained in the files, records, and other logical data containers. Evidence is not only the contents of the logical data containers, but also the associated metadata. Metadata is any information about the data that is stored by a filesystem, content management system, or other container. Metadata is useful for establishing information about the life of the data (for example, author and last modified date).

This metadata can be combined with the data to form a story about the who, what, why, when, where, and how of the data. Evidence can also take the form of deleted files, file fragments, and the contents of in-memory data.

For evidence to be court admissible or accepted by others, the data must be properly identified, collected, preserved, documented, handled, and analyzed. While the evidence itself is paramount, the process by which the data is identified, collected, and handled is also critical to demonstrate that the data was not altered in any way. The process should adhere to the best practices accepted by the court and backed by technical standards. The analysis and presentation must also adhere to best practices for both admissibility and audience comprehension. Finally, documentation of the entire process must be maintained and available for presentation to clearly demonstrate all the steps performed—from identification to collection to analysis.

The forensic process

The forensic process is an iterative process that involves four phases: identification, collection, analysis, and presentation. Each of the phases is performed sequentially. The forensic process can be iterative for the following reasons:

Additional data sources are required
Additional analyses need to be performed
Further documentation of the identification process is needed
Other situations, as required

The following figure shows the high-level forensic process discussed in this book:

Figure 1: The forensic process

Note

This book follows the forensic process of Electronic Discovery Reference Model (EDRM), which is the industry standard and is a court-accepted best practice. The EDRM is developed and maintained by forensic and electronic discovery (e-discovery) professionals. For more information, visit EDRM's website at http://www.edrm.net/.

Tip

The sets of forensic steps and goals should be attempted to be applied for every investigation. No two investigations are the same. As such, practical realities may dictate which steps are performed and which goals can be met.

The four steps in the forensic process and the goals for each are covered in the following sections:

Identification

Identifying and fully collecting the data of interest in the early stages of an investigation is critical to any successful project. If data is not properly identified and, subsequently, is not collected, an embarrassing and difficult process of corrective efforts will be required—at a minimum—not to mention wasted time. At worst, improperly identifying and collecting data will result in working with an incorrect or incomplete set of data. In the latter case, court sanctions, a lost investigation, and ruined reputations can be expected.

The high-level approach taken in this book starts with:

Examining the organization's system architecture
Determining the kinds of data in each system
Previewing the data
Assessing which systems are to be collected

In addition, the identification phase should also include a process to triage the data sources by priority, ensuring the data sources are not subsequently used and/or modified. This approach results in documentation to back up the claim that all potentially important sources of data were examined. It also provides assurance that no major systems were overlooked. The main considerations for each source are as follows:

Data quality
Data completeness
Supporting documentation
Validating the collected data
Previous systems where the data resided
How the data enters and leaves the system
The available formats for extraction
How well the data meets the data requirements

The following figure illustrates this high-level identification process:

Figure 2: Data identification process

The primary goals for the identification stage of an investigation are as follows:

Proper identification and documentation of potentially relevant sources of evidence
Complete documentation of identified sources of information
Timely assessment of potential sources of evidence from key stakeholders

Collection

The data collection phase involves the acquisition and preservation of evidence and validation information as well as properly documenting the process. For evidence to be court admissible and usable, it needs to be collected in a defensible manner that adheres to best practices. Collecting data alone, however, is not always sufficient in an investigation. The data should be accompanied by validation information (for example, log or query files) and documentation of the collection and preservation steps performed. Together, the collected data, validation information, and documentation allow for proper analysis that can be validated and defended.

The following figure highlights the collection phase process:

Figure 3: Data collection process

Data collection is a critical phase in a digital investigation. The data analysis phase can be rerun and corrected, if needed. However, improperly collecting data may result in serious issues later during analysis, if the error is detected at all. If the error goes undetected, the improper collection will result in poor data for the analysis. For example, if the collection was only a partial collection, the analysis results may understate the actual values. If the improper collection is detected during the analysis process, recollecting data may be impossible. This is the case when the data has been subsequently purged or is no longer available because the owner of the data will not permit access to the data again. In short, data collection is critical for later phases of the investigation, and there may not be opportunities to perform it again.

Data can be collected using several different methods. These methods are as follows:

Physical collection: A physical acquisition of every bit, which may be done across specific containers, volumes, or devices. The collection is an exact replica of every bit of data and metadata. Slack space and deleted files can be recovered using this method.
Logical collection: An acquisition of active data. The collection is a replica of the informational content and metadata, but is not a bit-by-bit collection.
Targeted collection: A collection of specific containers, volumes, or devices.

Each of the methods is covered in this book. Validation information serves as a means for proving what was collected, who performed the collection, and how all relevant data was captured. Validation is also crucial to the collection phase and later stages of an investigation. Collecting the relevant data is the primary goal of any investigation, but the validation information is critical for ensuring that the relevant data was collected properly and not modified later. Obviously, without the data, the entire process is moot.

A closely-related goal is to collect the validation information along with the data. The primary forms of validation information are MD5/SHA-1 hash values, system and process logs, and control totals. Both MD5 and SHA-1 are hash algorithms that generate a unique value based on the contents of the file that serves as a fingerprint and can be used to authenticate evidence. If a file is modified, the MD5 or SHA-1 of the modified file will not match the original. In fact, generating two different files with the same value is virtually impossible. For this reason, forensic investigators rely on MD5 or SHA-1 to prove that the evidence was successfully collected and that the data analyzed matches the original source data. Control totals are another form of validation information, which are values computed from a structured data source—such as the number of rows or sum value of a numeric field. All collected data should be validated in some manner during the collection phase before moving into the analysis.

Note

Collect validation information simultaneously during or immediately after collecting evidence to ensure accurate and reliable validation.

The goals of the collection phase are as follows:

Forensically sound collection of relevant sources of evidence utilizing technical best practices and adhering to legal standards
Full, proper documentation of the collection process
Collection of verification information (for example, MD5 or control totals)
Validation of collected evidence
Maintenance of chain of custody

Analysis

The analysis phase is the process by which collected and validated evidence is examined to gather and assemble the facts of an investigation. Many tools and techniques exist for converting the volumes of evidence into facts. In some investigations, the requirements clearly and directly point to the types of evidence and facts that are needed. These investigations may involve only a small amount of data or the issues are straightforward. For example, they only require a specific e-mail or only a small timeframe is in question. Other investigations, however, are large and complex. The requirements do not clearly identify a direct path of inquiry. The tools and techniques in the analysis phase are designed for both types of investigations and guide the inquiry.

The process for analyzing forensic evidence is dependent on the requirements of the investigation. Every case is different, so the analysis phase is both a science and an art. Most investigations are bounded by some known facts, such as a specific timeframe or the individuals involved. The analysis for such bounded investigations can begin by focusing on data from those time periods or involving those individuals. From there, the analysis can expand to include other evidence for corroboration or a new focus. Analysis can be an iterative process of investigating a subset of information. Analysis can also focus on one theory but then expand to either include new evidence or to form a new theory altogether. Regardless, the analysis should be completed within the practical confines of the investigation.

Two of the primary ways in which forensic analysis is judged are completeness and bias. Completeness, in forensics, is a relative term based on whether the relevant data has been reasonably considered and analyzed. Excluding relevant evidence or forms of analysis harms the credibility of the analysis. The key point is the reasonableness of including or excluding evidence and analysis. Bias is closely related to completeness. Bias is prejudice towards or against a particular thing. In the case of forensic analysis, bias is an inclination to favor a particular line of thinking without giving equal weight to other theories. Bias should be eliminated or minimized as much as possible when performing analysis to guarantee completeness and objective analysis. Both completeness and bias are covered in subsequent chapters.

Another key concept is data reduction. Forensic investigations can involve terabytes of data and millions of files and other data points. The practical realities of an investigation may not allow for a complete analysis of all data. Techniques exist for reducing the volume of data to a more manageable amount. This is performed using known facts and data interrelatedness to triage data by priority or eliminate data from the set of data to be analyzed.

Cross-validation is the use of multiple analyses or pieces of evidence to corroborate analysis. This is a key concept in forensics. While not always possible, cross-validation adds veracity to findings by further proving the likelihood that a finding is true. Cross-validation should be performed by independently testing two data sets or forms of analysis and confirming that the results are consistent.

The types of analysis performed depend on a number of factors. Forensic investigators have an arsenal of tools and techniques for analyzing evidence, and those tools and techniques are chosen based on the requirements of the investigation and the types of evidence. One example is timeline analysis, which is a technique used when chronology is important and chronological information exists and can be established. Timeline analysis is not important in all investigations, so it is not useful in every investigation.

In other cases, pattern analysis or anomaly detection may be required. While some investigations only require a single tool or technique, most investigations require a combination of tools and techniques. Later chapters include information about the various tools and techniques and how to select the proper ones. The following questions can help an investigator determine which tools and techniques to choose:

What are the requirements of the investigation?
What practical limitations exist?
What information is available?
What is already known about the evidence?

Documentation of findings and the analysis process must be carefully maintained throughout the process. Forensic evidence is complex. Analyzing forensic evidence can be even more complex. Without proper documentation, the findings are unclear and not defensible. An investigator can go down a path of analyzing data and related information—sometimes, linking hundreds of findings—and without documentation, detailing the full analysis is impossible. To avoid this, an investigator needs to carefully detail the evidence involved, the analysis performed, the analysis findings, and the interrelationships between multiple analyses.

The primary goals of the analysis phase are as follows:

Unbiased and objective analysis
Reduction of data complexity
Cross-validation of findings
Application of accepted standards

Presentation

The final phase in the forensic process is the presentation of findings. The findings can be presented in a number of different ways, such as a written expert report, graphical presentations, or testimony. Regardless of the format, the key to a successful presentation is to clearly demonstrate the findings and the process by which the findings were derived. The process and findings should be presented in a way that the audience can easily understand. Not every piece of information about the process phases or findings needs to be presented. Instead, the focus should be on the critical findings at a level of detail that is sufficiently thorough. Documentation, such as chain of custody forms, may not need to be included but should still be available should the need arise.

The goals of the presentation phase are as follows:

Clear, compelling evidence
Analysis that separates the signal from the noise
Proper citation of source evidence
Availability of chain of custody and validation documentation
Post-investigation data management

Other investigation considerations

This book details the majority of the EDRM forensic process. However, investigators should be aware of several additional considerations not covered in detail in this book. Forensics is a large field with many technical, legal, and procedural considerations. Covering every topic would span multiple volumes. As such, this book does not attempt to cover all concepts. The following sections highlight several key concepts that a forensic investigator should consider—equipment, evidence management, investigator training, and the post-investigation process.

Equipment

Forensic investigations require specialized equipment for the collection and processing of evidence. Source data can reside on a host of different types of systems and devices. An investigator may need to collect several different types of systems. These include cell phones, mainframe computers, laptops with various operating systems, and database servers. These devices have different hardware and software connectors, different means of accessing, different configurations, and so on. In addition, an investigator must be careful not to alter or destroy evidence in the collection process. A best practice is to employ write-blocker software or physical devices to ensure that evidence is preserved in its original state. In some instances, specialized forensic equipment should be used to perform the collections, such as forensic devices that connect to smartphones for acquisitions. Big Data investigations rarely involve this specialized equipment to collect the data, but encrypted drives and other forensic devices may be used. Forensic investigators should be knowledgeable about the required equipment and come prepared to collect data with a forensic kit that contains the required equipment.

Evidence management

The management of forensic evidence is also critical to maintaining proper control and security of the evidence. Forensic evidence, once collected, requires careful handling, storage, and documentation. A standard practice in forensics is to create and maintain chain of custody of all evidence. Chain of custody documentation is a chronological description that details the collection, handling, transfer, analysis, and destruction of evidence. The chain of custody is established when a forensic investigator first acquires the data. The documentation details the collection process and then serves as a log of all individuals who take possession of the evidence, when that person had possession of the evidence, and details about what was done to the evidence. Chain of custody documentation should always reflect the full history and current status of the evidence. Chain of custody is further discussed in later chapters.

Only authorized individuals should have access to the evidence. Evidence integrity is critical for establishing and maintaining the veracity of findings. Allowing unauthorized—or undocumented—access to evidence can cast doubt on whether the evidence was altered. Even if the MD5 hash values are later found to match, allowing unauthorized access to the evidence can be enough to call the investigative process into question.

Security is important for preventing unauthorized access to both original evidence and analysis. Physical and digital security both play important roles in the overall security of evidence. The security of evidence should cover the premises, the evidence locker, any device that can access the analysis server, and network connections. Forensic investigators should be concerned with two types of security: physical security and digital security.

Physical security is the collection of devices, structural design, processes, and other means for ensuring that unauthorized individuals cannot access, modify, destroy, or deny access to the data. Examples of physical security include locks, electronic fobs, and reinforced walls in the forensic lab.
Digital security is the set of measures to protect the evidence on devices and on a network. Evidence can contain malware that could infect the analysis machine. A networked forensic machine that collects evidence remotely can potentially be penetrated. Examples of digital security include antivirus software, firewalls, and ensuring that forensic analysis machines are not connected to a network.

Investigator training and certification

Forensic investigators are often required to take forensic training and maintain current certifications in order to conduct investigations and testify to the results. While this is not always required, investigators can further prove that he has proper technical expertise by way of such training and certification. Forensic investigators are forensic experts, so that expertise should be documented and provable should anyone question their credentials. This can be achieved in part by way of training and certification.

The post-investigation process

After an investigation concludes, the evidence and analysis findings need to be properly archived or destroyed. Criminal and civil investigations require that evidence be maintained for a mandated period of time. The investigator should be aware of such retention rules and ensure that evidence is properly and securely archived and maintained for that period of time. In addition, documentation and analysis should be retained as well to guarantee that the results of the investigation are not lost and to prevent issues arising from questions about the evidence (for example, chain of custody).

What is Big Data?

Big Data describes the tools and techniques used to manage and process data that traditional means cannot easily accomplish. Many factors have led to the need for Big Data solutions. These include the recent proliferation of data storage, faster and easier data transfer, increased awareness of the value of data, and social media. Big Data solutions were needed to address the rapid, complex, and voluminous data sets that have been created in the past decade. Big Data can be structured data (for example, databases), unstructured data (such as e-mails), or a combination of both.

The four Vs of Big Data

A widely-accepted set of characteristics of Big Data is the four Vs of data. In 2001, Doug Laney of META Group produced a report on the needs of the changing requirements for managing the forms of voluminous data. In this report, he defined the three Vs of data: volume, velocity, and variety. These factors address the following:

The large data sets
The increased speed at which the data arrives, requires storage, and should be analyzed
The multitude of forms the data, such as financial records, e-mails, and social media data

This definition has been expanded to include a fourth V for veracity—the trustworthiness of the data quality and the data's source.

Tip

One way to identify whether a data set is Big Data is to consider the four Vs.

Volume is the most obvious characteristic of Big Data. The amount of data produced has grown exponentially over the past three decades, and that growth has been fueled by better and faster communications networks and cheaper storage. In the early 1980s, a gigabyte of storage costs over $200,000. A gigabyte of storage today costs approximately $0.06. This massive drop in storage costs and the highly networked nature of devices provides a means to create and store massive volumes of data. The computing industry now talks about the realities of exabytes (approximately, one billion gigabytes) and zettabytes (approximately, one trillion gigabytes) of data—possibly even yottabytes (over a thousand trillion gigabytes). Data volumes have obviously grown, and Big Data solutions are designed to handle the voluminous data sets through distributed storage and computing to scale out to the growing data volumes. The distributed solutions provide a means for storing and analyzing massive data volumes that could not feasibly be stored or computer by a single device.

Velocity is another characteristic of Big Data. The value of the information contained in data has placed an increased emphasis on quickly extracting information from data. The speed at which social media data, financial transactions, and other forms of data are being created can outpace traditional analysis tools. Analyzing real-time social media data requires specialized tools and techniques for quickly retrieving, storing, transforming, and analyzing the information. Tools and techniques designed to manage high-speed data also fall into the category of Big Data solutions.

Variety is the third V of Big Data. A multitude of different forms of data are being produced. The new emphasis is on extracting information from a host of different data sources. This means that traditional analysis is not always sufficient. Video files and their metadata, social media posts, e-mails, financial records, and telephonic recordings may all contain valuable information, and the data need to be analyzed in conjunction with one another. These different forms of data are not easily analyzed using traditional means.

Traditional data analysis focuses on transactional data or so-called structured data for analysis in a relational or hierarchical database. Structured data has a fixed composition and adheres to rules about what types of values it can contain. Structured data are often thought of in terms of records or rows, each with a set of one or more columns or fields. The rows and columns are bound by defined properties, such as the data type and field width limitations. The most common forms of structured data are:

Database records
Comma-Separated Value (CSV) files
Spreadsheets

Traditional analysis is performed on structured data using databases, programs, or spreadsheets to load the data into a fixed format and run a set of commands or queries on the data. SQL has been the standard database language for data analysis over the past two decades—although many other languages and analysis packages exist.

Unstructured and semi-structured data do not have the same fixed data structure rules and do not lend themselves well to traditional analysis. Unstructured data is data that is stored in a format that is not expressly bound by the same data format and content rules as structured data. Several examples of unstructured data are:

E-mails
Video files
Presentation documents

Note

According to VMWare's 2013 Predictions for Big Data, over 80% of data produced will be unstructured, and the growth rate of unstructured data is 50-60% per year.

Semi-structured data is data that has rules for the data format and structure, but those rules are too loose for easy analysis using traditional means for analyzing structured data. XML is the most common form of semi-structured data. XML has a self-describing structure, but the structure of one XML file is not adhered to across all other XML files.

The variety of Big Data comes from the incorporation of a multitude of different types of data. Variety can mean incorporating structured, semi-structured, and unstructured data, but it can also mean simply incorporating various forms of structured data. Big Data solutions are designed to analyze whatever type of data is required. Regardless of the types of data are incorporated, the challenge for Big Data solutions is being able to collect, store, and analyze various forms of data in a single solution.

Veracity is the fourth V of Big Data. Veracity, in terms of data, indicates whether the informational content of data can be trusted. With so many new forms of data and the challenge of quickly analyzing a massive data set, how does one trust that the data is properly formatted, has correct and complete information, and is worth analyzing? Data quality is important for any analysis. If the data is lacking in some way, all the analyses will be lacking. Big Data solutions address this by devising techniques for quickly assessing the data quality and appropriately incorporating or excluding the data based on the data quality assessment results.

Big Data architecture and concepts

The architectures for Big Data solutions vary greatly, but several core concepts are shared by most solutions. Data is collected and ingested in Big Data solutions from a multitude of sources. Big Data solutions are designed to handle various types and formats of data, and the various types of data can be ingested and stored together. The data ingestion system brings the data in for transformation before the data is sent to the storage system. Distribution of storage is important for the storage of massive data sets. No single device can possibly store all the data or be expected to not experience failure as a device or on one of its disks. Similarly, computational distribution is critical for performing the analysis across large data sets with timeliness requirements. Typically, Big Data solutions enact a master/worker system—such as MapReduce—whereby one computational system acts as the master to distribute individual analyses for the worker computational systems to complete. The master coordinates and manages the computational tasks and ensures that the worker systems complete the tasks.

The following figure illustrates a high-level Big Data architecture:

Figure 4: Big Data overview

Big Data solutions utilize different types of databases to conduct the analysis. Because Big Data can include structured, semi-structured, and/or unstructured data, the solutions need to be capable of performing the analysis across various types of files. Big Data solutions can utilize both relational and nonrelational database systems. NoSQL (Not only SQL) databases are one of the primary types of nonrelational databases used in Big Data solutions. NoSQL databases use different data structures and query languages to store and retrieve information. Key-value, graph, and document structures are used by NoSQL. These types of structures can provide a better and faster method for retrieving information about unstructured, semi-structured, and structured data.

Two additional important and related concepts for many Big Data solutions are text analytics and machine learning. Text analytics is the analysis of unstructured sets of textual data. This area has grown in importance with the surge in social media content and e-mail. Customer sentiment analysis, predictive analysis on buyer behavior, security monitoring, and economic indicator analysis are performed on text data by running algorithms across their data. Text analytics is largely made possible by machine learning. Machine learning is the use of algorithms and tools to learn from data. Machine algorithms make decisions or predictions from data inputs without the need for explicit algorithm instructions.

Video files and other nontraditional analysis input files can be analyzed in a couple ways:

Using specialized data extraction tools during data ingestion
Using specialized techniques during analysis

In some cases, only the unstructured data's metadata is important. In others, content from the data needs to be captured. For example, feature extraction and object recognition information can be captured and stored for later analysis. The needs of the Big Data system owner dictate the types of information captured and which tools are used to ingest, transform, and analyze the information.

Big Data forensics

The changes to the volumes of data and the advent of Big Data systems have changed the requirements of forensics when Big Data is involved. Traditional forensics relies on time-consuming and interruptive processes for collecting data. Techniques central to traditional forensic include removing hard drives from machines containing source evidence, calculating MD5/SHA-1 checksums, and performing physical collections that capture all metadata. However, practical limitations with Big Data systems prevent investigators from always applying these techniques. The differences between traditional forensics and forensics for Big Data are covered and explained in this section.

One goal of any type of forensic investigation is to reliably collect relevant evidence in a defensible manner. The evidence in a forensic investigation is the data stored in the system. This data can be the contents of a file, metadata, deleted files, in-memory data, hard drive slack space, and other forms. Forensic techniques are designed to capture all relevant information. In certain cases—especially when questions about potentially deleted information exist—the entire filesystem needs to be collected using a physical collection of every individual bit from the source system. In other cases, only the informational content of a source filesystem or application system are of value. This situation arises most commonly when only structured data systems—such as databases—are in question, and metadata or slack space are irrelevant or impractical to collect. Both types of collection are equally sound; however, the application of the type of collection depends on both practical considerations and the types of evidence required for collection.

Big Data forensics is the identification, collection, analysis, and presentation of the data in a Big Data system. The practical challenges of Big Data systems aside, the goal is to collect data from distributed filesystems, large-scale databases, and the associated applications. Many similarities exist between traditional forensics and Big Data forensics, but the differences are important to understand.

Tip

Every forensic investigation is different. When choosing how to proceed with collecting data, consider the investigation requirements and practical limitations.

Metadata preservation

Metadata is any information about a file, data container, or application data that describes its attributes. Metadata provides information about the file that may be valuable when questions arise about how the file was created, modified, or deleted. Metadata can describe who altered a file, when a file was revised, and which system or application generated the data. These are crucial facts when trying to understand the life cycle and story of an individual file.

Metadata is not always crucial to a Big Data investigation. Metadata is often altered or lost when data flows into and through a Big Data system. The ingestion engines and data feeds collect the data without preserving the metadata. The metadata would thus not provide information about who created the data, when the data was last altered in the upstream data source, and so on. Collecting information in these cases may not serve a purpose. Instead, upstream information about how the data was received can be collected as an alternative source of detail.

Investigations into Big Data systems can hinge on the information in the data and not the metadata. Like structured data systems, metadata does not serve a purpose when an investigation is solely based on the content of the data. Quantitative and qualitative questions can be answered by the data itself; metadata in that case would not be useful, so long as the collection was performed properly and no questions exist about who imported and/or altered the data in the Big Data system. The data within the systems is the only source of information.

Tip

Collecting upstream information from application logs, source systems, and/or audit logs can be used in place of metadata collection.

Collection methods

Big Data systems are large, complex systems with business requirements. As such, they may not be able to be taken offline for a forensic investigation. In traditional forensics, systems can be taken offline, and a collection is performed by removing the hard drive to create a forensic copy of the data. In Big Data investigations, hundreds or thousands of storage hard drives may be involved, and data is lost when the Big Data system is brought offline. Also, the system may need to stay online due to business requirements. Big Data collections usually require logical and targeted collection methods by way of logical file forensic copies and query-based collection.

Collection verification

Traditional forensics relies on MD5 and SHA-1 to verify the integrity of the data collected, but it is not always feasible to use hashing algorithms to verify Big Data collections. Both MD5 and SHA-1 are disk-access intensive. Verifying collections by computing an MD5 or SHA-1 hash comprises a large percentage of the time dedicated to collecting and verifying source evidence. Spending the time to calculate the MD5 and SHA-1 for a Big Data collection may not be feasible when many terabytes of data are collected. The alternative is to rely on control totals, collection logs, and other descriptive information to verify the collection.

Summary

This book is an introduction to the key concepts and current technologies involved in Big Data forensics. Big Data is a paradigm shift in how data is stored and managed, and the same is true for forensic investigations of Big Data. A foundational understanding of computer forensics is important to understand the process and methods used in investigating digital information. Designed as a how-to guide, this book provides practical guidance on how to conduct investigations utilizing current technology and tools. Rather than rely on general principles or proprietary software, this books presents practical solutions utilizing freely-available software where possible. Several commercial software packages are also discussed to provide guidance and other ideas on how to tackle Big Data forensics investigations.

The field of forensics is large and continues to evolve. The field is new, and the technologies continue to change and develop. The constant growth in Big Data technologies leads to change in the tools and technologies for forensic investigations. Most of the tools presented in this book were developed in the past five years. Regardless of the tools used, this book is designed to provide readers with practical guidance on how to conduct investigations and select the appropriate tools.

This book focuses on performing forensics on Hadoop systems and Hadoop-based data. Hadoop is a framework for Big Data, and many software packages are built on top of Hadoop. This book covers the Hadoop filesystem and several of the key software packages that are built on top of Hadoop, such as Hive and HBase. A freely available Linux-based Hadoop virtual machine, LightHadoop, is used in this book to present examples of collecting and analyzing Hadoop data that can be followed by the reader.

Each of the stages of the forensic process is discussed in detail using practical Hadoop examples. Chapter 2, Understanding Hadoop Internals and Architecture details the Hadoop architecture and installing LightHadoop as a test environment. The remaining chapters cover each of the phases of the forensic process and the most common Hadoop packages that a forensic investigator will encounter.