You're reading from Julia for Data Science

Product type: Book
Published in: Sep 2016
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781785289699
Edition: 1st Edition
Author (1)
Anshul Joshi

Anshul Joshi is a data scientist with experience in recommendation systems, predictive modeling, neural networks, and high performance computing. His research interests encompass deep learning, artificial intelligence, and computational physics. Most of the time, he can be caught exploring GitHub or trying anything new he can get his hands on. You can also follow his personal blog.

Chapter 2. Data Munging

It is often said that around 50% of a data scientist's time goes into transforming raw data into a usable format. Raw data can come in any format or size: structured, as in an RDBMS; semi-structured, as in CSV files; or unstructured, as in plain text files. All of these can contain valuable information, but to extract it, the data must first be converted into a data structure or format from which an algorithm can derive insights. A usable format, therefore, is one in which the data can be consumed by the data science process, and what counts as usable differs from use case to use case.

This chapter will guide you through data munging, or the process of preparing the data. It covers the following topics:

  • What is data munging?

  • DataFrames.jl

  • Uploading data from a file

  • Finding the required data

  • Joins and indexing

  • Split-Apply-Combine strategy

  • Reshaping the data

  • Formula (ModelFrame and ModelMatrix)

  • PooledDataArray

  • Web scraping

What is data munging?


The word munging comes from "munge," a term coined by students at the Massachusetts Institute of Technology, USA. Data munging is considered one of the most essential parts of the data science process: it involves collecting, aggregating, cleaning, and organizing data so that it can be consumed by algorithms designed to make discoveries or build models. This takes numerous steps, including extracting data from its source and then parsing or transforming it into a predefined data structure. Data munging is also referred to as data wrangling.
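As a rough illustration of the "parse into a predefined data structure" step, the following sketch cleans a few raw text lines and converts them into typed records. It is written in plain Python purely for illustration and is not DataFrames.jl code; the `name,age,city` field layout, the sample lines, and the `Person` and `parse_line` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical raw input: the "name,age,city" layout is an assumption
# made for this sketch, not something taken from the book.
RAW_LINES = [
    "alice, 34, Pune",
    "bob, , Delhi",          # age is missing
    "  carol,29,Mumbai ",    # stray whitespace
    "not a valid record",    # wrong number of fields
]


@dataclass
class Person:
    """The predefined structure every raw line is converted into."""
    name: str
    age: Optional[int]  # None marks a missing value
    city: str


def parse_line(line: str) -> Optional[Person]:
    """Clean one raw line and parse it; return None if it is malformed."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        return None
    name, age_text, city = parts
    age = int(age_text) if age_text.isdigit() else None
    return Person(name=name.title(), age=age, city=city)


# Keep only the lines that could be parsed into the target structure.
records = [p for p in (parse_line(line) for line in RAW_LINES) if p is not None]

for record in records:
    print(record)
```

Malformed lines are dropped here and a missing age is kept as `None`; a real pipeline might instead log or repair such rows, depending on the use case.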

The data munging process

So what does the data munging process look like? As mentioned, data can be in any format, and the data science process may require data from multiple sources. The data aggregation phase can include scraping data from websites, downloading thousands of .txt or .log files, or gathering data from RDBMS or NoSQL data stores.

It is very rare to find data in a format that can be used directly by the data science process...

What is a DataFrame?


A DataFrame is a data structure that has labeled columns, which individually may have different data types. Like a SQL table or a spreadsheet, it has two dimensions. It can also be thought of as a list of dictionaries, but fundamentally, it is different.
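To make the comparison with a list of dictionaries concrete, here is a small plain-Python sketch (not DataFrames.jl code). It holds the same hypothetical records in two layouts: row-oriented, as a list of dictionaries, and column-oriented, as one labeled list per column. Treating the column-oriented layout as the closer analogy to a DataFrame is an assumption of this sketch, not a statement from the book; the sample names and values are made up.

```python
# Row-oriented view: a list of dictionaries, one dictionary per record.
rows = [
    {"name": "Alice", "age": 34, "score": 88.5},
    {"name": "Bob", "age": 41, "score": 92.0},
    {"name": "Carol", "age": 29, "score": 79.5},
]

# Column-oriented view: one labeled list per column, all of equal length.
columns = {label: [row[label] for row in rows] for label in rows[0]}

# Two dimensions: number of rows and number of labeled columns.
n_rows = len(rows)
n_cols = len(columns)

# Each column can hold a different data type.
column_types = {label: type(values[0]).__name__ for label, values in columns.items()}

# Column-wise work needs only the column's label, not a loop over records.
mean_age = sum(columns["age"]) / n_rows

print(n_rows, n_cols)
print(column_types)
print(mean_age)
```

With the column-oriented layout, a whole column is reached by its label in one step, which is the kind of access statistical work tends to need.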

DataFrames are the recommended data structure for statistical analysis. Julia provides a package called DataFrames.jl, which has all the functions necessary to work with DataFrames.

Julia's package, DataFrames, provides three data types:

  • NA: A missing value in Julia is represented by a specific data type, NA.

  • DataArray: The array type defined in the standard Julia library has many features but doesn't provide functionality specific to data analysis. The DataArray type provided by DataFrames.jl adds such features (for example, the ability to store missing values in an array).

  • DataFrame: A DataFrame is a 2-D data structure, like a spreadsheet. It is much like the data frames of R or pandas, and provides many functionalities...
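The idea of an array that can hold missing values can be sketched in a few lines of plain Python. This only illustrates the concept and is not the DataFrames.jl API: `NA` below is simply Python's `None` standing in for a missing-value marker, and `mean_skipping_na` is a hypothetical helper.

```python
from typing import Optional, Sequence

# Stand-in for a missing value; NOT the NA type from DataFrames.jl.
NA = None


def mean_skipping_na(values: Sequence[Optional[float]]) -> Optional[float]:
    """Return the mean of the non-missing values, or NA if none are present."""
    present = [v for v in values if v is not NA]
    if not present:
        return NA
    return sum(present) / len(present)


# An array in which some observations are missing.
ages = [34, NA, 29, 41, NA]

n_missing = sum(1 for v in ages if v is NA)
mean_age = mean_skipping_na(ages)

print(n_missing)
print(mean_age)
```

Keeping an explicit missing marker, rather than a placeholder number such as 0 or -1, lets a computation decide for itself whether to skip, fill, or reject the missing entries.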

Summary


In this chapter, we learned what data munging is and why it is necessary for data science. Julia provides functionalities to facilitate data munging with the DataFrames.jl package, with features such as these:

  • NA: A missing value in Julia is represented by a specific data type, NA.

  • DataArray: The DataArray type provided by DataFrames.jl adds features such as the ability to store missing values in an array.

  • DataFrame: A DataFrame is a 2-D data structure, like a spreadsheet. It is very similar to the data frames of R or pandas, and provides many functionalities to represent and analyze data. DataFrames have many features well suited to data analysis and statistical modeling.

  • A dataset can have different types of data in different columns.

  • Values in the same row are related to one another across the different columns, and all columns have the same length.

  • Columns can be labeled. Labels help us become familiar with the data and access columns without needing to remember their numerical indices.

We learned...
