Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Using OpenRefine

You're reading from  Using OpenRefine

Product type Book
Published in Sep 2013
Publisher Packt
ISBN-13 9781783289080
Pages 114 pages
Edition 1st Edition
Languages

Introducing OpenRefine


Let's face a hard fact: your data are messy. All data are messy. Errors will always creep into large datasets no matter how much care you have put into creating them, especially when their creation has involved several persons and/or has been spread over a long timespan. Whether your data are born-digital or have been digitized, whether they are stored in a spreadsheet or in a database, something will always go awry somewhere in your dataset.

Acknowledging this messiness is the first essential step towards a sensible approach to data quality, which mainly involves data profiling and cleaning.

Data profiling is defined by Olson (Data Quality: The Accuracy Dimension, Jack E. Olson, Morgan Kaufman, 2003) as "the use of analytical techniques to discover the true structure, content, and quality of data". In other words, it is a way to get an assessment of the current state of your data and information about errors that they contain.

Data cleaning is the process that tries to correct those errors in a semi-automated way by removing blanks and duplicates, filtering and faceting rows, clustering and transforming values, splitting multi-valued cells, and so on.

Whereas custom scripts were formerly needed to perform data profiling and cleaning tasks, often separately, the advent of Interactive Data Transformation tools (IDTs) now allows for quick and inexpensive operations on large amounts of data inside a single integrated interface, even by domain professionals lacking in-depth technical skills.

OpenRefine is such an IDT; a tool for visualizing and manipulating data. It looks like a traditional, Excel-like spreadsheet software, but it works rather like a database, that is, with columns and fields rather than individual cells. This means that OpenRefine is not well suited for encoding new rows of data, but is extremely powerful when it comes to exploring, cleaning, and linking data.

The recipes gathered in this first chapter will help you to get acquainted with OpenRefine by reviewing its main functionalities, from import/export to data exploration and from history usage to memory management.

You have been reading a chapter from
Using OpenRefine
Published in: Sep 2013 Publisher: Packt ISBN-13: 9781783289080
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}