Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • What is OpenRefine useful for?

  • How can we bring our data into OpenRefine?

Objectives
  • Describe OpenRefine’s uses and applications.

  • Differentiate data cleaning from data organization.

  • Create a new OpenRefine project from a CSV file.

  • Experiment with OpenRefine’s user interface.

  • Locate helpful resources to learn more about OpenRefine.

Lesson

Features

Motivations for the OpenRefine Lesson

Before we get started

Note: OpenRefine is a Java program that runs on your machine (not in the cloud). It runs inside your browser, but no web connection is needed.

Follow the Setup instructions to install OpenRefine. Note: if you are going to be working with large datasets (using more than 3GB of RAM), you may need to install a Java memory extension by following these instructions.

After installation, open the OpenRefine.exe file, which will open a command window and your default browser. If the browser does not open automatically, point it at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
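If the browser does not open and you are unsure whether OpenRefine actually started, a quick check from Python can help. The following is a minimal sketch and not part of OpenRefine: the helper names `openrefine_url` and `is_reachable` are made up for illustration, and the host and port are simply the defaults given above.

```python
import urllib.error
import urllib.request


def openrefine_url(host="127.0.0.1", port=3333):
    """Build the local address that OpenRefine is expected to listen on."""
    return f"http://{host}:{port}/"


def is_reachable(url, timeout=2.0):
    """Return True if anything answers an HTTP request at url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: nothing is listening.
        return False


if __name__ == "__main__":
    url = openrefine_url()
    if is_reachable(url):
        print(f"Something is answering at {url}; open it in your browser.")
    else:
        print(f"Nothing is answering at {url}; is OpenRefine running?")
```

The check only tells you that some program is answering on that port. It does not confirm that the program is OpenRefine.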

Creating a Project

Start the program. (Double-click on the OpenRefine.exe file. Java services will start on your machine, and OpenRefine will open in your browser.)

Launch OpenRefine.

OpenRefine can import a variety of file types, including tab-separated (tsv), comma-separated (csv), Excel (xls, xlsx), JSON, XML, RDF as XML, and Google Spreadsheets. See the OpenRefine Importers page for more information.

In this first step, we’ll browse our computer for the sample data file for this lesson.

If you haven’t already, download the data from:
Google Sheets

Once OpenRefine is launched in your browser, the left margin has options to Create Project, Open Project, or Import Project. Here we will create a new project:

Exercise

  1. Click Create Project and select Get data from This Computer.
  2. Click Choose Files and select the file OpenRefine Workshop Sample Scopus Data.csv. Click Open or double-click on the filename.
  3. Click Next>> under the browse button to upload the data into OpenRefine.
  4. OpenRefine gives you a preview, a chance to confirm that it understood the file. If, for example, your file was really tab-delimited, the preview might look strange; in that case, choose the correct separator in the box shown and click Update Preview (bottom left). If this is the wrong file, click <<Start Over (upper left).
  5. If all looks well, click Create Project>> (upper right).
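The separator check in step 4 can also be tried outside OpenRefine. The sketch below is only an illustration of the idea, not how OpenRefine itself detects separators: `guess_separator` is a hypothetical helper, the candidate list is an assumption, and the sample rows are made up rather than taken from the lesson's data file.

```python
import csv
import io


def guess_separator(sample, candidates=(",", "\t", ";")):
    """Return the candidate that splits every row into the same number
    (more than one) of columns, or None if no candidate does."""
    best, best_width = None, 1
    for sep in candidates:
        rows = list(csv.reader(io.StringIO(sample), delimiter=sep))
        widths = {len(row) for row in rows if row}
        if len(widths) == 1:
            width = widths.pop()
            if width > best_width:
                best, best_width = sep, width
    return best


# Made-up sample rows, not taken from the lesson's data file.
tab_sample = (
    "Authors\tTitle\tYear\n"
    "Smith J.\tA first study\t2014\n"
    "Lee K.\tA second study\t2015\n"
)

# Parsed with the wrong separator, each row collapses into one wide cell.
print(next(csv.reader(io.StringIO(tab_sample), delimiter=",")))

# Parsed with the guessed separator, the columns line up.
sep = guess_separator(tab_sample)
print(next(csv.reader(io.StringIO(tab_sample), delimiter=sep)))
```

A preview that shows every row as a single long cell is the usual sign that the wrong separator was chosen.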

Note that at step 1, you could upload data in a standard form from a web address by selecting Get data from Web Addresses (URLs). However, this won’t work for all URLs.

Layout

OpenRefine displays data in a tabular format. Each row represents a ‘record’ in the data, while each column represents a type of information. This is very similar to how you might view data in a spreadsheet or database: individual bits of data are housed in ‘cells’ at the intersection of a row and a column. Most of the actions that you can perform in OpenRefine are centered around filtering, adjusting, cleaning, or manipulating data in each ‘record’ or column to meet your needs.
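For readers who think in code, the record, column, and cell vocabulary maps onto a few lines of Python. This is a rough analogy only: the data and column names below are invented for illustration, and OpenRefine's internal representation may differ.

```python
import csv
import io

# Made-up data for illustration; the column names are assumptions.
data = (
    "Authors,Title,Year\n"
    "Smith J.,A first study,2014\n"
    "Lee K.,A second study,2015\n"
)

records = list(csv.DictReader(io.StringIO(data)))

first_record = records[0]             # one row: a single record
years = [r["Year"] for r in records]  # one column: the same kind of information for every record
cell = records[1]["Title"]            # one cell: where a row and a column meet

print(first_record)
print(years)
print(cell)
```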

OpenRefine only displays a certain number of records at a time. The default display is 10 rows, but this can be changed to 5, 25, or 50 by selecting the appropriate display in the top right corner. Keep in mind that the more rows are displayed, the longer data with lots of columns may take to load. You can navigate through the rows by using the First/Previous/Next/Last arrows in the top right corner of the screen. The left side of the screen displays options for Faceting and the available Undo/Redo options.

Additional Resources

You can find out a lot more about OpenRefine at http://openrefine.org and check out some great introductory videos. There is a Google Group that can answer a lot of beginner questions and problems. There is also an OpenRefine Google Plus community where you can find a lot of help; many of its members are from the life sciences. As with other programs of this type, OpenRefine libraries are available too, where you can find a script you need and copy it into your OpenRefine instance to run it on your dataset.

Key Points