Overview
Teaching: 10 min
Exercises: 5 min
Questions
What is OpenRefine useful for?
How can we bring our data into OpenRefine?
Objectives
Describe OpenRefine’s uses and applications.
Differentiate data cleaning from data organization.
Create a new OpenRefine project from a CSV file.
Experiment with OpenRefine’s user interface.
Locate helpful resources to learn more about OpenRefine.
Note: OpenRefine is a Java program that runs on your machine (not in the cloud). It runs inside your browser, but no web connection is needed.
Follow the Setup instructions to install OpenRefine. Note: if you plan to work with large datasets (using more than 3 GB of RAM), you may need to install a Java memory extension by following these instructions.
After installation, open the OpenRefine.exe file, which will open a command window and your default browser. If the browser does not open automatically, point it at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Start the program. (Double-click the OpenRefine.exe file. Java services will start on your machine, and OpenRefine will open in your browser.)
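If the browser tab stays blank, it can help to check whether anything is listening on the local address mentioned above. The following is a minimal Python sketch, not part of OpenRefine itself; the function name `is_port_open` is made up for illustration, and the defaults assume the address 127.0.0.1 and port 3333 given in this lesson.

```python
import socket


def is_port_open(host="127.0.0.1", port=3333, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    This only tests that the port is reachable; it does not confirm
    that the program listening there is actually OpenRefine.
    """
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening at the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` with no arguments checks the default address. A result of `False` suggests that the program has not finished starting, or is not running at all.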
Launch OpenRefine.
OpenRefine can import a variety of file types, including tab-separated (tsv), comma-separated (csv), Excel (xls, xlsx), JSON, XML, RDF as XML, and Google Spreadsheets. See the OpenRefine Importers page for more information.
In this first step, we’ll browse our computer to the sample data file for this lesson.
If you haven’t already, download the data from:
Google Sheets
Once OpenRefine is launched in your browser, the left margin has options to Create Project, Open Project, or Import Project. Here we will create a new project:
Exercise
- Click Create Project and select Get data from This Computer.
- Click Choose Files and select the file OpenRefine Workshop Sample Scopus Data.csv. Click Open or double-click on the filename.
- Click Next>> under the browse button to upload the data into OpenRefine.
- OpenRefine gives you a preview, a chance to show you that it understood the file. If, for example, your file was really tab-delimited, the preview might look strange; in that case, choose the correct separator in the box shown and click Update Preview (bottom left). If this is the wrong file, click <<Start Over (upper left).
- If all looks well, click Create Project>> (upper right).
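If the preview looks strange and you are not sure which separator a file really uses, you can inspect it outside OpenRefine. The sketch below is one simple heuristic in Python, not the way OpenRefine itself detects separators: it tries each candidate and keeps the one that splits every sampled line into the same number of fields. The function name `guess_delimiter` and the sample column names are hypothetical.

```python
import csv
import io


def guess_delimiter(sample, candidates=",\t;"):
    """Guess the separator used in a text sample.

    A candidate wins if it splits every non-empty line into the same
    number of fields, and that number is larger than for any earlier
    candidate. Falls back to the first candidate if none qualifies.
    """
    best, best_width = candidates[0], 1
    for delim in candidates:
        rows = [r for r in csv.reader(io.StringIO(sample), delimiter=delim) if r]
        widths = {len(r) for r in rows}
        if len(widths) == 1:  # every line has the same field count
            width = widths.pop()
            if width > best_width:
                best, best_width = delim, width
    return best


# Hypothetical samples; these are not rows from the lesson's data file.
comma_sample = 'Authors,Title,Year\n"Smith, J.",A study,2015\n'
tab_sample = "Authors\tTitle\tYear\nSmith, J.\tA study\t2015\n"
print(repr(guess_delimiter(comma_sample)))  # ','
print(repr(guess_delimiter(tab_sample)))    # '\t'
```

To apply this to a real file, read its first few lines into a string and pass that string to `guess_delimiter`. A heuristic like this can be fooled by unusual files, so treat the result as a hint when choosing the separator in the preview.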
Note that at step 1, you could upload data in a standard form from a web address by selecting Get data from Web Addresses (URLs). However, this won’t work for all URLs.
OpenRefine displays data in a tabular format. Each row will represent a ‘record’ in the data, while each column represents a type of information. This is very similar to how you might view data in a spreadsheet or database, and individual bits of data are housed in ‘cells’ at the intersection of a row and a column. Most of the actions that you can perform in OpenRefine are centered around filtering, adjusting, cleaning, or manipulating data in each ‘record’ or column to meet your needs.
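The row, column, and cell model described above can be illustrated with a few lines of Python. This is only a sketch of the idea using the standard `csv` module; it says nothing about how OpenRefine stores data internally, and the sample columns and values are invented for illustration.

```python
import csv
import io

# Hypothetical three-column sample; not the lesson's actual data.
SAMPLE = "Authors,Title,Year\nSmith J.,A study,2015\nLee K.,Another study,2016\n"


def load_records(text):
    """Return (column_names, records).

    Each record is a dict mapping a column name to the cell value
    found at the intersection of that record's row and the column.
    """
    reader = csv.DictReader(io.StringIO(text))
    records = list(reader)
    return reader.fieldnames, records


columns, records = load_records(SAMPLE)
print(columns)              # ['Authors', 'Title', 'Year']
print(records[1]["Title"])  # Another study
```

Here each element of `records` plays the role of one row, each name in `columns` is one type of information, and `records[1]["Title"]` is a single cell.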
OpenRefine only displays a certain number of records at a time. The default display is 10 rows, but you can change this to 5, 25, or 50 by selecting the appropriate display in the top right corner. Keep in mind that the more rows are displayed, the longer data with many columns may take to load. You can navigate through the rows by using the First/Previous/Next/Last arrows in the top right corner of the screen. The left side of the screen displays the options for Faceting and the available Undo/Redo options.
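To get a feel for how the display setting affects navigation, the arithmetic behind paging can be sketched in Python. The helper names and the total of 508 rows are hypothetical, chosen only to make the numbers concrete.

```python
import math


def page_count(total_rows, rows_per_page=10):
    """Number of pages needed to show total_rows at rows_per_page each."""
    return math.ceil(total_rows / rows_per_page)


def page_bounds(page, total_rows, rows_per_page=10):
    """Return the 1-based (first, last) row numbers shown on a 1-based page."""
    first = (page - 1) * rows_per_page + 1
    last = min(page * rows_per_page, total_rows)
    return first, last


# Assuming a hypothetical project with 508 rows:
print(page_count(508, 10))       # 51
print(page_count(508, 50))       # 11
print(page_bounds(51, 508, 10))  # (501, 508)
```

With the default of 10 rows, reaching the end of such a project would take 51 pages, while the 50-row display needs only 11, at the cost of a heavier page.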
You can find out a lot more about OpenRefine at http://openrefine.org and check out some great introductory videos. There is a Google Group that can answer a lot of beginner questions and problems. There is also an OpenRefine Google Plus community where you can find a lot of help; many of its members come from the life sciences. As with other programs of this type, OpenRefine libraries are available too, where you can find a script you need and copy it into your OpenRefine instance to run it on your dataset.
Key Points
OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
OpenRefine will automatically track any steps you take in working with your data.