Overview
Teaching: 15 min
Exercises: 10 minQuestions
How can we find and correct errors in our raw data?
How can we separate our data?
Objectives
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.
Parse data by utilizing a regular expression.
If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the pieces into their own columns.
Authors with Affiliations
column into separate columns for each Author and their Affiliation.Authors with Affiliations
column. Choose Edit Column
> Split into several columns...
Separator
box, enter a semi-colon.Remove this column
.OK
. You’ll get some new columns called Authors with Affiliations 1
, Authors with Affiliations 2
, and so on.Exercise
Try to change the name of the second new column to “Authors”. How can you correct the problem you encounter?
Solution
On the
Authors with Affiliations 2
column, click the down arrow and thenEdit column
>Rename this column
. Type “Authors” into > > the box that appears. A pop-up will appear that saysAnother column already named Authors
. This is because there is another column where we’ve recorded all of the author names. You can choose another name likeAuthor Affiliation
orauthor_affiliation
for this > > column.
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo
and Redo
operations to make this easy.
Undo / Redo
on the left side of the screen. All the changes you have made so far are listed here, with the newest changes at the bottom of the list.Authors with Affiliations
is not split.Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Words with spaces at the beginning or end are particularly hard for us humans to tell from strings without them, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.
Authors with Affiliations
, choose Edit cells
> Common transforms
> Trim leading and trailing whitespace
.Split
step has now disappeared from the Undo / Redo
pane on the left and is replaced with a Text transform on # cells
Split
operation on Authors with Affiliations
that you undid earlier.Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
If specific data in a column needs to be parsed out from the rest of the data in a column, you can use the OpenRefine feature called “Add Column Based on This Column” which will allow you to use simple language, or more complex language, to create a new column with the desired data.
Exercise
- Let us suppose we want to parse the
Authors with Affiliations
columns we created earlier so that only the author’s affiliation is listed.- Click the down arrow at the top of the
Authors
column. ChooseEdit Column
>Add Column Based on This Column...
- In the pop-up, in the
New Column Name
box, enter author_institution.- Copy/paste the below text into the available box (this text is called a regular expression):
import re pattern = re.compile(r"((university|college|institute).+?),", re.I) list = [] for i in pattern.findall(value): list.append(i[0]) return ":::".join(list)
- In the dropdown menu, select the option that says
Python/Jython
.- Click
OK
. You’ll get some new columns calledauthor_institution 1
,author_institution 2
, and so on.- Notice that in some cases there are multiple affiliations listed in the same cell. Why is this? What do you think we can do to fix this?
Solution
The regular expression that we used searches for specific keywords and then parses or “pulls” out the text surrounding those keywords. This means that if an author listed more than one affiliation, the expression will pull all of them out of the original text and join them back together in the new column we created. Let’s move onto the next lesson, where we discuss Faceting and Clustering, to find out how we can fix this issue.
Key Points
Removing leading and trailing whitespace from data can make for easier searching and sorting.
Parsing data using regular expressions, which can be simple or complex, can remove unwanted text quickly.