Overview
Teaching: 15 min
Exercises: 10 minQuestions
How can we find and correct errors in our raw data?
How can we separate our data?
Objectives
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.
Parse data by utilizing a regular expression.
If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the pieces into their own columns.

Authors with Affiliations column into separate columns for each Author and their Affiliation.Authors with Affiliations column. Choose Edit Column > Split into several columns...Separator box, enter a semi-colon.Remove this column.OK. You’ll get some new columns called Authors with Affiliations 1, Authors with Affiliations 2, and so on.Exercise
Try to change the name of the second new column to “Authors”. How can you correct the problem you encounter?
Solution
On the
Authors with Affiliations 2column, click the down arrow and thenEdit column>Rename this column. Type “Authors” into > > the box that appears. A pop-up will appear that saysAnother column already named Authors. This is because there is another column where we’ve recorded all of the author names. You can choose another name likeAuthor Affiliationorauthor_affiliationfor this > > column.
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo and Redo operations to make this easy.
Undo / Redo on the left side of the screen. All the changes you have made so far are listed here, with the newest changes at the bottom of the list.Authors with Affiliations is not split.Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Words with spaces at the beginning or end are particularly hard for us humans to tell from strings without them, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.
Authors with Affiliations, choose Edit cells > Common transforms > Trim leading and trailing whitespace.Split step has now disappeared from the Undo / Redo pane on the left and is replaced with a Text transform on # cellsSplit operation on Authors with Affiliations that you undid earlier.Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
If specific data in a column needs to be parsed out from the rest of the data in a column, you can use the OpenRefine feature called “Add Column Based on This Column” which will allow you to use simple language, or more complex language, to create a new column with the desired data.
Exercise
- Let us suppose we want to parse the
Authors with Affiliationscolumns we created earlier so that only the author’s affiliation is listed.- Click the down arrow at the top of the
Authorscolumn. ChooseEdit Column>Add Column Based on This Column...- In the pop-up, in the
New Column Namebox, enter author_institution.- Copy/paste the below text into the available box (this text is called a regular expression):
import re pattern = re.compile(r"((university|college|institute).+?),", re.I) list = [] for i in pattern.findall(value): list.append(i[0]) return ":::".join(list)
- In the dropdown menu, select the option that says
Python/Jython.- Click
OK. You’ll get some new columns calledauthor_institution 1,author_institution 2, and so on.- Notice that in some cases there are multiple affiliations listed in the same cell. Why is this? What do you think we can do to fix this?
Solution
The regular expression that we used searches for specific keywords and then parses or “pulls” out the text surrounding those keywords. This means that if an author listed more than one affiliation, the expression will pull all of them out of the original text and join them back together in the new column we created. Let’s move onto the next lesson, where we discuss Faceting and Clustering, to find out how we can fix this issue.
Key Points
Removing leading and trailing whitespace from data can make for easier searching and sorting.
Parsing data using regular expressions, which can be simple or complex, can remove unwanted text quickly.