Overview
Teaching: 10 min
Exercises: minQuestions
What is a facet in OpenRefine?
What is a filter in OpenRefine?
How can I use filters and facets to explore data OpenRefine?
How can I easily correct common data issues in my data with OpenRefine?
Objectives
Explain what Facets and Filters are
Answer simple questions about the content of a data set using Facets
Use facets and filters to work with a subset of data
Correct simple data problems through a facet
“Image of raw data of keywords for a record set”
“In this situation, we are choosing to transpose cells in rows, into columns in order to weed out duplicate keyterms attached to one record”
“Similar to splitting columns, you can also split multi-valued cells by a delimiter”
“Now each record is represented by a column of keywords”
“We can adjust the capitalization of the keywords by transforming the data in the cells to Titlecase”
“Keywords are now in title case”
Facets are one of the most useful features of OpenRefine and can help both get an overview of the data in a project as well as helping you bring more consistency to the data.
A ‘Facet’ groups all the values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time. The facet information always appears in the left hand panel in the OpenRefine interface.
The simplest type of Facet is called a ‘Text facet’. This simply groups all the text values in a column and lists each value with the number of records it appears in.
To create a Text Facet for a column, click on the drop down menu at the top of the publisher column and choose Facet -> Text Facet
. The facet will then appear in the left hand panel.
This facet consists of a list of values used in the data. You can filter the data displayed by clicking on one of these headings.
You can include multiple values from the facet in a filter at one time by using the Include
option which appears when you put your mouse over a value in the Facet.
You can also invert
the filter to show all records which do not match your selected values. This option appears at the top of the Facet panel when you select a value from the facet to apply as a filter.
Let’s create a text facet
- Click on the drop down menu at the top of the publisher column and choose
Facet > Text Facet
. The facet will then appear in the left hand panel- Filter by facet by clicking or
include
- Select multiple values to
include
- You can ‘invert’ your selections to
exclude
- Include a value and then look at top to invert inclusion.
“All of the facets will appear on the left side, from here you can directly edit the content of any of these facets by clicking on them”
As well as using Facets to filter the data displayed in OpenRefine you can also apply ‘Text Filters’ which looks for a particular piece of text appearing in a column. Text filters are applied by clicking the drop down menu at the top of the column you want to apply the filter to and choosing ‘Text filter’.
As with Facets, the Filter options appear in the left hand panel in OpenRefine. Simply type in the text you want to use in the Filter to display only rows which contain that text in the relevant column.
You can also use regular expressions in the filter.
It is very important to note that when you have filtered the data displayed in OpenRefine, any operations you carry out will apply only to the rows that match the filter - that is the data currently being displayed.
As well as ‘Text facets’ Refine also supports a range of other types of faceting. These include:
Numeric and Timeline facets display graphs instead of lists of values. The graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed.
Scatterplot facets are less commonly used - for further information on these see the tutorial at http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Exploring_the_data_with_scatter_plots
Custom facets are a range of different types of facets. Some of the default custom facets are:
“How to run duplicates facet”
“When ‘true’ is selected, OpenRefine will show the rows that contain duplicate values. You will notice that all empty rows are considered to have the value, NULL, and therefore are counted as duplicates”
Facets are intended to group together common values and OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory. If you create a facet where there are many unique values (for example, a facet on a ‘book title’ column in a data set that has one row per book) the facet created will be very large and may either slow down the application, or OpenRefine will not create the facet.
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings New York
and new york
are very likely to refer to the same concept and just have capitalization differences. Likewise, Gödel
and Godel
probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
Keyword 4
Text Facet we created in the earlier, click the Cluster
button.Method
and the Keying Function
. Try different combinations to
see what different mergers of values are suggested.nearest neighbor
method and levenshtein
distance function. It should identify one clusters.Merge?
box beside each, then click Merge Selected and Recluster
to apply the corrections to the dataset.Methods
and Keying Functions
again, to see what new merges are suggested. You may find there are
still improvements that can be made, but don’t Merge
again; just Close
when you’re done. We’ll now
see other operations that will help us detect and correct the remaining problems, and that have other, more general uses. “Switching Methods to key collision
allows you to identify other potential clusters. Adjusting the Ngram Size
will allow you to find the words that could be clustered based on natural language processing identifying the nearest possible phoneme/syllable/base pair”
When a table is imported into OpenRefine, all columns are treated as having text values. We saw earlier how we can sort column values as numbers, but this does not change the cells in a column from text to numbers. Rather, this interprets the values as numbers for the purposes of sorting but keeps the underlying data type as is. We can, however, transform columns to other data types (e.g. number or date) using the Edit cells
> Common transforms
feature. Here we will experiment changing columns to numbers and see what additional capabilities that grants us.
Be sure to remove any Text filter
facets you have enabled from the left panel so that we can examine our whole dataset. You can remove an existing facet by clicking the x
in the upper left of that facet window.
To transform cells in the recordID
column to numbers, click the down arrow for that column, then Edit cells
> Common transforms…
> To number
. You will notice the recordID
values change from left-justified to right-justified, and black to green color.
Key Points
You can use facets and filters to explore your data
You can use facets and filters work with a subset of data in OpenRefine
You can easily correct common data issues from a Facet