Cleaning Data

Students cheer on the Redhawks during a sporting event at Miami University.

Raw data often must first be cleaned before it is useable. The tidyr package makes managing “dirty data” easier and more efficient. This page details the four main commands in tidyr and provides a brief description of a couple of additional frequently used commands. After cleaning the data, further manipulation is typically needed. The Data Manipulation page provides additional information on this.

Note: A key-value pair is comprised of the key, which is the name or identifier of the variable and its respective value. It is the simplest way to record data. 

gather() 

This function is the first main command in the tidyr package. It collapses multiple columns into key-value pairs. It is often used when the column names are characteristics of a specific variable of interest. In other words, column names should be entries in a single column. Using this function will transform the data table into a long format, rather than a wide one. The basic syntax is below:

Parameters

  • data: dataset to be used
  • key: name for the key part in the key-value pair
  • value: column name for value

Example:

 library(tidyr)
x <- data.frame(A = c(-1, -1, 1, 1), B = c(-1, 1, -1, 1), replicate1 = c("y11", "y12", "y13", "y14")
, replicate2 = c("y21", "y22", "y23", "y24")
, replicate3 = c("y31", "y32", "y33", "y34"))
x
## A B replicate1 replicate2 replicate3
## 1 -1 -1 y11 y21 y31
## 2 -1 1 y12 y22 y32
## 3 1 -1 y13 y23 y33
## 4 1 1 y14 y24 y34
gatherx = gather(x, factor, replicate, -A, -B)
gatherx
## A B factor replicate
## 1 -1 -1 replicate1 y11
## 2 -1 1 replicate1 y12
## 3 1 -1 replicate1 y13
## 4 1 1 replicate1 y14
## 5 -1 -1 replicate2 y21
## 6 -1 1 replicate2 y22
## 7 1 -1 replicate2 y23
## 8 1 1 replicate2 y24
## 9 -1 -1 replicate3 y31
## 10 -1 1 replicate3 y32
## 11 1 -1 replicate3 y33
## 12 1 1 replicate3 y34

spread()

This function is the second main command in the tidyr package. It takes a key-value pair and spreads it over multiple columns, creating a data table with a wide format, rather than a long one. 

spread(data, key, value)

Example:

 spreadx <- spread(gatherx, factor, replicate)
spreadx

## A B replicate1 replicate2 replicate3
## 1 -1 -1 y11 y21 y31
## 2 -1 1 y12 y22 y32
## 3 1 -1 y13 y23 y33
## 4 1 1 y14 y24 y34

Note : spread() and gather() are opposite functions

separate() 

This is the third main command in the tidyr package. When a column holds two variables, separate() can be used to split it into two columns. 

separate(data, col, into, sep = ",")

Parameters:

  • data: data to be used
  • col: column to separate
  • into: what to separate column into
  • sep: specify the operator

Example:


 separatex <- data.frame(A = c("-1/-1", "-1/1", "1/-1"), B = c("", "", "")
 , replicate1 = c("y11", "y12", "y13")
 , replicate2 = c("y21", "y22", "y23")
 , replicate3 = c("y31", "y32", "y33"))
 separatex
## A B replicate1 replicate2 replicate3 ## 1 -1/-1 y11 y21 y31 ## 2 -1/1 y12 y22 y32 ## 3 1/-1 y13 y23 y33 separate(separatex, A, c("A", "B"), sep = "/")
## A B B replicate1 replicate2 replicate3 ## 1 -1 -1 y11 y21 y31 ## 2 -1 1 y12 y22 y32 ## 3 1 -1 y13 y23 y33

unite()

This is the fourth main command in the tidyr package. unite() is used to bring two columns together.

unite(data, col, ..., xep = "_")

Example:

 unitex = unite(x, combination, replicate1, replicate2, replicate3, sep = "-")
unitex

## A B combination
## 1 -1 -1 y11-y21-y31
## 2 -1 1 y12-y22-y32
## 3 1 -1 y13-y23-y33
## 4 1 1 y14-y24-y24

Note: separate() and unite() are opposite functions.

fill()

This function can be used to replace a column of missing values with a vector of values provided by the user. 

 fillx <- data.frame(A = c(-1, -1, 1, 1), B = c(-1, 1, -1, 1)
, replicate1 = c("y11", "y12", NA, NA)
, replicate2 = c("y21", "y22", NA, NA)
, replicate3 = c("y31", "y32", NA, NA))

fill(fillx, replicate1:replicate3)
## A B replicate1 replicate2 replicate3 ## 1 -1 -1 y11 y21 y31 ## 2 -1 1 y12 y22 y32 ## 3 1 -1 y12 y22 y32 ## 4 1 1 y12 y22 y32

replace_na()

This function is commonly used for an incomplete dataset. It replaces any na values with the arguments provided.

replace_na(data, replace = list())

Parameters

  • data: data to be used
  • replace: list of columns and what to replace with

Example:

z <- data.frame(x = c(1, 2, NA), y = c(1, NA, 3))
replace_na(z, list(x = "un", y = "un"))

Tidyverse Package

The tidyverse package is a collection of packages that work together for data management. All the packages in tidyverse use the same kinds of objects to manipulate data.

The packages in tidyverse are:

  • ggplot2
  • tibble
  • tidyr
  • readr
  • purr
  • dplyr

To look more at this package, please the tidyverse website for it.

Need a Refresher?

Go back to the beginner tutorials.