Data Manipulation

Students cheer on the Redhawks during a sporting event at Miami University.

After the data is cleaned, the dplyr package can be used for data manipulation. If you are familiar with SQL, then you may find many similar functionalities when using dpylr functions. Throughout this page examples will be shown using the mtcars dataset, which has 32 observations and 11 variables. Though not a large dataset, it comes pre-loaded in R and it's contents can be used to nicely illustrate how dpylr functions can manipulate data.

tbl_df

When a datasets contains many rows and/or columns, it can be difficult to understand what you are actually working with. The tbl_df() function turns a dataframe structure into a tbl structure (short for "tibble"). Using a tibble structure can make it much easier to navigate, view, and manipulate the contents of a dataset because every column is known to be associated with a variable and every row is known to be associated with an observation.

Example:

library(dplyr)  # load package
data("mtcars")  # load mtcars dataset using data() function
mtcars

##                   mpg  cyl  disp  hp    drat   wt   qsec   vs   am   gear   carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fletwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

tbl_df(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 2 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 4 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 5 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 6 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## # with 26 more rows

glimpse()

As mentioned above, tbl stands for "tibble." Besides the data structure, there is also a package called tibble, which contains a glimpse() function. This is another way to look at a dataset without printing  out all the contents.

Example:

glimpse(mtcars)

## Observations: 32
## Variables: 11
## $ mpg   21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl   6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp  160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp    110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat  3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt    2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec  16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs    0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am    1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear  4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb  4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

6 Key Data Manipulation Functions in dplyr

The six functions are:

  • select()
  • mutate()
  • filter()
  • arrange()
  • group_by()
  • summarize()

select() and mutate() are used on variables (columns) within a dataset. filter() and arrange() are used on the observations (rows) in a dataset. arrange() and summarize() are used across a number of rows and columns.

Note: For all six of these functions, the data must be converted to a tibble format.

In addition to these functions, we will discuss a short-form way to code in R. Piping (i.e., using the symbolic notation %>%) is a coding tool that does not require you to explicitly reference the object upon which you are wanting to perform some action. This tool will be further discussed at the end of the page.

select()

The select() function is used to extract only certain columns from a table, just like in SQL.

Example:

select(mtcars, cyl, mpg)

##                     cyl  mpg
## Mazda RX4             6 21.0
## Mazda RX4 Wag         6 21.0
## Datsun 710            4 22.8
## Hornet 4 Drive        6 21.4
## Hornet Sportabout     8 18.7
## Valiant               6 18.1
## Duster 360            8 14.3
## Merc 240D             4 24.4
## Merc 230              4 22.8
## Merc 280              6 19.2
## Merc 280C             6 17.8
## Merc 450SE            8 16.4
## Merc 450SL            8 17.3
## Merc 450SLC           8 15.2
## Cadillac Fleetwood    8 10.4
## Lincoln Continental   8 10.4
## Chrysler Imperial     8 14.7
## Fiat 128              4 32.4
## Honda Civic           4 30.4
## Toyota Corolla        4 33.9
## Toyota Corona         4 21.5
## Dodge Challenger      8 15.5
## AMC Javelin           8 15.2
## Camaro Z28            8 13.3
## Pontiac Firebird      8 19.2
## Fiat X1-9             4 27.3
## Porsche 914-2         4 26.0
## Lotus Europa          4 30.4
## Ford Pantera L        8 15.8
## Ferrari Dino          6 19.7
## Maserati Bora         8 15.0
## Volvo 142E            4 21.4

The variables cyl and mpg are selected as the only columns from the mtcars dataset.

Note: The dash (-) sign can be used with select() to take out columns in a range that are not needed.

Below are arguments that can be used with select().

starts_with()

Selects the columns that start with a given string.

Example:

select(mtcars, starts_with("c"))

##                     cyl carb
## Mazda RX4             6    4
## Mazda RX4 Wag         6    4
## Datsun 710            4    1
## Hornet 4 Drive        6    1
## Hornet Sportabout     8    2
## Valiant               6    1
## Duster 360            8    4
## Merc 240D             4    2
## Merc 230              4    2
## Merc 280              6    4
## Merc 280C             6    4
## Merc 450SE            8    3
## Merc 450SL            8    3
## Merc 450SLC           8    3
## Cadillac Fleetwood    8    4
## Lincoln Continental   8    4
## Chrysler Imperial     8    4
## Fiat 128              4    1
## Honda Civic           4    2
## Toyota Corolla        4    1
## Toyota Corona         4    1
## Dodge Challenger      8    2
## AMC Javelin           8    2
## Camaro Z28            8    4
## Pontiac Firebird      8    2
## Fiat X1-9             4    1
## Porsche 914-2         4    2
## Lotus Europa          4    2
## Ford Pantera L        8    4
## Ferrari Dino          6    6
## Maserati Bora         8    8
## Volvo 142E            4    2

ends_with()

Selects the columns that end with a given string.

Example:

select(mtcars, ends_with("p"))

##                      disp  hp
## Mazda RX4           160.0 110
## Mazda RX4 Wag       160.0 110
## Datsun 710          108.0  93
## Hornet 4 Drive      258.0 110
## Hornet Sportabout   360.0 175
## Valiant             225.0 105
## Duster 360          360.0 245
## Merc 240D           146.7  62
## Merc 230            140.8  95
## Merc 280            167.6 123
## Merc 280C           167.6 123
## Merc 450SE          275.8 180
## Merc 450SL          275.8 180
## Merc 450SLC         275.8 180
## Cadillac Fleetwood  472.0 205
## Lincoln Continental 460.0 215
## Chrysler Imperial   440.0 230
## Fiat 128             78.7  66
## Honda Civic          75.7  52
## Toyota Corolla       71.1  65
## Toyota Corona       120.1  97
## Dodge Challenger    318.0 150
## AMC Javelin         304.0 150
## Camaro Z28          350.0 245
## Pontiac Firebird    400.0 175
## Fiat X1-9            79.0  66
## Porsche 914-2       120.3  91
## Lotus Europa         95.1 113
## Ford Pantera L      351.0 264
## Ferrari Dino        145.0 175
## Maserati Bora       301.0 335
## Volvo 142E          121.0 109

contains()

Selects the columns that contain a given string.

Example:

select(mtcars, contains("s"))

##                      disp  qsec vs
## Mazda RX4           160.0 16.46  0
## Mazda RX4 Wag       160.0 17.02  0
## Datsun 710          108.0 18.61  1
## Hornet 4 Drive      258.0 19.44  1
## Hornet Sportabout   360.0 17.02  0
## Valiant             225.0 20.22  1
## Duster 360          360.0 15.84  0
## Merc 240D           146.7 20.00  1
## Merc 230            140.8 22.90  1
## Merc 280            167.6 18.30  1
## Merc 280C           167.6 18.90  1
## Merc 450SE          275.8 17.40  0
## Merc 450SL          275.8 17.60  0
## Merc 450SLC         275.8 18.00  0
## Cadillac Fleetwood  472.0 17.98  0
## Lincoln Continental 460.0 17.82  0
## Chrysler Imperial   440.0 17.42  0
## Fiat 128             78.7 19.47  1
## Honda Civic          75.7 18.52  1
## Toyota Corolla       71.1 19.90  1
## Toyota Corona       120.1 20.01  1
## Dodge Challenger    318.0 16.87  0
## AMC Javelin         304.0 17.30  0
## Camaro Z28          350.0 15.41  0
## Pontiac Firebird    400.0 17.05  0
## Fiat X1-9            79.0 18.90  1
## Porsche 914-2       120.3 16.70  0
## Lotus Europa         95.1 16.90  1
## Ford Pantera L      351.0 14.50  0
## Ferrari Dino        145.0 15.50  0
## Maserati Bora       301.0 14.60  0
## Volvo 142E          121.0 18.60  1

matches()

matches() is very similar to contains(); however, whereas contains() searches for a given string that can also contain other characters before and after the given string, matches() only looks for an exact match.

Example:

select(mtcars, matches("vs"))

##                     vs
## Mazda RX4            0
## Mazda RX4 Wag        0
## Datsun 710           1
## Hornet 4 Drive       1
## Hornet Sportabout    0
## Valiant              1
## Duster 360           0
## Merc 240D            1
## Merc 230             1
## Merc 280             1
## Merc 280C            1
## Merc 450SE           0
## Merc 450SL           0
## Merc 450SLC          0
## Cadillac Fleetwood   0
## Lincoln Continental  0
## Chrysler Imperial    0
## Fiat 128             1
## Honda Civic          1
## Toyota Corolla       1
## Toyota Corona        1
## Dodge Challenger     0
## AMC Javelin          0
## Camaro Z28           0
## Pontiac Firebird     0
## Fiat X1-9            1
## Porsche 914-2        0
## Lotus Europa         1
## Ford Pantera L       0
## Ferrari Dino         0
## Maserati Bora        0
## Volvo 142E           1

num_range()

Selects the entries in a column within the given range.

Example:

newCol = mtcars
names(newCol) = c('x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11')
select(newCol, num_range("x", 2:5))

##                     x2    x3  x4   x5
## Mazda RX4            6 160.0 110 3.90
## Mazda RX4 Wag        6 160.0 110 3.90
## Datsun 710           4 108.0  93 3.85
## Hornet 4 Drive       6 258.0 110 3.08
## Hornet Sportabout    8 360.0 175 3.15
## Valiant              6 225.0 105 2.76
## Duster 360           8 360.0 245 3.21
## Merc 240D            4 146.7  62 3.69
## Merc 230             4 140.8  95 3.92
## Merc 280             6 167.6 123 3.92
## Merc 280C            6 167.6 123 3.92
## Merc 450SE           8 275.8 180 3.07
## Merc 450SL           8 275.8 180 3.07
## Merc 450SLC          8 275.8 180 3.07
## Cadillac Fleetwood   8 472.0 205 2.93
## Lincoln Continental  8 460.0 215 3.00
## Chrysler Imperial    8 440.0 230 3.23
## Fiat 128             4  78.7  66 4.08
## Honda Civic          4  75.7  52 4.93
## Toyota Corolla       4  71.1  65 4.22
## Toyota Corona        4 120.1  97 3.70
## Dodge Challenger     8 318.0 150 2.76
## AMC Javelin          8 304.0 150 3.15
## Camaro Z28           8 350.0 245 3.73
## Pontiac Firebird     8 400.0 175 3.08
## Fiat X1-9            4  79.0  66 4.08
## Porsche 914-2        4 120.3  91 4.43
## Lotus Europa         4  95.1 113 3.77
## Ford Pantera L       8 351.0 264 4.22
## Ferrari Dino         6 145.0 175 3.62
## Maserati Bora        8 301.0 335 3.54
## Volvo 142E           4 121.0 109 4.11

one_of()

This function displays only unique values.

Example:

select(mtcars, one_of(c('cyl', 'hp', 'drat')))

##                     cyl  hp drat
## Mazda RX4             6 110 3.90
## Mazda RX4 Wag         6 110 3.90
## Datsun 710            4  93 3.85
## Hornet 4 Drive        6 110 3.08
## Hornet Sportabout     8 175 3.15
## Valiant               6 105 2.76
## Duster 360            8 245 3.21
## Merc 240D             4  62 3.69
## Merc 230              4  95 3.92
## Merc 280              6 123 3.92
## Merc 280C             6 123 3.92
## Merc 450SE            8 180 3.07
## Merc 450SL            8 180 3.07
## Merc 450SLC           8 180 3.07
## Cadillac Fleetwood    8 205 2.93
## Lincoln Continental   8 215 3.00
## Chrysler Imperial     8 230 3.23
## Fiat 128              4  66 4.08
## Honda Civic           4  52 4.93
## Toyota Corolla        4  65 4.22
## Toyota Corona         4  97 3.70
## Dodge Challenger      8 150 2.76
## AMC Javelin           8 150 3.15
## Camaro Z28            8 245 3.73
## Pontiac Firebird      8 175 3.08
## Fiat X1-9             4  66 4.08
## Porsche 914-2         4  91 4.43
## Lotus Europa          4 113 3.77
## Ford Pantera L        8 264 4.22
## Ferrari Dino          6 175 3.62
## Maserati Bora         8 335 3.54
## Volvo 142E            4 109 4.11

mutate()

Adds a new column using an equation or logical test that incorporates data from other columns.

Example:

mutate(mtcars, hpCyl = hp + cyl)

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb hpCyl
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   116
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   116
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    97
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   116
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   183
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   111
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   253
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    66
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    99
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   129
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   129
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3   188
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3   188
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3   188
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   213
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   223
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4   238
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1    70
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2    56
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1    69
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   101
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2   158
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2   158
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   253
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2   183
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1    70
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2    95
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2   117
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4   272
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6   181
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8   343
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   113

A new column named hpCyl will be made that adds hp and cyl.

filter()

Filters out entries in a column based on a logical criterion; this chooses rows the same way that select() chooses columns.

Example:

filter(mtcars, cyl<8)

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 6  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 7  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 8  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 9  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 10 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 11 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 12 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 13 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 14 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 15 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 16 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 17 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 18 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

arrange()

arrange() sorts a dataset by the variable specified. If one or more extra variables are specified, then they are used for sub-sorting (i.e. tie breakers), with the variables entered first taking priority over those that come later.

Example:

arrange(mtcars, cyl, hp)

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 2  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 3  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 4  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 5  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 6  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 7  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 8  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 9  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 10 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## 11 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 12 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 13 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 14 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 15 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 16 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 17 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 18 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 19 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 20 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 21 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 22 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 23 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 24 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 25 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 26 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 27 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 28 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 29 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 30 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 31 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 32 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Notice how cyl is in sorted in order, and how hp is sorted within each group of cyl. This will play into our next function.

group_by()

Similar to group by in SQL, dplyr's group_by() rearranges the rows to be grouped by one or more variables provided, giving priority to those entered first. It is useful for rearranging the entire dataset by a variable without changing any values. It is primarily used in combination with our next function, summarize().

summarize()

Get a summary value for a column. Note that summarise() and summarize() do the same thing.

Example:

summarise(mtcars, mean(disp))

##      mean(disp)
## 1      230.7219

summarize(mtcars, mean(disp))

##      mean(disp)
## 1      230.7219

Now, look at what happens when we combine summarize() with group_by():

group_by(mtcars, cyl, hp) %>% summarise()

## Source: local data frame [23 x 2]
## Groups: cyl [?]
## 
##      cyl    hp
##     
## 1      4    52
## 2      4    62
## 3      4    65
## 4      4    66
## 5      4    91
## 6      4    93
## 7      4    95
## 8      4    97
## 9      4   109
## 10     4   113
## # ... with 13 more rows

Note: For a useful dplyr reference, see the RStudio cheatsheet at Help -> Cheatsheets -> Data Transformation with dplyr

Piping

In the previous example shown above, noticed we used a weird series of symbols, %>%, in between functions instead of nesting them like View(summarise(group_by(mtcars, cyl, hp))). This was an example of piping. Piping is used to connect lines of code to allow a series of commands to be executed more quickly and in a less memory-intensive manner. It can also be much easier to read, especially when dealing with more complex queries where each function can have multiple arguments, like those below. The lines are connected so that the result of the prior statement is the input into the next statement. This saves time and computer resources that would otherwise be spent defining many variables separately and require R to store a great deal of unnecessary information.

Example:

library(dplyr)
mtcars %>%
filter(hp < 500) %>%
mutate(hp1_1 = hp / 5) %>%
summarise(mean(hp))

## mean(hp)
## 1 146.6875

As shown above, the piping syntax can also be incorporated into a mixture of base R and dplyr functions. In the following example, the last instance of either transmission type for each size engine is created using the negation of the base R command duplicated() in combination with the dplyr commands group_by() and filter().

library(dplyr)
mtcars %>% 
  group_by(cyl) %>% 
  filter(!duplicated(am, fromLast = T))
  
  
## Source: local data frame [6 x 11]
## Groups: cyl [3]
## 
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##             
## 1  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4
## 2  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
## 3  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
## 4  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
## 5  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
## 6  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

The tidyverse Package

The tidyverse package is a collection of packages that work together for data management. All the packages in tidyverse use the same kinds of objects to manipulate data.

The packages in tidyverse are:

  • ggplot2
  • tibble
  • tidyr
  • readr
  • purr
  • dplyr

To look more into the tidyverse package, visit tidyverse.org.

Need a Refresher?

Go back to the beginner tutorials.