Managing Data

After data is brought into a workspace, data management can begin. This includes editing variables, changing the dataset (sorting, merging, reshaping, etc.), and taking a subset of the data, that is, selecting the part that meets certain criteria.

Throughout this process, we will use the syntax and operators introduced in previous sections.

Changing Entries

There are many ways to edit the variables in a dataset. In particular, we can create new variables and add them to a data frame with the assignment operator <-. If we are not satisfied with how the data is currently coded, we can always recode it. Creating new variables is shown below:

testData <- data.frame(x = c(1:10), y = c(11:20))  # dataframe created
testData
##     x  y
## 1   1 11
## 2   2 12
## 3   3 13
## 4   4 14
## 5   5 15
## 6   6 16
## 7   7 17
## 8   8 18
## 9   9 19
## 10 10 20
testData$sum <- testData$x + testData$y      # variable that adds the first two columns
testData$product <- testData$x * testData$y  # variable that multiplies the first two columns
testData
##     x  y sum product
## 1   1 11  12      11
## 2   2 12  14      24
## 3   3 13  16      39
## 4   4 14  18      56
## 5   5 15  20      75
## 6   6 16  22      96
## 7   7 17  24     119
## 8   8 18  26     144
## 9   9 19  28     171
## 10 10 20  30     200

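If attach() is used on the dataset, its variables can be referred to without the dataset name. The detach() function then removes the dataset from the search path, after which the dataset name is required again. Note that assigning a new column still requires the dataset name, because attach() works on a copy of the data. The same code from above is shown below with attach() and detach():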

testData <- data.frame(x = c(1:10), y = c(11:20))  # dataframe created
testData
##     x  y
## 1   1 11
## 2   2 12
## 3   3 13
## 4   4 14
## 5   5 15
## 6   6 16
## 7   7 17
## 8   8 18
## 9   9 19
## 10 10 20
attach(testData)           # for omitting the dataset name
testData$sum <- x + y      # a variable that adds the first two columns
testData$product <- x * y  # a variable that multiplies the first two columns
testData
##     x  y sum product
## 1   1 11  12      11
## 2   2 12  14      24
## 3   3 13  16      39
## 4   4 14  18      56
## 5   5 15  20      75
## 6   6 16  22      96
## 7   7 17  24     119
## 8   8 18  26     144
## 9   9 19  28     171
## 10 10 20  30     200
detach(testData)  # again, for using the dataset name
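
One caveat with attach() is that columns added after the call are not visible by bare name until the data frame is attached again. As a minimal sketch of an alternative, the base R functions with() and transform() give the same shorthand for a single expression without touching the search path:

```r
# Rebuild the example data frame so this block runs on its own
testData <- data.frame(x = 1:10, y = 11:20)

# with() evaluates one expression using the columns of testData
testData$sum <- with(testData, x + y)

# transform() returns a copy of the data frame with the new column added
testData <- transform(testData, product = x * y)

head(testData, 3)
```

Because nothing is attached, there is also nothing to detach afterwards.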

Defining Factor Types

If a column holds values that really represent group membership (such as numbers that fall into one category or another), we can create a new column that labels each row with its group as text. For example:

attach(testData)
testData$overUnder[product < 100] <- "Bad"    # label products under 100
testData$overUnder[product >= 100] <- "Good"  # label products of 100 or more
testData
##     x  y sum product overUnder
## 1   1 11  12      11       Bad
## 2   2 12  14      24       Bad
## 3   3 13  16      39       Bad
## 4   4 14  18      56       Bad
## 5   5 15  20      75       Bad
## 6   6 16  22      96       Bad
## 7   7 17  24     119      Good
## 8   8 18  26     144      Good
## 9   9 19  28     171      Good
## 10 10 20  30     200      Good
detach(testData)
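
The overUnder column created above holds plain text. If a true factor is wanted, one possible approach, sketched here with the base R functions ifelse(), factor(), and cut(), is the following (the column name overUnderCut is only illustrative):

```r
# Rebuild the relevant columns so this block runs on its own
testData <- data.frame(x = 1:10, y = 11:20)
testData$product <- testData$x * testData$y

# ifelse() labels every row in one step; factor() then converts the labels
testData$overUnder <- factor(
  ifelse(testData$product < 100, "Bad", "Good"),
  levels = c("Bad", "Good")
)

# cut() bins a numeric column directly; right = FALSE makes the
# intervals [-Inf, 100) and [100, Inf), so 100 itself counts as "Good"
testData$overUnderCut <- cut(
  testData$product,
  breaks = c(-Inf, 100, Inf),
  labels = c("Bad", "Good"),
  right = FALSE
)

table(testData$overUnder)
```

Both columns end up with the same labels; cut() tends to be more convenient when there are more than two groups.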

Rearranging

This section shows how to sort and merge data.

Sorting

testData$dollarAmount <- c(25, 60, 37, 57, 18, 36, 47, 37, 47, 80)
attach(testData)
sortedA <- testData[order(dollarAmount),]  # sorting in ascending order
sortedA2 <- sort(testData$dollarAmount)    # sort() returns only the sorted column
sortedA
##     x  y sum product overUnder dollarAmount
## 5   5 15  20      75       Bad           18
## 1   1 11  12      11       Bad           25
## 6   6 16  22      96       Bad           36
## 3   3 13  16      39       Bad           37
## 8   8 18  26     144      Good           37
## 7   7 17  24     119      Good           47
## 9   9 19  28     171      Good           47
## 4   4 14  18      56       Bad           57
## 2   2 12  14      24       Bad           60
## 10 10 20  30     200      Good           80
sortedA2
##  [1] 18 25 36 37 37 47 47 57 60 80
sortedD <- testData[order(-dollarAmount),]                  # sorting in descending order
sortedD2 <- sort(testData$dollarAmount, decreasing = TRUE)  # sort() returns only the sorted column
sortedD
##     x  y sum product overUnder dollarAmount
## 10 10 20  30     200      Good           80
## 2   2 12  14      24       Bad           60
## 4   4 14  18      56       Bad           57
## 7   7 17  24     119      Good           47
## 9   9 19  28     171      Good           47
## 3   3 13  16      39       Bad           37
## 8   8 18  26     144      Good           37
## 6   6 16  22      96       Bad           36
## 1   1 11  12      11       Bad           25
## 5   5 15  20      75       Bad           18
sortedD2
##  [1] 80 60 57 47 47 37 37 36 25 18
detach(testData)
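
The dollarAmount column contains ties (37 and 47 each appear twice). As a small sketch, order() accepts several keys, so a second column can decide the order within ties; the name sortedTies is only illustrative:

```r
# A reduced copy of the example data so this block runs on its own
testData <- data.frame(
  x = 1:10,
  dollarAmount = c(25, 60, 37, 57, 18, 36, 47, 37, 47, 80)
)

# Ascending by dollarAmount; ties are broken by x in descending order
sortedTies <- testData[order(testData$dollarAmount, -testData$x), ]
sortedTies$x
```

Within each tied amount, the row with the larger x now comes first.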

Merging

testDataset2 <- data.frame(numbers = c(1), moreNumbers = c(11))
testDataset2
##   numbers moreNumbers
## 1       1          11
totalSet <- merge(testData, testDataset2)  # no shared columns, so every row pairing is returned
totalSet
##     x  y sum product overUnder dollarAmount numbers moreNumbers
## 1   1 11  12      11       Bad           25       1          11
## 2   2 12  14      24       Bad           60       1          11
## 3   3 13  16      39       Bad           37       1          11
## 4   4 14  18      56       Bad           57       1          11
## 5   5 15  20      75       Bad           18       1          11
## 6   6 16  22      96       Bad           36       1          11
## 7   7 17  24     119      Good           47       1          11
## 8   8 18  26     144      Good           37       1          11
## 9   9 19  28     171      Good           47       1          11
## 10 10 20  30     200      Good           80       1          11
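
The merge above pairs every row of one dataset with every row of the other because the two share no column names. A more common case is joining on a shared key column. The sketch below assumes a hypothetical lookup table, called lookup here, that shares the column x with testData:

```r
# Rebuild the example data frame so this block runs on its own
testData <- data.frame(x = 1:10, y = 11:20)

# Hypothetical lookup table that shares the key column x with testData
lookup <- data.frame(x = 1:3, label = c("one", "two", "three"))

# Inner join: keeps only rows whose x appears in both data frames
inner <- merge(testData, lookup, by = "x")

# Left join: all.x = TRUE keeps every row of testData, filling gaps with NA
left <- merge(testData, lookup, by = "x", all.x = TRUE)

nrow(inner)  # 3
nrow(left)   # 10
```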

Note: rbind() and cbind() can be used to add a row or a column to a dataset. Information about these functions and about conditional operators can be found on the Basic Syntax page.
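
As a brief illustration of the two functions, using a small data frame and the illustrative names withRow and withCol:

```r
# A small data frame so this block runs on its own
testData <- data.frame(x = 1:3, y = 11:13)

# rbind() appends rows; the new row must use the same column names
withRow <- rbind(testData, data.frame(x = 4L, y = 14L))

# cbind() appends columns; the new column needs one value per row
withCol <- cbind(testData, z = c("a", "b", "c"))

dim(withRow)  # 4 rows, 2 columns
dim(withCol)  # 3 rows, 3 columns
```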

Creating a Subset of the Data

R has powerful indexing features for creating a subset of the data. The following code demonstrates how to select a row and how to delete a variable from a data frame.

totalSet[2,]        # gets all of row 2
##   x  y sum product overUnder dollarAmount numbers moreNumbers
## 2 2 12  14      24       Bad           60       1          11
totalSet$x <- NULL  # this excludes or removes the entire column
totalSet
##     y sum product overUnder dollarAmount numbers moreNumbers
## 1  11  12      11       Bad           25       1          11
## 2  12  14      24       Bad           60       1          11
## 3  13  16      39       Bad           37       1          11
## 4  14  18      56       Bad           57       1          11
## 5  15  20      75       Bad           18       1          11
## 6  16  22      96       Bad           36       1          11
## 7  17  24     119      Good           47       1          11
## 8  18  26     144      Good           37       1          11
## 9  19  28     171      Good           47       1          11
## 10 20  30     200      Good           80       1          11
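
Rows and columns can also be selected together, and a column can be left out without modifying the original data frame. The following is a sketch using standard base R indexing and subset(); the names highRows, highRows2, and noY are only illustrative:

```r
# Rebuild the relevant columns so this block runs on its own
testData <- data.frame(x = 1:10, y = 11:20)
testData$product <- testData$x * testData$y

# Rows where product is at least 100, keeping only two named columns
highRows <- testData[testData$product >= 100, c("x", "product")]

# subset() expresses the same selection with bare column names
highRows2 <- subset(testData, product >= 100, select = c(x, product))

# Leave out a column by name; testData itself is unchanged
noY <- testData[, setdiff(names(testData), "y")]
```

Unlike assigning NULL to a column, these forms return a new data frame and leave testData as it was.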

Sometimes you may want to subset the data by selecting only the rows that correspond to certain values. Suppose, for instance, that you wanted only the rows in the above example where dollarAmount was equal to $25, $57, or $80. You could then use the %in% operator as shown below.

totalSet[totalSet$dollarAmount %in% c(25, 57, 80), ]
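
To see what %in% does on its own, the sketch below applies it to the dollarAmount values as a plain vector (the names wanted and keep are only illustrative) and then negates the result to select everything else:

```r
dollarAmount <- c(25, 60, 37, 57, 18, 36, 47, 37, 47, 80)
wanted <- c(25, 57, 80)

# %in% returns one TRUE/FALSE per element of the left-hand vector
keep <- dollarAmount %in% wanted
which(keep)  # positions 1, 4 and 10

# Negate the whole test with ! to keep everything else
dollarAmount[!keep]
```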

Note: Base R commands are not always the best suited for managing data. Functions from the tidyr or dplyr packages are often a better option.
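
For comparison, here is a rough sketch of how some of the steps on this page might look with dplyr. It assumes the dplyr package is installed, and the block is skipped otherwise; the intermediate names step1 to step3 and result are only illustrative:

```r
# A reduced copy of the example data so this block runs on its own
testData <- data.frame(
  x = 1:10,
  y = 11:20,
  dollarAmount = c(25, 60, 37, 57, 18, 36, 47, 37, 47, 80)
)

# Assumption: dplyr is installed; nothing runs if it is not
if (requireNamespace("dplyr", quietly = TRUE)) {
  step1 <- dplyr::mutate(testData, product = x * y)          # add a column
  step2 <- dplyr::filter(step1, product >= 100)              # keep matching rows
  step3 <- dplyr::arrange(step2, dplyr::desc(dollarAmount))  # sort descending
  result <- dplyr::select(step3, x, product, dollarAmount)   # keep three columns
  print(result)
}
```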