Chapter 9 Summarizing and subsetting data frames
In many cases, a data frame is composed of all sorts of different data types.
We can extract, process, and analyze all sorts of information in a data frame. We will learn how to do this using basic R commands.
9.1 Reading in the dataframe
We will be analyzing the amazing data set compiled by Collins, Gilley and Udry (2022), which tackles the question: Why do some of the candies have the same color but smell different?
To proceed, we will read in the data table included here.
Question 1
- Download the data frame and load it into R as an object called candy_table
- What are the classes of each column of data?
Files without an extension
In some cases, you will find that some files won’t have an extension (meaning there is no information after the file name indicating the file type, such as .txt, .csv, and others).
In this case, our recommendation is to open the file directly in RStudio, Atom, or any other text editor and check what type of file it is.
In the case of today’s file, you can easily see it’s a tab-delimited file.
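As a minimal sketch of reading such a file, the example below writes a tiny tab-delimited file without an extension to a temporary location and reads it back with read.table. The file contents and the object name toy_table are made up for illustration; they are not the course data.

```r
# Create a small tab-delimited file WITHOUT an extension (illustrative data only)
tmp <- tempfile()
writeLines(c("color\tflavor\tscent",
             "red\tstrawberry\tsugary, fruity",
             "pink\tstrawberry\tcrayon"),
           tmp)

# Peek at the raw lines first to confirm what the delimiter is
readLines(tmp, n = 2)

# header = TRUE: the first line holds the column names
# sep = "\t": columns are separated by tabs, so "sugary, fruity" stays in one cell
toy_table <- read.table(tmp, header = TRUE, sep = "\t")
toy_table

# Clean up the temporary file
unlink(tmp)
```

read.delim(tmp) would behave the same way here, since its defaults are header = TRUE and sep = "\t".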
## color flavor scent type scent.category
## 1 red strawberry sugary, fruity skittle fruit
## 2 red cherry waxy, crayon starburst non-food
## 3 red chocolate old milk m&m non-food
## 4 red choc/peanut faint chocolate peanut m&m chocolate
## 5 orange choc/peanut faint peanut peanut m&m chocolate
## 6 orange orange strong sugary fruit skittle fruit
## 7 yellow lemon sugar, faint citrus skittle fruit
## 8 yellow choc/peanut faint peanut peanut m&m chocolate
## 9 yellow chocolate faint chocolate m&m chocolate
## 10 yellow lemon wax, faint citrus starburst non-food
## 11 green choc/peanut nothing, plastic peanut m&m non-food
## 12 green chocolate old chocolate m&m chocolate
## 13 green lime sugary, fruity skittle fruit
## 14 blue chocolate funk, chocolate m&m non-food
## 15 blue choc/peanut choc/peanut, funk peanut m&m chocolate
## 16 brown grape sugar, floral skittle fruit
## 17 brown chocolate old powdered milk m&m non-food
## 18 pink strawberry crayon starburst non-food
9.2 Summarizing and subsetting data
Now that we have the candy_table data set, the next step is to determine which elements of it we want to analyze.
For example, we can create a summary of the total number of colors, the number of samples for each color, the types of candy, etc.
Objects and elements within an element
We know that R is an object-oriented language. Objects can store many elements. When those elements are named, as in lists and data frames, we can easily access them using the $ sign.
For example, in the case of our candy_table data frame, our elements are the columns. Each column, then, is identified by the header (also called name) of the column.
So, if you want to look at the color column of candy_table, all you have to do is write:
candy_table$color
## [1] "red" "red" "red" "red" "orange" "orange" "yellow" "yellow" "yellow" "yellow" "green" "green" "green"
## [14] "blue" "blue" "brown" "brown" "pink"
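The same idea can be tried on a small made-up data frame. In this sketch, toy and its values are hypothetical; it shows that $ is one of several equivalent base R ways to pull out a single column as a vector.

```r
# A tiny illustrative data frame (not the course data)
toy <- data.frame(color  = c("red", "blue"),
                  flavor = c("cherry", "chocolate"),
                  stringsAsFactors = FALSE)

# List the column names that can be used after the $ sign
names(toy)

# Three equivalent ways of extracting the color column as a vector
toy$color
toy[["color"]]
toy[, "color"]
```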
Question 2
- How would you look at the scent column in the candy_table data frame?
9.2.1 Summarizing data
Let’s say we want to know the total number of colors. Sure, you can count them, but you can also let R count them for you.
First, you need to know which unique colors we have in our column. To do so, use the unique function:
unique(candy_table$color)
## [1] "red" "orange" "yellow" "green" "blue" "brown" "pink"
Simple. We see that red, orange, yellow, green, blue, brown, and pink are our colors. We can count them by hand, or we can ask R to count them for us:
length(unique(candy_table$color))
## [1] 7
Nested functions
The above shows an example of a nested function. That means that we have one function inside of a different function.
The way that works is that R evaluates the innermost function first, and then passes its result to the function that wraps it.
In the length(unique(object)) case, think of it like this:
- Give me the unique colors of candy
- And then give me the total number of elements from step 1
This can get really complex very fast. So there are two options in case you don’t like this:
Creating intermediate objects for each step:
You can create an intermediate object:
unique_object <- unique(object)
and then
length(unique_object)
and the result will be the same.
Using the %>% operator
The %>% operator is a really cool operator that comes from the magrittr package, which is loaded as part of the tidyverse. It pretty much means "and then".
For example, we can do the same as above like this:
unique(object) %>% length
and get the same result.
However, we won’t focus on the tidyverse in this class.
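To see that these approaches agree, here is a small sketch on a made-up vector called colors (hypothetical data, not the candy table). The pipe part only runs if the magrittr package happens to be installed, which is an assumption about your setup.

```r
# Illustrative vector with repeated values
colors <- c("red", "red", "orange", "yellow", "yellow", "green")

# Option 1: nested functions, evaluated from the inside out
nested <- length(unique(colors))

# Option 2: an intermediate object for each step
unique_colors <- unique(colors)
stepwise <- length(unique_colors)

# Both give the same count
identical(nested, stepwise)

# Option 3: the %>% pipe, only if magrittr is available
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)
  piped <- unique(colors) %>% length
  print(identical(nested, piped))
}
```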
Question 3
- Summarize all the columns and add both the code and the results
- Which column has the highest number of categories?
Finally, we can also summarize the number of samples of a category using the command table.
If we want to know the number of candies that are of each color we can do it like this:
table(candy_table$color)
##
## blue brown green orange pink red yellow
## 2 2 3 2 1 4 4
Easy! If you want to order them, then just use the sort function on the table:
sort(table(candy_table$color))
##
## pink blue brown orange green red yellow
## 1 2 2 2 3 4 4
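A table behaves like a named vector of counts, so you can pull out a single count by name. The sketch below uses a made-up vector (colors is hypothetical data) to show this along with the default, increasing sort.

```r
# Illustrative vector (not the course data)
colors <- c("red", "red", "orange", "yellow", "yellow", "yellow")

# Count how many times each value appears
counts <- table(colors)
counts

# Extract the count of one category by its name
counts[["yellow"]]

# sort() orders the counts from smallest to largest by default
sorted_counts <- sort(counts)
names(sorted_counts)
```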
Question 4
- For each column, what is the most abundant and the least abundant category? Add the code
- How can you organize your results of sort in decreasing order (meaning from larger to smaller)? Add the code
Question 5 (And this is more of a philosophical question)
- Based on your results, which of the columns has the least useful information? (Remember what we talked about in class: sometimes a large number of categories is unnecessary.) Explain why
9.2.2 Subsetting data
Now that we understand the content of the columns, the next step is to remove the columns that do not have any useful information and start to summarize the results.
In my case, I think that the scent column is unnecessary, so I want to remove it.
ct.no_scent <- subset(candy_table, select = -c(scent))
ct.no_scent
## color flavor type scent.category
## 1 red strawberry skittle fruit
## 2 red cherry starburst non-food
## 3 red chocolate m&m non-food
## 4 red choc/peanut peanut m&m chocolate
## 5 orange choc/peanut peanut m&m chocolate
## 6 orange orange skittle fruit
## 7 yellow lemon skittle fruit
## 8 yellow choc/peanut peanut m&m chocolate
## 9 yellow chocolate m&m chocolate
## 10 yellow lemon starburst non-food
## 11 green choc/peanut peanut m&m non-food
## 12 green chocolate m&m chocolate
## 13 green lime skittle fruit
## 14 blue chocolate m&m non-food
## 15 blue choc/peanut peanut m&m chocolate
## 16 brown grape skittle fruit
## 17 brown chocolate m&m non-food
## 18 pink strawberry starburst non-food
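As a sketch of what the select argument is doing, the example below drops a column from a small made-up data frame (toy and its values are hypothetical) in two ways: with subset, as above, and with plain bracket indexing on the column names.

```r
# Illustrative data frame (not the course data)
toy <- data.frame(color  = c("red", "blue"),
                  flavor = c("cherry", "chocolate"),
                  scent  = c("waxy", "funk"),
                  stringsAsFactors = FALSE)

# A minus sign inside select means "every column except this one"
no_scent <- subset(toy, select = -c(scent))
names(no_scent)

# Alternative: keep only the columns whose name is not "scent"
no_scent_alt <- toy[, names(toy) != "scent"]
names(no_scent_alt)
```

Note that subset creates a new object; toy itself still has all three columns.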
Question 6
- Remove another column (other than flavor) and explain why you decided to remove it
Fantastic! Now, let’s summarize the data a bit more. Let’s say we want to know the number of candies in each scent category for each type of candy.
We can use two of the functions we already know: subset and table:
1. Create a subset per type (I will only use skittle) and count the scent categories in it:
skit.ct <- subset(ct.no_scent, type == "skittle")
skit.ct <- table(skit.ct$scent.category)
skit.ct
2. Transform the table to a data frame and give it column names that are more representative of the data:
skit.ct <- data.frame(skit.ct)
colnames(skit.ct) <- c("Flavor", "Skittle")
3. Do it for all the different types of candy
Question 7
- Create summary dataframes for all other types of candy
4. Merge all the data into a single table using the command merge:
tb1 <- merge(skit.ct, starburst.ct, by = "Flavor", all = TRUE)
tb2 <- merge(mm.ct, pean.ct, by = "Flavor", all = TRUE)
tb.all <- merge(tb1, tb2, by = "Flavor", all = TRUE)
tb.all
## Flavor Skittle Starburst M&M Peanut M&M
## 1 fruit 5 NA NA NA
## 2 non-food NA 3 3 1
## 3 chocolate NA NA 2 4
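The steps above can be sketched end to end on a small made-up data set. Everything here (toy, the gummy and mint types, and the resulting object names) is hypothetical and only meant to show how subset, table, data.frame, and merge fit together, and why all = TRUE produces NA values.

```r
# Illustrative data (not the course data)
toy <- data.frame(type     = c("gummy", "gummy", "gummy", "mint", "mint"),
                  category = c("fruit", "fruit", "non-food", "non-food", "non-food"),
                  stringsAsFactors = FALSE)

# Steps 1 and 2 for the first type: subset, count, convert, rename
gummy.ct <- data.frame(table(subset(toy, type == "gummy")$category))
colnames(gummy.ct) <- c("Flavor", "Gummy")

# The same for the second type
mint.ct <- data.frame(table(subset(toy, type == "mint")$category))
colnames(mint.ct) <- c("Flavor", "Mint")

# Step 4: merge by the shared column.
# all = TRUE keeps categories that appear in only one of the two tables,
# and fills the missing counts with NA
merged <- merge(gummy.ct, mint.ct, by = "Flavor", all = TRUE)
merged
```

Without all = TRUE, merge would keep only the categories present in both tables, so the fruit row would be dropped in this example.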
Question 8
- What are the tb1 and tb2 objects in step 4?
- What conclusions can you make from your final table?