Chapter 9 Sumarizing and subsetting data frames

There are many cases in which a data frame is comprised by all sorts of different data types.

We can extract, process and analyze all sorts of information in a data frame. We wil llearn how to do this using the basic R commands

9.1 Reading in the dataframe

We will be analyzing the amazing data set complied by Collins, Gilley and Udry (2022) that tackles the question of Why do some of the candies have the same color but smell different?.

To proceed, we will read in the data table included here.

Question 1

  • Download the data frame and load it into R into an object called candy_table
  • What are the classes of each column of data?

Files without an extension

In some cases, you will find that some of the files won’t have an extension (meaning there is no info after the name of the file of the type of the file, this being either .txt, .csv and others).

In this case our recommendation is to open the files directly in Rstudio, Atom or any other word processor and checking what type of file that is.

In the case of today’s file, you can easily see its a table delimited file.

##     color      flavor               scent       type scent.category
## 1     red  strawberry      sugary, fruity    skittle          fruit
## 2     red      cherry        waxy, crayon  starburst       non-food
## 3     red   chocolate            old milk        m&m       non-food
## 4     red choc/peanut     faint chocolate peanut m&m      chocolate
## 5  orange choc/peanut        faint peanut peanut m&m      chocolate
## 6  orange      orange strong sugary fruit    skittle          fruit
## 7  yellow       lemon sugar, faint citrus    skittle          fruit
## 8  yellow choc/peanut        faint peanut peanut m&m      chocolate
## 9  yellow   chocolate     faint chocolate        m&m      chocolate
## 10 yellow       lemon   wax, faint citrus  starburst       non-food
## 11  green choc/peanut    nothing, plastic peanut m&m       non-food
## 12  green   chocolate       old chocolate        m&m      chocolate
## 13  green        lime      sugary, fruity    skittle          fruit
## 14   blue   chocolate     funk, chocolate        m&m       non-food
## 15   blue choc/peanut   choc/peanut, funk peanut m&m      chocolate
## 16  brown       grape       sugar, floral    skittle          fruit
## 17  brown   chocolate   old powdered milk        m&m       non-food
## 18   pink  strawberry              crayon  starburst       non-food

9.2 Summarizing and subsetting data

Now that we have the candy_table data set, the next step is to determine which elements of it do we want to analyse.

For example, we can create a summary of the total number of colors, the number of samples for each color, the types of candy, ect.

Objects and elements within an element

We know that R is an object oriented language. These objects can store many elements. In the case of multidimensional objects (more than one dimension), we can easily access them using the $ sign.

For example, in the case of our candy_table data frame, our elements are the columns. Each column, then, is identified by the header (also called name) of the column.

So, if you want to look at the color column of candy_table, all you have to do is write:

candy_table$color
##  [1] "red"    "red"    "red"    "red"    "orange" "orange" "yellow" "yellow" "yellow" "yellow" "green"  "green"  "green" 
## [14] "blue"   "blue"   "brown"  "brown"  "pink"

Question 2

  • How would you look at the acent column in the candy_table data frame?

9.2.1 Summarizing data

Lets say we want to know what is the total number of colors. Sure, you can count them, but you can also let R count them for you.

First, you need to know how many unique colors do we have in our column. To do so, use the unique function:

unique(candy_table$color)
## [1] "red"    "orange" "yellow" "green"  "blue"   "brown"  "pink"

Simple. We see that red, orange, yellow, green, blue, brown, pink are our colors. We can count them by hand, or we can ask R to count them for us:

length(unique(candy_table$color))
## [1] 7

Nested functions

The above example shows an example of a nested function. That means that we have one function inside of a different function.

The way that works is that the first function is the one that goes inside, and the second one outside.

In the length(unique(object)) case, then think it like this:

  1. Give the the unique colors of candy
  2. And then give the the total number of step 1

This can get really complex very fast. So there are two options in case you don’t like this:

Creating intermediate objects for each step:

You can create and intermediate object:

unique_object <- unique(object)

and then

length(unique_object)

and the result will be the same

Using the %>% operator

The %>% operator is a really cool function that can be found in the tidyverse package. It pretty much means and then.

For example, we can do the same as above like this:

unique(object) %>% length

and get the same result.

However, we won’t focus on the tidyverse in this class.


Question 3

  • Summarize all the columns and add both the code and the results
  • Which column has the highest amount of categories?

Finally, we can also summarize the number of samples of a category using the command table.

If we want to know the number of candies that are of each color we can do it like this:

table(candy_table$color)
## 
##   blue  brown  green orange   pink    red yellow 
##      2      2      3      2      1      4      4

Easy! If you want to order them, then just use the sort function in the table:

sort(table(candy_table$color))
## 
##   pink   blue  brown orange  green    red yellow 
##      1      2      2      2      3      4      4

Question 4

  • For each column, what is the most abundant and less abundant category? Add the code
  • How can you organize your results of sort in a decreasing order (meaning from larger to smaller)? Add the code

Question 5 (And this is more of a philosophical question)

  • Based on your results, which of the columns has the least useful information. (Remember what we talked about in class: Sometimes a large number of categories is unnecessary). Explain why

9.2.2 Summarizing data

Now that we understand the content of the columns, the next step is to remove the columns that do not have any information and start to summarize your results.

In my case, I think that the scent column is unnecessary, so I want to remove it.

ct.no_scent <- subset(candy_table, select = -c(scent))
ct.no_scent
##     color      flavor       type scent.category
## 1     red  strawberry    skittle          fruit
## 2     red      cherry  starburst       non-food
## 3     red   chocolate        m&m       non-food
## 4     red choc/peanut peanut m&m      chocolate
## 5  orange choc/peanut peanut m&m      chocolate
## 6  orange      orange    skittle          fruit
## 7  yellow       lemon    skittle          fruit
## 8  yellow choc/peanut peanut m&m      chocolate
## 9  yellow   chocolate        m&m      chocolate
## 10 yellow       lemon  starburst       non-food
## 11  green choc/peanut peanut m&m       non-food
## 12  green   chocolate        m&m      chocolate
## 13  green        lime    skittle          fruit
## 14   blue   chocolate        m&m       non-food
## 15   blue choc/peanut peanut m&m      chocolate
## 16  brown       grape    skittle          fruit
## 17  brown   chocolate        m&m       non-food
## 18   pink  strawberry  starburst       non-food

Question 6

  • Remove another column (other tan flavor) and explain why you decided to remove it

Fantastic! Now, lets summarize the data a bit more. Lets say we want to know the number of scent category for each type of candy

We can use two of the functions we already know: subset and table:

  1. Create a subset per type (I will only use skittle):
skit.ct <- subset(ct.no_scent, type == "skittle")
skit.ct <- table(skit.ct$scent.category)
  1. Transform the list to a data frame and give it column names that are more representative of the data
skit.ct <- data.frame(skit.ct)
colnames(skit.ct) <- c("Flavor", "Skittle")
  1. Do it for all different colors

Question 7

  • Create summary dataframes for all other types of candy

4.Merge all the data into a single table using the command merge

tb1 <- merge(skit.ct, starburst.ct, by="Flavor", all = T)
tb2 <- merge(mm.ct, pean.ct, by="Flavor", all = T)

tb.all <- merge(tb1, tb2, by = "Flavor", all = T)
tb.all
##      Flavor Skittle Starburst M&M Peanut M&M
## 1     fruit       5        NA  NA         NA
## 2  non-food      NA         3   3          1
## 3 chocolate      NA        NA   2          4

Question 8

  • What are the tb1 and tb2 objects in step 4?
  • What conclusions can you make from your final table?