Chapter 8 Basic R (2) : Data frames

Objectives:

  1. To understand what a data frame is
  2. To identify and extract rows and columns
  3. To create new data frames based on vectors
  4. To combine different data frames
  5. To learn how to read different data frame types

There are objects that will have rows and columns, such as data frames and matrices. These are two-dimensional objects that we will use the blanket term Tables.

8.1 Creating tables using vectors

We can create tables using vectors

# This is a table with two columns and four rows.
# Columns are Names and Age
# Rows are the age of each person:
names <- c("Kelsey","Javier","Joe","Li")
age <- c(22, 34, 15, 50)
# Names is a character vector and age is a numeric/integer vector, so we can put them together in a data frame:
age.table <- data.frame(names, age, stringsAsFactors = F)
age.table
##    names age
## 1 Kelsey  22
## 2 Javier  34
## 3    Joe  15
## 4     Li  50

or loading existing data as we did in Chapter 4

To see if our new object is correct, we can use the class and the str command. str means structure, and will return the structure of the table:

# We expect the table to be a data frame class
class(age.table)
## [1] "data.frame"
# We expect the structure of the table to be one character column, and one numeric column
str(age.table)
## 'data.frame':    4 obs. of  2 variables:
##  $ names: chr  "Kelsey" "Javier" "Joe" "Li"
##  $ age  : num  22 34 15 50

We should also make sure that our table has the expected dimensions (i.e. Four rows and two columns), We can check this using the dim command:

dim(age.table)
## [1] 4 2

Note that it tells you the number of rows first, and then the number of columns. R always uses this order to deal with tables, so rows first and then columns.

8.1.1 Accessing specific rows and columns

This in integrated further on the way that R handles tables. using the name of the table followed by square brackets will allow you to access rows or columns.

Rows can be accessed on the left side of the square brackets. Use the number of the row you want to access it. So, if you want to access the first row of your age.table data frame, use the following code:

age.table[1, ]
##    names age
## 1 Kelsey  22

Similar syntax can be used to access the columns, but use the right side of the square brackets:

age.table[,1 ]
## [1] "Kelsey" "Javier" "Joe"    "Li"

Last last thing: You can call each column as a vector using the $ command (in data frames)

age.table$names
## [1] "Kelsey" "Javier" "Joe"    "Li"
age.table$age
## [1] 22 34 15 50

Question 1

  • Create a data frame using three vectors: colors, candy, numbers. Add the code and the table.
  • What are the classes for each column?

8.2 Data frames versus matrices

Data frames and matrices are two-dimensional data structures used in R.

Data frames, as shown by the age.table example, can hold on to many different types of one dimensional data structures (such as mix of numeric, character or factor vectors).

# Checking that our data frame IS a data frame
class(age.table)
## [1] "data.frame"

Matrices are numerical two dimensional data structures that only contain numerical values. These matrices are useful for arithmetic operations and other elements of calculus.

We can create a matrix by using a vector that includes the data of interest and then specifying on the function how many rows and columns are necessary.

# Creating a matrix
num.matrix <- matrix(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 3, byrow = 4)
# Checking that the matrix is indeed a matrix
class(num.matrix)
## [1] "matrix" "array"
# Extracting the first column of the matrix
num.matrix[,1]
## [1] 1 5 9

And now we can do some math:

num.matrix + 1
##      [,1] [,2] [,3] [,4]
## [1,]    2    3    4    5
## [2,]    6    7    8    9
## [3,]   10   11   12   13

Question 2

  • Create a data frame using three vectors: colors, candy, numbers. Add the code and the table.
  • What are the classes for each column?

8.3 Combining different two-dimensional data structures

R can be used to bind different data structures as long as the dimensions are compatible.

8.3.1 Combining by rows

For example, if you want to add two tables together by rows, you can use the rbind function:

# Let's create a new data frame with has the same column names as age.table but with different data
names <- c("Fran","Yeyi","Andy","John")
age <- c(31, 32, 17, 40)
age.table2 <- data.frame(names, age, stringsAsFactors = F)
# Now, we can try and combine those tables by rows:
large.age.table <- rbind(age.table, age.table2)
large.age.table
##    names age
## 1 Kelsey  22
## 2 Javier  34
## 3    Joe  15
## 4     Li  50
## 5   Fran  31
## 6   Yeyi  32
## 7   Andy  17
## 8   John  40
# We can also check if the dimensions are different than in the original age.table
dim(age.table)
## [1] 4 2
dim(age.table2)
## [1] 4 2
dim(large.age.table)
## [1] 8 2

8.3.2 Combining by columns

However, we have to be careful about combining tables. Since age.table and age.table2 have the same dimensions, you can combine them by columns using cbind:

cbind(age.table, age.table2)
##    names age names age
## 1 Kelsey  22  Fran  31
## 2 Javier  34  Yeyi  32
## 3    Joe  15  Andy  17
## 4     Li  50  John  40

But this table would not be very useful in the future as it has repeated column names and that can be a problem for R.

Question 3

  • Create a new data frame with three columns of your liking and combine them with the larger age table. Add the code and the dimensions
  • Try to add an additional column that has a different length. What happens? Explain to the best of your knowledge

8.3.3 Reading in data tables

Finally, we will learn how to read various types of data tables.

8.3.3.1 Reading text delimited files (.txt or .tsv)

Text delimited files are a simple type of file that contains any type of simplified, unformatted text.

This means that the text does not have different fonts, colors, styles, ect.

Its just plain and literal text.

In the case of data frames, these files separate each column by a tab (The character that appears in the screen when you press the tab key in your keyboard). Tabs are different than spaces! This means, then, that a word that says hello how are you is separated by spaces (That in coding langage are known as \s) while two columns are separated by the character \t, which is how a tab is seen by the computer.

For example, a table like this:

Name Number of cats
Javier 3
Jackie 1
Brenda 0
Bea 6

Will look, to you, as

Name    Number of cats
Javier  3
Jackie  1
Brenda  0
Bea 6

Will look (to the computer) like this:

Name\tNumber\sof\scats
Javier\t3
Jackie\t1
Brenda\t0
Bea\t6

Do we see the difference?

So, to read these data tables separated by tabs, we use the command read.table

read.table(file = "Lab_5/data_table.txt", sep = "\t", header = T)
##     Name Number.of.cats
## 1 Javier              3
## 2 Jackie              1
## 3 Brenda              0
## 4    Bea              6

Question 4

  • Download the table from here and read it into R. Add the code.
  • What do the options sep and header mean? (Remember to use ?read.table to help you when you have no idea what a function does)
  • What happens if I remove the sep and header options?
  • What are the classes and lengths of each column? Add the code

8.3.3.2 Reading comma-separated value files (.csv)

Comma-separated value files (CSV files) are fancy versions of tab separated text files but that have commas instead of tabs:

Name,Number of cats
Javier,3
Jackie,1
Brenda,0
Bea,6

CSV’s are super common in sciences and data management. Some of you have noticed that when you copy the data from Google Sheets to your text file it creates commas to separate the columns.

To read CSV you can use the read.csv function

Question 5

  • Download the table from here and read it into R using the read.csv function. Read the help and add the code and an explanation of the code syntax
  • Is this table any different from the table in question 4?

8.3.3.3 Reading Excel files (.xlsx)

We talked in class how Excel files are not necessarily the best but in most cases they are super important to store information, specially for people that are not well versed in computational stuff.

Well, we can also read those files using the read_xlsx command from the package readxl

Question 6

  • Check your datasets chapter and based on this info, install the readxl package in your R session
  • Download the excel spreadsheet from here
  • Using the read_xlsx command, read the cats spreadsheet into one object and the dogs spreadsheet in a different one. Both (Check your spreadsheet in excel or google sheets before hand) - Is this table any different from the table in questions 4 and 5?