Chapter 8 Basic R (2) : Data frames
Objectives:
- To understand what a data frame is
- To identify and extract rows and columns
- To create new data frames based on vectors
- To combine different data frames
- To learn how to read different data frame types
Some objects, such as data frames and matrices, have rows and columns. These are two-dimensional objects, and we will refer to them with the blanket term tables.
8.1 Creating tables using vectors
We can create tables using vectors:
# This is a table with two columns and four rows.
# Columns are Names and Age
# Rows are the age of each person:
names <- c("Kelsey","Javier","Joe","Li")
age <- c(22, 34, 15, 50)
# Names is a character vector and age is a numeric/integer vector, so we can put them together in a data frame:
age.table <- data.frame(names, age, stringsAsFactors = FALSE)
age.table
## names age
## 1 Kelsey 22
## 2 Javier 34
## 3 Joe 15
## 4 Li 50
We can also create tables by loading existing data, as we did in Chapter 4.
To see if our new object is correct, we can use the class and str commands. str stands for structure, and it returns the structure of the table:
# We expect the table to be a data frame class
class(age.table)
## [1] "data.frame"
# We expect the structure of the table to be one character column, and one numeric column
str(age.table)
## 'data.frame': 4 obs. of 2 variables:
## $ names: chr "Kelsey" "Javier" "Joe" "Li"
## $ age : num 22 34 15 50
We should also make sure that our table has the expected dimensions (i.e., four rows and two columns). We can check this using the dim command:
dim(age.table)
## [1] 4 2
Note that it tells you the number of rows first, and then the number of columns. R always uses this order to deal with tables, so rows first and then columns.
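If we only need one of the two dimensions, base R also provides nrow and ncol. The following is a minimal sketch that rebuilds the age.table example so it can run on its own; the object names table.dim, n.rows, and n.cols are only illustrative:

```r
# Rebuild the example table so this block is self-contained
names <- c("Kelsey", "Javier", "Joe", "Li")
age <- c(22, 34, 15, 50)
age.table <- data.frame(names, age, stringsAsFactors = FALSE)

# dim() returns the number of rows first, then the number of columns
table.dim <- dim(age.table)

# nrow() and ncol() return each dimension separately
n.rows <- nrow(age.table)
n.cols <- ncol(age.table)
```

Here table.dim holds both values (4 and 2), while n.rows and n.cols hold them one at a time.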
8.1.1 Accessing specific rows and columns
This rows-then-columns order carries over to the way R handles tables. Using the name of the table followed by square brackets allows you to access rows or columns.
Rows are accessed on the left side of the comma inside the square brackets. Use the number of the row you want to access. So, if you want to access the first row of your age.table data frame, use the following code:
age.table[1, ]
## names age
## 1 Kelsey 22
Similar syntax can be used to access the columns, but use the right side of the comma inside the square brackets:
age.table[, 1]
## [1] "Kelsey" "Javier" "Joe" "Li"
One last thing: in data frames, you can extract each column as a vector using the $ operator:
age.table$names
## [1] "Kelsey" "Javier" "Joe" "Li"
age.table$age
## [1] 22 34 15 50
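Row and column positions can also be combined inside the same square brackets. The sketch below is only an illustration under the same rows-first rule; the object names javier.age, age.by.name, and first.three are hypothetical:

```r
# Rebuild the example table so this block is self-contained
names <- c("Kelsey", "Javier", "Joe", "Li")
age <- c(22, 34, 15, 50)
age.table <- data.frame(names, age, stringsAsFactors = FALSE)

# Row 2, column 2: a single value (rows first, then columns)
javier.age <- age.table[2, 2]

# Columns can also be selected by name instead of by position
age.by.name <- age.table[, "age"]

# Several rows at once: rows 1 to 3, all columns
first.three <- age.table[1:3, ]
```

Selecting columns by name tends to be safer than selecting by position, because the code keeps working if the column order changes.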
Question 1
- Create a data frame using three vectors: colors, candy, numbers. Add the code and the table.
- What are the classes for each column?
8.2 Data frames versus matrices
Data frames and matrices are two-dimensional data structures used in R.
Data frames, as shown by the age.table example, can hold many different types of one-dimensional data structures (such as a mix of numeric, character, or factor vectors).
# Checking that our data frame IS a data frame
class(age.table)
## [1] "data.frame"
Matrices are two-dimensional data structures in which every value has the same type; in this chapter we use matrices that contain only numerical values. These matrices are useful for arithmetic operations and other mathematical calculations.
We can create a matrix by using a vector that includes the data of interest and then specifying in the function call how many rows and columns are needed.
# Creating a matrix
num.matrix <- matrix(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 3, ncol = 4, byrow = TRUE)
# Checking that the matrix is indeed a matrix
class(num.matrix)
## [1] "matrix" "array"
# Extracting the first column of the matrix
num.matrix[, 1]
## [1] 1 5 9
And now we can do some math:
num.matrix + 1
## [,1] [,2] [,3] [,4]
## [1,] 2 3 4 5
## [2,] 6 7 8 9
## [3,] 10 11 12 13
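Adding 1 is only one of the operations available. As a small sketch of other arithmetic we could try on the same matrix (the shortcut 1:12 and the object names doubled, row.totals, col.means, and flipped are our own choices for illustration):

```r
# Same values as num.matrix above, filled row by row (1:12 is shorthand for 1, 2, ..., 12)
num.matrix <- matrix(data = 1:12, nrow = 3, ncol = 4, byrow = TRUE)

# Multiply every element by 2
doubled <- num.matrix * 2

# Sum of each row: 10 26 42
row.totals <- rowSums(num.matrix)

# Mean of each column: 5 6 7 8
col.means <- colMeans(num.matrix)

# Transpose: rows become columns, so the result has 4 rows and 3 columns
flipped <- t(num.matrix)
```

All of these work because every value in the matrix is a number.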
Question 2
- Create a data frame using three vectors: colors, candy, numbers. Add the code and the table.
- What are the classes for each column?
8.3 Combining different two-dimensional data structures
R can be used to bind different data structures as long as the dimensions are compatible.
8.3.1 Combining by rows
For example, if you want to add two tables together by rows, you can use the rbind function:
# Let's create a new data frame that has the same column names as age.table but with different data
names <- c("Fran","Yeyi","Andy","John")
age <- c(31, 32, 17, 40)
age.table2 <- data.frame(names, age, stringsAsFactors = FALSE)
# Now, we can try and combine those tables by rows:
large.age.table <- rbind(age.table, age.table2)
large.age.table
## names age
## 1 Kelsey 22
## 2 Javier 34
## 3 Joe 15
## 4 Li 50
## 5 Fran 31
## 6 Yeyi 32
## 7 Andy 17
## 8 John 40
# We can also check if the dimensions are different than in the original age.table
dim(age.table)
## [1] 4 2
dim(age.table2)
## [1] 4 2
dim(large.age.table)
## [1] 8 2
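For data frames, compatible also means that the column names match. The sketch below uses a hypothetical table, other.table, whose second column is called years instead of age, and wraps rbind in tryCatch so the error does not stop the script:

```r
# A shortened version of age.table so this block is self-contained
age.table <- data.frame(names = c("Kelsey", "Javier"),
                        age = c(22, 34),
                        stringsAsFactors = FALSE)

# Hypothetical table: same kind of data, but the second column has a different name
other.table <- data.frame(names = c("Fran", "Yeyi"),
                          years = c(31, 32),
                          stringsAsFactors = FALSE)

# rbind() refuses to stack data frames whose column names differ
result <- tryCatch(
  rbind(age.table, other.table),
  error = function(e) "rbind failed: column names do not match"
)

# One possible fix: give other.table the same column names, then bind again
names(other.table) <- names(age.table)
fixed <- rbind(age.table, other.table)
```

After the rename, fixed has four rows and the two original column names.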
8.3.2 Combining by columns
However, we have to be careful about combining tables. Since age.table and age.table2 have the same number of rows, you can combine them by columns using cbind:
cbind(age.table, age.table2)
## names age names age
## 1 Kelsey 22 Fran 31
## 2 Javier 34 Yeyi 32
## 3 Joe 15 Andy 17
## 4 Li 50 John 40
But this table would not be very useful in the future as it has repeated column names and that can be a problem for R.
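One way to avoid the repeated names, sketched here as a suggestion rather than the only option, is to check for duplicates and then rename the columns. The new column names (names.1, age.1, names.2, age.2) are hypothetical:

```r
# Rebuild both example tables so this block is self-contained
age.table <- data.frame(names = c("Kelsey", "Javier", "Joe", "Li"),
                        age = c(22, 34, 15, 50),
                        stringsAsFactors = FALSE)
age.table2 <- data.frame(names = c("Fran", "Yeyi", "Andy", "John"),
                         age = c(31, 32, 17, 40),
                         stringsAsFactors = FALSE)

combined <- cbind(age.table, age.table2)

# TRUE if any column name appears more than once
has.repeats <- any(duplicated(names(combined)))

# Give every column a unique name
names(combined) <- c("names.1", "age.1", "names.2", "age.2")
```

With unique names, each column can be called with $ without any ambiguity.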
Question 3
- Create a new data frame with three columns of your liking and combine them with the larger age table. Add the code and the dimensions
- Try to add an additional column that has a different length. What happens? Explain to the best of your knowledge
8.3.3 Reading in data tables
Finally, we will learn how to read various types of data tables.
8.3.3.1 Reading text delimited files (.txt or .tsv)
Text delimited files are a simple type of file that contains plain, unformatted text.
This means that the text does not have different fonts, colors, styles, etc.
It's just plain and literal text.
In the case of data frames, these files separate each column with a tab (the character that appears on the screen when you press the Tab key on your keyboard). Tabs are different from spaces! This means that the words in a phrase such as hello how are you are separated by spaces (often written as \s in regular expressions), while two columns are separated by the character \t, which is how the computer sees a tab.
For example, a table like this:
Name | Number of cats |
---|---|
Javier | 3 |
Jackie | 1 |
Brenda | 0 |
Bea | 6 |
will look, to you, like this:
Name Number of cats
Javier 3
Jackie 1
Brenda 0
Bea 6
and will look, to the computer, like this:
Name\tNumber\sof\scats
Javier\t3
Jackie\t1
Brenda\t0
Bea\t6
Do we see the difference?
So, to read these data tables separated by tabs, we use the read.table command:
read.table(file = "Lab_5/data_table.txt", sep = "\t", header = T)
## Name Number.of.cats
## 1 Javier 3
## 2 Jackie 1
## 3 Brenda 0
## 4 Bea 6
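If you do not have the file at hand, the same idea can be tried with a temporary file. This is a minimal sketch: the temporary file and the object names tmp.file and cats.table are our own, not part of the lab materials.

```r
# Write the cats table to a temporary tab-separated file
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("Name\tNumber of cats",
             "Javier\t3",
             "Jackie\t1",
             "Brenda\t0",
             "Bea\t6"),
           con = tmp.file)

# Read it back, telling R that columns are separated by tabs
# and that the first line contains the column names
cats.table <- read.table(file = tmp.file, sep = "\t", header = TRUE)
cats.table

# Remove the temporary file
unlink(tmp.file)
```

Note that R replaces the spaces in the column name with dots, which is why the output above shows Number.of.cats.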
Question 4
- Download the table from here and read it into R. Add the code.
- What do the options sep and header mean? (Remember to use ?read.table to help you when you have no idea what a function does)
- What happens if I remove the sep and header options?
- What are the classes and lengths of each column? Add the code
8.3.3.2 Reading comma-separated value files (.csv)
Comma-separated value files (CSV files) are similar to tab-separated text files, but they use commas instead of tabs to separate the columns:
Name,Number of cats
Javier,3
Jackie,1
Brenda,0
Bea,6
CSVs are super common in the sciences and in data management. Some of you have noticed that when you copy the data from Google Sheets to your text file, it creates commas to separate the columns.
To read a CSV file, you can use the read.csv function.
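As a rough sketch of how this works, we can write the small table above to a temporary CSV file and read it back. The temporary file and the object names tmp.csv and cats.csv are hypothetical:

```r
# Write the cats table to a temporary comma-separated file
tmp.csv <- tempfile(fileext = ".csv")
writeLines(c("Name,Number of cats",
             "Javier,3",
             "Jackie,1",
             "Brenda,0",
             "Bea,6"),
           con = tmp.csv)

# read.csv() already assumes commas as separators and a header line
cats.csv <- read.csv(file = tmp.csv)
cats.csv

# Remove the temporary file
unlink(tmp.csv)
```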
Question 5
- Download the table from here and read it into R using the read.csv function. Read the help and add the code and an explanation of the code syntax
- Is this table any different from the table in question 4?
8.3.3.3 Reading Excel files (.xlsx)
We talked in class about how Excel files are not necessarily the best, but in most cases they are super important for storing information, especially for people who are not well versed in computational stuff.
Well, we can also read those files using the read_xlsx command from the readxl package.
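To get a feel for read_xlsx before working with your own spreadsheet, the sketch below uses an example workbook that ships with readxl (datasets.xlsx), not the lab file. It assumes readxl is already installed and simply skips the example if it is not; the object names example.path, sheet.names, and first.sheet are our own:

```r
# Only run the example if the readxl package is available
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with readxl (not the lab spreadsheet)
  example.path <- readxl::readxl_example("datasets.xlsx")

  # An Excel file can hold several sheets; list their names first
  sheet.names <- readxl::excel_sheets(example.path)

  # Read one sheet by name into its own object
  first.sheet <- readxl::read_xlsx(example.path, sheet = sheet.names[1])
  class(first.sheet)
}
```

The sheet argument is what lets us read different sheets of the same file into different objects.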
Question 6
- Check your datasets chapter and, based on this info, install the readxl package in your R session
- Download the Excel spreadsheet from here
- Using the read_xlsx command, read the cats spreadsheet into one object and the dogs spreadsheet into a different one. Both (Check your spreadsheet in Excel or Google Sheets beforehand)
- Is this table any different from the table in questions 4 and 5?