Chapter 6 Creating datasets and replicability
Objectives:
- To use the scientific method for data analysis
- To generate research questions based on our observations
- To create data sets based on metrics of interest
- To open and use data sets in a computational environment
Today’s laboratory will be focused in the essentials of building data sets.
Data sets can be built with absolutely anything you can make an observation from and ask a question about.
Before we start, however, lets answer these questions:
Question 1
- What are the main steps of the scientific method?
- How do you define a research question?
- What is the difference between a research question and a hypothesis?
6.1 Creating basic datasets
So, we have a bag of candy in front of us.
Bags of candy have several different elements on it: The various flavors, the diversity in colors. Some may have different shapes, ect.
The objective for today is to ask research questions and create a dataset that allows you to answer different questions based on your research focus (in this case, your candy bag)
6.2 Asking biological questions and identifying measurable outcomes
- Create groups of two students
- Open the bag of candy and generate some observations about it. It can be about anything you find interesting or peaks your interest
Question 2
Create at least three observations from tour research object and add them here:
- Choose one observation and create a research question and a hypothesis for it
Question 3
-
Research Question:
-
Hypothesis
Using a physical notebook, have a list of variables you think would be directly measurable to test your hypothesis
Measure the variables and write down the results in the same physical notebook.
Question 4
Take a photo of the page of the notebook and add it here.
- Did your results answer your research question? How about your hypothesis?
Question 5
-
Research Question:
-
Hypothesis
6.3 Creating a clean raw dataset
Using the same dataset, organize your data in a simple data frame using Google Sheets or EtherCalc
Copy your spreadsheet into
atom
, paste it and save it file into a folder in your desktop calledBIOL120
with atxt
extension (something likedata_sheet_candy.txt
)
WARNING:
Do NOT save any of your files with a space or with weird characters (such as ?><,/’;:=+-
).
If you need to use a space, use an underscore (_
) or separate the names using camel case
(i.e. Instead of data sheet candy.txt
use either data_sheet_candy.txt
or dataSheetCandy.txt
)
In your
R studio
, go to theSession
menu in the toolbar, selectSet Working Directory
and select yourBIOL120
folderRun the following code after changing the name of the .csv file into what you named it
<- read.table('data_sheet_candy.txt', header = T) my_data
i.e. If my file is called
dataset.txt
, then I change the code toread.table('dataset.txt')
Executable Code Chunks
Executable code chunks are sections in your R markdown that allow you to execute or run code inside your file! That means you can do a lot of cool stuff within those chunks, like read datasets, create tables, and plot figures.
Executable code chunks look like this:
```{coding language goes here}
code goes here
```
That means we can use bash
, R
and even python
(as long as you have installed the correct packages) inside your R markdown document.
An example using bash
to get the time:
```{bash}
date
```
## Mon Nov 21 21:10:49 EST 2022
You can also use R
as in here:
```{r}
(my_data <- read.table('data_sheet_candy.txt', header = T))
```
## Flavor Number
## 1 Lime 12
## 2 Lemon 13
## 3 Banana 55
## 4 Strawberries 23
- Check your file loaded by using the code
View(my_data)
or also adding this into your notebook andkniting
the document:
my_data
## Flavor Number
## 1 Lime 12
## 2 Lemon 13
## 3 Banana 55
## 4 Strawberries 23
- In your R markdown notebook do the following:
Question 6
- Add the code to read and visualize the data frame
- What are the names of your columns?
- What do each of you columns represent?
- Write if your results are able to test your hypothesis and give an explanation why
6.4 Creating basic (basic) plots in R
Now that you have your data set readable by a computer program, we can do some very basic visualization to make it easier to interpret.
To do so, we need to install one simple R package
that will allow us to create plots. This package is called ggplot
.
R Packages
An R package are a collection of code, data, functions and other elements that allow you to extend the usage of R
.
Some packages will help you plot, some others will allow you to read data into R in a better and faster way.
To use a package the first thing you have to do is Install it. After the package is installed, you just need to load it into the library.
Installing a package
For example, to install ggplot2
, write the following code just once in the CONSOLE (aka. The window in the lower left corner of your R studio)
install.packages(‘ggplot2’)
If you check the console, you’ll see that the code will say
DONE: ggplot2
That means that your package has been installed.
Loading a package
Now, we need to load the package to get the functions to work.
To load the package, in your R markdown
add this in code block
{r} library(ggplot2)
The command library
loads the package into your R
environment, and now you can use any of its functions!
To learn more about packages in R
, read the following link
Load the
ggplot2
package (make sure you have installed it first)Make sure your data frame is still loaded into
R
.
Question 7
- How would you check that the data frame is still loaded in R? Write the code and a justification
- Now, lets do a basic plot. Chances are that your data set has a set of columns with the colors/flavors/shapes in one column and the number of said elements in a second column.
If this is the case, you can create a simple bar plot. A bar plot summarizes all the information for each category (i.e. The element you are using to qualify the data into different sets) in the x-axis and creates bars in the y-axis where the height represents the number of data points for said category.
To create the plot, modify and execute the following code:
library(ggplot2)
ggplot(data=my_data, aes(x=Flavor, y=Number)) + geom_bar(stat='identity')
Question 8
- Is this an easier way of comparing your data? Explain why.
6.5 Comparisons between groups.
Now, lets see how our data collection is compared to the other groups
- Get together with another group
- Compare your two data collections and answer the following:
Question 9
- What was the question the other group asked?
- Did they data collection help them answer this question?
- Import their data set into your
R
environment. That means to ask them for their original, hand written data sheet and re-digitize it usingGoogle Sheets
orEtherCalc
Question 10
-
Load the data sheet into
R
and show it in your Markdown notebook - Compare it to their digital data sheet. Is it identical?
- Now, talk to the other group and start thinking of a way to measure the differences between the two data sets.
Asking questions
For example: One group has M&Ms and the other has Skittles. What are two features in common these two candies have? Color? Flavors?
You can, for example, measure the number of colors for both Skittles and M&M’s and compare them
Question 11
-
Research Question:
-
Hypothesis
- Create a digital data frame with these two comparison and add them to
R
Question 12
-
Load the data sheet into
R
and show it in your Markdown notebook
- Lets create a plot to compare the two different candies
Comparing data sets
When you want to compare between different sites/treatments/types of candies is important that the data frame has an extra column that tell you where these are from.
For example:
Flavor | Number | Candy |
---|---|---|
Lime | 12 | Skittles |
Lemon | 13 | Skittles |
Banana | 55 | Skittles |
Strawberries | 23 | Skittles |
Lime | 5 | Starburst |
Lemon | 0 | Starburst |
Banana | 12 | Starburst |
Strawberries | 3 | Starburst |
In this case, we can separate the flavors from each candy.
Modify the following code with your column names:
ggplot(data=two_candies, aes(x=Flavor, y=Number, fill=Candy)) +
geom_bar(stat='identity', position="dodge")
Question 13
- Can this graph help you answer your research question? How
- What do your results show, based on the research question?
-
Compare the code for plotting the
two_candies
data set versus the code for themy_data
data set. What changed?