tidyverse for data analysis
Remixed from Claus O. Wilke’s SDS375 course and Andrew P. Bray’s quarto workshop
Workshop materials are here:
RStudio and the Quarto notebook
Loading and writing tabular data
Data wrangling and making plots with the tidyverse
Tables and statistics
Create a project for today’s workshop and download the data.
Name your directory workshop_2 and click “Create Project”.
Quarto notebooks are saved with the .qmd extension.
Packages are collections of functions and objects that are shared for free.
In the console, you can type e.g. install.packages("tidyverse")
to install most R packages.
Sometimes R packages need to be installed a different way, and the documentation of the package will tell you how.
Then, to load a package, add library("tidyverse")
in a code chunk (usually in the first code cell of your document)
You can quickly insert chunks like these into your file with
```{r}
```
Example chunk:
You can use <-
or =
to assign values to variables
We will use <-
for all examples going forward.
A lot of R people use .
inside variable names, but in most languages besides R this would be an error. It’s good practice these days to use the _
underscore if you want separation in your variable names.
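As a small illustration of the two assignment operators and underscore-separated names (the variable names here are made up for the example):

```{r}
# `<-` and `=` both assign a value to a variable
od_reading <- 0.206
od_reading_2 = 0.171   # works, but we use <- in this workshop

# Underscores separate the words in a variable name
mean_od <- (od_reading + od_reading_2) / 2
mean_od
```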
Functions are named bits of code that take parameters as input and return some output
[1] "hello world"
str_c
is a function that concatenates strings.
Functions can have named parameters as well as positional parameters.
Named parameters always take an =
sign for assignment.
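A minimal sketch of calling str_c with both kinds of parameters; the variable name greeting is only for illustration:

```{r}
library(tidyverse)  # str_c() comes from stringr, which the tidyverse loads

# Positional parameters: the strings to join
# Named parameter: sep, set with `=`
greeting <- str_c("hello", "world", sep = " ")
greeting
```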
Type ?str_c in the console to get a help page. Check out this guide on how to read the R help pages.
Google! Add “tidyverse” to search queries to get more relevant results.
phind.com and chat.deepseek.com are good free AI services for getting help with code.
The type of the value can be
Quick live demo of doing some work in R
Bacterial growth measurements with different species
Measured as an optical density (OD) of the culture at the end of the experiment
Growth measured with different concentrations of different long-chain fatty acids added to the media
Download the data from kwondry.github.io/documentation/r-tutorial. Put it in a folder called data
inside your R project folder.
codebook:
plate_id – an identifier of which plate was measured
row – row of the plate
column – column of the plate
bug – the isolate/species that was tested in this well
condition – which long-chain fatty acid (LCFA) was added
conc – the concentration of the LCFA in this well
od – the optical density (OD600) that was measured in this well. This is a measure of bacterial growth.
Data is often in tables, and the easiest way to store tabular data is in csv
or tsv
format.
csv
- comma separated values
tsv
- tab separated values
To read in data stored this way, use read_csv(filename)
or read_tsv(filename)
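A minimal sketch of reading a csv file. So that it runs anywhere, it first writes a tiny stand-in file to a temporary location; the file and the variable names are hypothetical, and in the workshop you would point read_csv at a file in your data folder instead.

```{r}
library(tidyverse)

# Hypothetical stand-in file, written to a temporary location
example_file <- tempfile(fileext = ".csv")
write_lines(
  c("plate_id,row,column,od", "1,a,1,0.206", "1,a,2,0.171"),
  example_file
)

# read_csv() guesses a type for each column from its contents
plate_example <- read_csv(example_file, show_col_types = FALSE)
plate_example
```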
binding tables together
We have data from 4 different plates in separate csv
files. Use bind_rows
to make a single table with all the data.
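A minimal sketch of bind_rows with two tiny stand-in tables instead of the real plate files. The table names are hypothetical; the values are copied from the workshop output.

```{r}
library(tidyverse)

# Stand-ins for two of the plate files
plate_1 <- tibble(plate_id = 1, row = "a", column = 1:2, od = c(0.206, 0.171))
plate_3 <- tibble(plate_id = 3, row = "f", column = 2:3, od = c(0.908, 0.924))

# Stack the tables on top of each other, matching columns by name
all_plates <- bind_rows(plate_1, plate_3)
all_plates
```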
joining metadata to the data
Connect the metadata to the plate reader data using left_join
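A minimal sketch of the join, assuming both tables share the key columns plate_id, row, and column. The table names plate_data and metadata are hypothetical stand-ins built inline.

```{r}
library(tidyverse)

plate_data <- tibble(
  plate_id = 1, row = "a", column = c(1, 2, 5),
  od = c(0.206, 0.171, 0.137)
)
metadata <- tibble(
  plate_id = 1, row = "a", column = c(1, 2, 5),
  bug = c("CTRL", "crispatus", "jensenii"),
  condition = "CTRL", conc = 0
)

# Each well in plate_data picks up the matching metadata columns
joined_example <- left_join(plate_data, metadata, by = c("plate_id", "row", "column"))
joined_example
```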
tibbles (aka data frames)
tibbles are the big reason R is great for working with tabular data.
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
# A tibble: 384 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.206 1 a 1 CTRL CTRL 0
2 0.171 1 a 2 crispatus CTRL 0
3 0.136 1 a 3 crispatus CTRL 0
4 0.131 1 a 4 crispatus CTRL 0
5 0.137 1 a 5 jensenii CTRL 0
6 0.14 1 a 6 jensenii CTRL 0
7 0.144 1 a 7 jensenii CTRL 0
8 0.126 1 a 8 iners CTRL 0
9 0.13 1 a 9 iners CTRL 0
10 0.127 1 a 10 iners CTRL 0
# ℹ 374 more rows
Pick rows: filter()
Pick columns: select()
Sort rows: arrange()
Count things: count()
Make new columns: mutate()
%>%
or |>
feeds data into functions
# A tibble: 6 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.206 1 a 1 CTRL CTRL 0
2 0.171 1 a 2 crispatus CTRL 0
3 0.136 1 a 3 crispatus CTRL 0
4 0.131 1 a 4 crispatus CTRL 0
5 0.137 1 a 5 jensenii CTRL 0
6 0.14 1 a 6 jensenii CTRL 0
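To make the pipe concrete, here is a small sketch showing that piping a table into head() matches calling head() on it directly. The table example_data is a stand-in built from a few rows of the output; the native pipe |> needs R 4.1 or newer.

```{r}
library(tidyverse)

example_data <- tibble(
  od = c(0.206, 0.171, 0.136, 0.131, 0.137, 0.14, 0.144, 0.126),
  bug = c("CTRL", "crispatus", "crispatus", "crispatus",
          "jensenii", "jensenii", "jensenii", "iners")
)

# These three lines do the same thing
first_rows_nested <- head(example_data)
first_rows_magrittr <- example_data %>% head()
first_rows_native <- example_data |> head()
```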
filter()
# A tibble: 48 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.206 1 a 1 CTRL CTRL 0
2 0.171 1 a 2 crispatus CTRL 0
3 0.136 1 a 3 crispatus CTRL 0
4 0.131 1 a 4 crispatus CTRL 0
5 0.137 1 a 5 jensenii CTRL 0
6 0.14 1 a 6 jensenii CTRL 0
7 0.144 1 a 7 jensenii CTRL 0
8 0.126 1 a 8 iners CTRL 0
9 0.13 1 a 9 iners CTRL 0
10 0.127 1 a 10 iners CTRL 0
# ℹ 38 more rows
# A tibble: 20 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.534 1 b 7 jensenii OA 400
2 0.61 1 f 5 jensenii VCA 400
3 0.626 1 f 6 jensenii VCA 400
4 0.627 1 f 7 jensenii VCA 400
5 0.747 3 b 2 gasseri OA 400
6 0.769 3 b 3 gasseri OA 400
7 0.75 3 b 4 gasseri OA 400
8 0.652 3 b 7 vaginalis OA 400
9 0.631 3 c 2 gasseri OA 200
10 0.607 3 c 3 gasseri OA 200
11 0.624 3 c 4 gasseri OA 200
12 0.908 3 f 2 gasseri VCA 400
13 0.924 3 f 3 gasseri VCA 400
14 0.867 3 f 4 gasseri VCA 400
15 0.73 3 g 2 gasseri VCA 200
16 0.764 3 g 3 gasseri VCA 200
17 0.725 3 g 4 gasseri VCA 200
18 0.688 4 b 2 gasseri VCA 100
19 0.636 4 b 3 gasseri VCA 100
20 0.591 4 b 4 gasseri VCA 100
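A minimal sketch of filter() on a small stand-in table (rows copied from the outputs). The conditions used here are illustrative and not necessarily the ones that produced the tables in this section.

```{r}
library(tidyverse)

example_data <- tibble(
  od = c(0.206, 0.171, 0.534, 0.747, 0.631),
  plate_id = c(1, 1, 1, 3, 3),
  bug = c("CTRL", "crispatus", "jensenii", "gasseri", "gasseri"),
  condition = c("CTRL", "CTRL", "OA", "OA", "OA"),
  conc = c(0, 0, 400, 400, 200)
)

# Keep the wells where no LCFA was added
no_lcfa <- example_data %>% filter(conc == 0)

# Conditions separated by commas must all be true
high_od_gasseri <- example_data %>% filter(bug == "gasseri", od > 0.7)
```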
select()
plate_id
, and od
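A minimal sketch of select() on a stand-in table; the choice of columns is only an example.

```{r}
library(tidyverse)

example_data <- tibble(
  od = c(0.206, 0.171), plate_id = c(1, 1), row = "a", column = c(1, 2),
  bug = c("CTRL", "crispatus")
)

# Keep only the named columns, in the order given
picked <- example_data %>% select(plate_id, od)

# A minus sign drops a column instead
without_row <- example_data %>% select(-row)
```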
arrange()
# A tibble: 384 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.113 1 h 11 CTRL CTRL 0
2 0.113 1 h 12 CTRL CTRL 0
3 0.113 2 h 11 CTRL CTRL 0
4 0.114 1 e 12 CTRL OA 50
5 0.114 1 h 1 CTRL CTRL 0
6 0.114 2 h 1 CTRL CTRL 0
7 0.114 2 h 12 CTRL CTRL 0
8 0.115 1 d 12 CTRL OA 100
9 0.115 2 a 11 CTRL CTRL 0
10 0.115 2 h 5 jensenii CTRL 0
# ℹ 374 more rows
# A tibble: 384 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.924 3 f 3 gasseri VCA 400
2 0.908 3 f 2 gasseri VCA 400
3 0.867 3 f 4 gasseri VCA 400
4 0.769 3 b 3 gasseri OA 400
5 0.764 3 g 3 gasseri VCA 200
6 0.75 3 b 4 gasseri OA 400
7 0.747 3 b 2 gasseri OA 400
8 0.73 3 g 2 gasseri VCA 200
9 0.725 3 g 4 gasseri VCA 200
10 0.688 4 b 2 gasseri VCA 100
# ℹ 374 more rows
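The two tables above are sorted by od, first ascending and then descending. A minimal sketch of both on a stand-in table:

```{r}
library(tidyverse)

example_data <- tibble(
  od = c(0.206, 0.924, 0.113, 0.534),
  bug = c("CTRL", "gasseri", "CTRL", "jensenii")
)

# Smallest od first (the default)
lowest_first <- example_data %>% arrange(od)

# Wrap the column in desc() to put the largest od first
highest_first <- example_data %>% arrange(desc(od))
```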
To demonstrate counting, let’s switch to metadata
# A tibble: 384 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.206 1 a 1 CTRL CTRL 0
2 0.171 1 a 2 crispatus CTRL 0
3 0.136 1 a 3 crispatus CTRL 0
4 0.131 1 a 4 crispatus CTRL 0
5 0.137 1 a 5 jensenii CTRL 0
6 0.14 1 a 6 jensenii CTRL 0
7 0.144 1 a 7 jensenii CTRL 0
8 0.126 1 a 8 iners CTRL 0
9 0.13 1 a 9 iners CTRL 0
10 0.127 1 a 10 iners CTRL 0
# ℹ 374 more rows
# A tibble: 91 × 4
condition conc bug n
<chr> <dbl> <chr> <int>
1 CTRL 0 CTRL 24
2 CTRL 0 crispatus 12
3 CTRL 0 gasseri 12
4 CTRL 0 iners 12
5 CTRL 0 jensenii 12
6 CTRL 0 piotii 12
7 CTRL 0 vaginalis 12
8 LNA 50 CTRL 6
9 LNA 50 crispatus 3
10 LNA 50 gasseri 3
# ℹ 81 more rows
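A minimal sketch of count() on a stand-in table. It returns one row per combination of the listed columns, with the number of rows in a new column n.

```{r}
library(tidyverse)

example_metadata <- tibble(
  bug = c("CTRL", "crispatus", "crispatus", "crispatus", "jensenii", "jensenii"),
  condition = "CTRL",
  conc = 0
)

bug_counts <- example_metadata %>% count(condition, conc, bug)
bug_counts
```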
# A tibble: 48 × 7
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.171 1 a 2 crispatus CTRL 0
2 0.136 1 a 3 crispatus CTRL 0
3 0.131 1 a 4 crispatus CTRL 0
4 0.453 1 b 2 crispatus OA 400
5 0.478 1 b 3 crispatus OA 400
6 0.416 1 b 4 crispatus OA 400
7 0.383 1 c 2 crispatus OA 200
8 0.401 1 c 3 crispatus OA 200
9 0.37 1 c 4 crispatus OA 200
10 0.303 1 d 2 crispatus OA 100
# ℹ 38 more rows
```{r}
joined_data %>%
  filter(bug == "crispatus") %>%  # keep only the crispatus wells
  filter(conc > 50) %>%           # keep concentrations above 50 uM
  select(plate_id, bug, condition)
```
# A tibble: 27 × 3
plate_id bug condition
<dbl> <chr> <chr>
1 1 crispatus OA
2 1 crispatus OA
3 1 crispatus OA
4 1 crispatus OA
5 1 crispatus OA
6 1 crispatus OA
7 1 crispatus OA
8 1 crispatus OA
9 1 crispatus OA
10 1 crispatus VCA
# ℹ 17 more rows
mutate()
The conc
column is in units of uM. What if you needed it in mM? What’s the calculation?
To get mM you would divide by 1000.
# A tibble: 5 × 7
plate_id row column bug condition conc conc_mM
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 2 c 6 jensenii VCA 50 0.05
2 1 g 8 iners VCA 200 0.2
3 1 b 9 iners OA 400 0.4
4 4 f 3 gasseri LNA 100 0.1
5 4 h 3 gasseri CTRL 0 0
# A tibble: 5 × 8
plate_id row column bug condition conc conc_mM conc_nM
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2 d 10 iners LNA 400 0.4 400000
2 4 b 8 piotii VCA 100 0.1 100000
3 2 a 12 CTRL CTRL 0 0 0
4 3 h 6 vaginalis CTRL 0 0 0
5 2 f 4 crispatus LNA 100 0.1 100000
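A minimal sketch of the unit conversions shown in the two tables above, on a stand-in table: dividing by 1000 gives mM and multiplying by 1000 gives nM.

```{r}
library(tidyverse)

example_data <- tibble(
  bug = c("jensenii", "iners", "iners", "gasseri", "gasseri"),
  condition = c("VCA", "VCA", "OA", "LNA", "CTRL"),
  conc = c(50, 200, 400, 100, 0)   # uM
)

converted <- example_data %>%
  mutate(
    conc_mM = conc / 1000,  # uM to mM
    conc_nM = conc * 1000   # uM to nM
  )
converted
```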
Write code to answer the following questions:
How many different concentrations of LCFA are tested?
How many different LCFAs are tested on each plate?
What bug has the highest OD seen in all the plates?
What bug has the highest OD when no LCFA is added?
What control well with no bug and no LCFA has the highest OD?
od | plate_id | row | column | bug | condition | conc |
---|---|---|---|---|---|---|
0.206 | 1 | a | 1 | CTRL | CTRL | 0 |
0.171 | 1 | a | 2 | crispatus | CTRL | 0 |
0.136 | 1 | a | 3 | crispatus | CTRL | 0 |
0.131 | 1 | a | 4 | crispatus | CTRL | 0 |
0.137 | 1 | a | 5 | jensenii | CTRL | 0 |
0.140 | 1 | a | 6 | jensenii | CTRL | 0 |
Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019
ggplot
aes()
age | sex | class | survived |
---|---|---|---|
0.17 | female | 3rd | survived |
0.33 | male | 3rd | died |
0.80 | male | 2nd | survived |
0.83 | male | 2nd | survived |
0.83 | male | 3rd | survived |
0.92 | male | 1st | survived |
1.00 | female | 2nd | survived |
1.00 | female | 3rd | survived |
1.00 | male | 2nd | survived |
1.00 | male | 2nd | survived |
1.00 | male | 3rd | survived |
1.50 | female | 3rd | died |
age | sex | class | survived |
---|---|---|---|
1.5 | female | 3rd | died |
2.0 | female | 1st | died |
2.0 | female | 2nd | survived |
2.0 | female | 3rd | died |
2.0 | female | 3rd | died |
2.0 | male | 2nd | survived |
2.0 | male | 2nd | survived |
2.0 | male | 2nd | survived |
3.0 | female | 2nd | survived |
3.0 | female | 3rd | survived |
3.0 | male | 2nd | survived |
3.0 | male | 2nd | survived |
age | sex | class | survived |
---|---|---|---|
3 | male | 3rd | survived |
3 | male | 3rd | survived |
4 | female | 2nd | survived |
4 | female | 2nd | survived |
4 | female | 3rd | survived |
4 | female | 3rd | survived |
4 | male | 1st | survived |
4 | male | 3rd | died |
4 | male | 3rd | survived |
5 | female | 3rd | survived |
5 | female | 3rd | survived |
5 | male | 3rd | died |
geom_histogram()
Do you like where the bins are? What does the first bin say?
Set center as well, to half the binwidth.
Setting center = 2.5 makes the bars cover 0-5, 5-10, etc. instead of 2.5-7.5, etc. You could instead use the argument boundary = 5 to get the same behavior.
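A minimal sketch of a histogram with the bin settings discussed here, assuming a bin width of 5. The table passengers is a stand-in holding a handful of ages copied from the passenger tables.

```{r}
library(tidyverse)

# A small subset of ages, for illustration only
passengers <- tibble(age = c(0.17, 0.33, 0.8, 0.83, 0.92, 1, 1.5, 2, 3, 4, 5))

age_histogram <- ggplot(passengers, aes(x = age)) +
  geom_histogram(binwidth = 5, center = 2.5)  # bin edges fall on multiples of 5

age_histogram
```

Replacing center = 2.5 with boundary = 5 should place the bin edges in the same positions.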
geom_density()
geom_density()
without fill
A boxplot is a crude way of visualizing a distribution.
A violin plot is a density plot rotated 90 degrees and then mirrored.
ggplot2
Plot type | Geom | Notes |
---|---|---|
boxplot | geom_boxplot() | |
violin plot | geom_violin() | |
strip chart | geom_point() | Jittering requires position_jitter() |
sina plot | geom_sina() | From package ggforce |
scatter-density plot | geom_quasirandom() | From package ggbeeswarm |
ridgeline | geom_density_ridges() | From package ggridges |
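A minimal sketch of the three geoms that ship with ggplot2 itself, on a stand-in table of od values copied from the plate reader output. The jitter width is an arbitrary choice.

```{r}
library(tidyverse)

example_data <- tibble(
  bug = rep(c("crispatus", "jensenii", "iners"), each = 3),
  od = c(0.171, 0.136, 0.131, 0.137, 0.14, 0.144, 0.126, 0.13, 0.127)
)

base_plot <- ggplot(example_data, aes(x = bug, y = od))

boxplot_example <- base_plot + geom_boxplot()
violin_example <- base_plot + geom_violin()

# Strip chart: jitter sideways only, so the od values stay exact
strip_example <- base_plot +
  geom_point(position = position_jitter(width = 0.2, height = 0))
```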
geom_quasirandom
Get with a group of 2-3 people. Go to the activity and pick an option to do together.
group_by() and summarize()
Previously we used count, now we group the data
# A tibble: 384 × 7
# Groups: bug, conc, condition [91]
od plate_id row column bug condition conc
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 0.206 1 a 1 CTRL CTRL 0
2 0.171 1 a 2 crispatus CTRL 0
3 0.136 1 a 3 crispatus CTRL 0
4 0.131 1 a 4 crispatus CTRL 0
5 0.137 1 a 5 jensenii CTRL 0
6 0.14 1 a 6 jensenii CTRL 0
7 0.144 1 a 7 jensenii CTRL 0
8 0.126 1 a 8 iners CTRL 0
9 0.13 1 a 9 iners CTRL 0
10 0.127 1 a 10 iners CTRL 0
# ℹ 374 more rows
Previously we used count, now we group the data, and then summarise
```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    n = n() # n() returns the number of observations per group
  )
```
# A tibble: 91 × 4
# Groups: bug, conc [35]
bug conc condition n
<chr> <dbl> <chr> <int>
1 CTRL 0 CTRL 24
2 CTRL 50 LNA 6
3 CTRL 50 OA 6
4 CTRL 50 VCA 6
5 CTRL 100 LNA 6
6 CTRL 100 OA 6
7 CTRL 100 VCA 6
8 CTRL 200 LNA 6
9 CTRL 200 OA 6
10 CTRL 200 VCA 6
# ℹ 81 more rows
# A tibble: 91 × 4
# Groups: bug, conc [35]
bug conc condition median_od
<chr> <dbl> <chr> <dbl>
1 CTRL 0 CTRL 0.117
2 CTRL 50 LNA 0.118
3 CTRL 50 OA 0.118
4 CTRL 50 VCA 0.119
5 CTRL 100 LNA 0.120
6 CTRL 100 OA 0.118
7 CTRL 100 VCA 0.119
8 CTRL 200 LNA 0.122
9 CTRL 200 OA 0.122
10 CTRL 200 VCA 0.122
# ℹ 81 more rows
```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    n = n(),
    median_od = median(od)
  )
```
# A tibble: 91 × 5
# Groups: bug, conc [35]
bug conc condition n median_od
<chr> <dbl> <chr> <int> <dbl>
1 CTRL 0 CTRL 24 0.117
2 CTRL 50 LNA 6 0.118
3 CTRL 50 OA 6 0.118
4 CTRL 50 VCA 6 0.119
5 CTRL 100 LNA 6 0.120
6 CTRL 100 OA 6 0.118
7 CTRL 100 VCA 6 0.119
8 CTRL 200 LNA 6 0.122
9 CTRL 200 OA 6 0.122
10 CTRL 200 VCA 6 0.122
# ℹ 81 more rows
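The outputs above still carry a "# Groups:" line because summarise() removes only the last grouping variable. A minimal sketch on a stand-in table, using the .groups argument to drop the remaining grouping; whether you want that depends on what you do next.

```{r}
library(tidyverse)

example_data <- tibble(
  bug = rep(c("crispatus", "jensenii"), each = 3),
  conc = 0,
  od = c(0.171, 0.136, 0.131, 0.137, 0.14, 0.144)
)

summary_table <- example_data %>%
  group_by(bug, conc) %>%
  summarise(
    n = n(),
    median_od = median(od),
    .groups = "drop"  # return an ungrouped table
  )
summary_table
```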
Make a code block and make a variable called media_background_medians
that has one row for every combination of plate, lcfa, and conc that gives the median OD measured for those conditions.
Bonus: make a histogram of the media backgrounds (before summarising) for each condition and concentration. Try with a facet by plate and without.
Now make a variable called bug_no_lcfa_control. Join the media_background_medians to the joined_data, and mutate a column that calculates the od - media_background. Filter this table so it only has the no LCFA control conditions for each bug on each plate. Then group_by bug and plate_id and get the median of each background-subtracted od.
Bonus: make a histogram before summarising of the ods with background subtracted for each bug on each plate.
Make a table that has the relative growth compared to no LCFA of each bug for each concentration. (Hint: There should be three rows per condition+bug+concentration.) Make a plot showing the relative growths. (e.g. x axis is concentration, y axis relative growth, facet by bug+concentration, and pick a geom to use to show the data.)
Investigate why some relative growths are so high. Think about how you might tweak the analysis to handle that issue.
pivot_wider()
and pivot_longer()
```{r}
joined_data %>%
  count(plate_id, bug, conc, condition) %>%
  pivot_wider(names_from = plate_id, values_from = n)
```
# A tibble: 91 × 7
bug conc condition `1` `2` `3` `4`
<chr> <dbl> <chr> <int> <int> <int> <int>
1 CTRL 0 CTRL 6 6 6 6
2 CTRL 50 OA 3 NA 3 NA
3 CTRL 100 OA 3 NA 3 NA
4 CTRL 200 OA 3 NA 3 NA
5 CTRL 200 VCA 3 NA 3 NA
6 CTRL 400 OA 3 NA 3 NA
7 CTRL 400 VCA 3 NA 3 NA
8 crispatus 0 CTRL 6 6 NA NA
9 crispatus 50 OA 3 NA NA NA
10 crispatus 100 OA 3 NA NA NA
# ℹ 81 more rows
```{r}
joined_data %>%
  count(plate_id, bug, conc, condition) %>%
  pivot_wider(names_from = bug, values_from = n)
```
# A tibble: 28 × 10
plate_id conc condition CTRL crispatus iners jensenii gasseri piotii
<dbl> <dbl> <chr> <int> <int> <int> <int> <int> <int>
1 1 0 CTRL 6 6 6 6 NA NA
2 1 50 OA 3 3 3 3 NA NA
3 1 100 OA 3 3 3 3 NA NA
4 1 200 OA 3 3 3 3 NA NA
5 1 200 VCA 3 3 3 3 NA NA
6 1 400 OA 3 3 3 3 NA NA
7 1 400 VCA 3 3 3 3 NA NA
8 2 0 CTRL 6 6 6 6 NA NA
9 2 50 LNA 3 3 3 3 NA NA
10 2 50 VCA 3 3 3 3 NA NA
# ℹ 18 more rows
# ℹ 1 more variable: vaginalis <int>
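pivot_longer() goes the other way. A minimal sketch, using a stand-in wide table with two of the bug columns from the output above; the table names are hypothetical.

```{r}
library(tidyverse)

wide_counts <- tibble(
  conc = c(0, 50),
  condition = c("CTRL", "OA"),
  crispatus = c(6L, 3L),
  iners = c(6L, 3L)
)

# Column names go into "bug", cell values go into "n"
long_counts <- wide_counts %>%
  pivot_longer(cols = c(crispatus, iners), names_to = "bug", values_to = "n")
long_counts
```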
The differences are all about how to handle rows when the two tables have different key values:
left_join() – the resulting table always has the same key values as the “left” table
right_join() – the resulting table always has the same key values as the “right” table
inner_join() – the resulting table keeps only the key values that are in both tables
full_join() – the resulting table has all key values found in either table
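A small sketch comparing the four joins on two hypothetical tables whose keys only partly overlap: well "a3" exists only in the left table and well "a4" only in the right one.

```{r}
library(tidyverse)

measurements <- tibble(well = c("a1", "a2", "a3"), od = c(0.206, 0.171, 0.136))
annotations <- tibble(well = c("a1", "a2", "a4"), bug = c("CTRL", "crispatus", "crispatus"))

left_result <- left_join(measurements, annotations, by = "well")    # wells a1, a2, a3
right_result <- right_join(measurements, annotations, by = "well")  # wells a1, a2, a4
inner_result <- inner_join(measurements, annotations, by = "well")  # wells a1, a2
full_result <- full_join(measurements, annotations, by = "well")    # wells a1, a2, a3, a4
```

Keys with no match on the other side get NA in the columns that come from that side.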