`tidyverse` for data analysis

remixed from Claus O. Wilke’s SDS375 course and Andrew P. Bray’s quarto workshop

Workshop materials are here:

https://kwondry.github.io/documentation/

Goals for this session

RStudio and the Quarto notebook
Loading and writing tabular data
Data wrangling and make plots with the tidyverse
Tables and statistics

Create an R Project

Create a project for today’s workshop and download the data.

Create an R Project

File -> New Project…

Create an R Project

Click on New Directory

Create an R Project

Navigate to the workshop_2 folder name your directory and click “Create Project”

Create an R Project

You made a project! This creates a file for you with the .qmd extension

Create an R Project

Switch from “visual” to “source” to see the plain-text version of this document.

Create an R Project

Click on “Render” to ask Quarto to turn this plain-text document into an HTML page

Create an R Project

Your default web-browser will open and show you the rendered document!

RStudio

What are the parts of RStudio?

The text editor

The console

The right panes

Installing and loading packages

Packages are a collection of functions and objects that are shared for free to use.

In the console, you can type e.g. install.packages("tidyverse") to install most R packages.

Sometimes R packages need to be installed a different way, and the documentation of the package will tell you how.

Then, to load a package, add library("tidyverse") in a code chunk (usually in the first code cell of your document)

Quarto’s Code Chunk

You can quickly insert chunks like these into your file with

the keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I)
the Add Chunk command in the editor toolbar
or by typing the chunk delimiters
```
```{r}
```
```
```
```

Example chunk:

```{r}
10 + 5
```

Try writing a math expression in a code chunk, and hitting the button or Ctrl + Alt + Enter (OS X: Cmd + Option + Enter) to run the chunk.

Writing code: assigning variables

You can use <- or = to assign values to variables

a <- 6
b = 8
c <- 5.44
d = TRUE
e = "hello world" 
e <- 'hello world' # same as double quote

We will use <- for all examples going forward.

Naming variables

A lot of R people use . inside variable names, but in most languages besides R this would be an error. It’s good practice these days to use the _ underscore if you want separation in your variable names.

r.people.sometimes.put.dots <- TRUE
dots.are.confusing <- "maybe"
just_use_underscores <- "please"

Functions

Functions are named bits of code that take parameters as input and return some output

library(tidyverse)
word_1 <- "hello"
word_2 <- "world"
str_c(word_1, word_2, sep = " ")

[1] "hello world"

str_c is a function that puts concatenates strings.

functions can have named parameters as well as positional parameters.
named parameters always take an = sign for assignment.

Getting help with functions

Type ?str_c in the console to get a help page. check out this guide on how to read the R help pages.
Google! Add “tidyverse” to search queries to get more relevant results.
phind.com and chat.deepseek.com are good free AI services for getting help with code.

Value types in R

The type of the value can be

# numeric
c(1,2,3,4) 

# character
c("a","b","c","d")

# boolean
c(TRUE, FALSE)

# factor
c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") %>% as_factor()

Examples of coding

Quick live demo of doing some work in R

Assigning variables
Vectors are magic
Functions and getting help

05:00

Test data for today

Bacterial growth measurements with different species
Measured as an optical density (OD) of the culture at the end of the experiment
Growth measured with different concentrations of different long-chain fatty acids added to the media

Get the data

Download the data from kwondry.github.io/documentation/r-tutorial. Put it in a folder called data inside your R project folder.

codebook:

plate_id – an identifier of which plate was measured. row – row of the plate1
column – column of the plate
bug – the isolate/species that was tested in this well
condition - which long chain fatty acid (LCFA) was added
conc - what was the concentration of the LCFA in this well?
row – row of the plate
column – column of the plate
od - the optical density (OD600) that was measured in this well. This is a measure of bacterial growth.

Read in data

Data is often in tables, and the easiest way to store tabular data is in csv or tsv format.

csv - comma separated values
tsv - tab separated values

to read in data stored this way use read_csv(filename) or read_tsv(filename)

metadata <- read_csv("data/metadata.csv")

`bind`ing tables together

We have data from 4 different plates in separate csv files. Use bind_rows to make a single table with all the data.

plate_1 <- read_csv("data/plate_1_data.csv")
plate_2 <- read_csv("data/plate_2_data.csv")
plate_3 <- read_csv("data/plate_3_data.csv")
plate_4 <- read_csv("data/plate_4_data.csv")

all_plate_data <- bind_rows(plate_1, plate_2, plate_3, plate_4)

`join`ing metadata to the data

Connect the metadata to the plate reader data using left_join

joined_data <- all_plate_data %>%
  left_join(metadata)

`tibble`s (aka data frames)

tibbles are the big reason R is great for working with tabular data.

A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).

joined_data

# A tibble: 384 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 374 more rows

Elementary data manipulations

Pick rows: filter()
Pick columns: select()
Sort rows: arrange()
Count things: count()
Make new columns: mutate()

The pipe `%>%` or `|>` feeds data into functions

```{r}
head(joined_data) # show the first few rows
```

# A tibble: 6 × 7
     od plate_id row   column bug       condition  conc
  <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
1 0.206        1 a          1 CTRL      CTRL          0
2 0.171        1 a          2 crispatus CTRL          0
3 0.136        1 a          3 crispatus CTRL          0
4 0.131        1 a          4 crispatus CTRL          0
5 0.137        1 a          5 jensenii  CTRL          0
6 0.14         1 a          6 jensenii  CTRL          0

The pipe `%>%` or `|>` feeds data into functions

```{r}
joined_data %>%
  head()
```

# A tibble: 6 × 7
     od plate_id row   column bug       condition  conc
  <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
1 0.206        1 a          1 CTRL      CTRL          0
2 0.171        1 a          2 crispatus CTRL          0
3 0.136        1 a          3 crispatus CTRL          0
4 0.131        1 a          4 crispatus CTRL          0
5 0.137        1 a          5 jensenii  CTRL          0
6 0.14         1 a          6 jensenii  CTRL          0

The pipe `%>%` or `|>` feeds data into functions

```{r}
joined_data %>%
  head() %>%
  tail(1)
```

# A tibble: 1 × 7
     od plate_id row   column bug      condition  conc
  <dbl>    <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>
1  0.14        1 a          6 jensenii CTRL          0

Picking rows or columns, and sorting

Pick rows from a table: `filter()`

Filter only the first row of plates

```{r}
joined_data %>%
  filter(row == "a")
```

# A tibble: 48 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 38 more rows

Filter out samples with OD > 0.5

```{r}
joined_data %>%
  filter(od > 0.5)
```

# A tibble: 20 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.534        1 b          7 jensenii  OA          400
 2 0.61         1 f          5 jensenii  VCA         400
 3 0.626        1 f          6 jensenii  VCA         400
 4 0.627        1 f          7 jensenii  VCA         400
 5 0.747        3 b          2 gasseri   OA          400
 6 0.769        3 b          3 gasseri   OA          400
 7 0.75         3 b          4 gasseri   OA          400
 8 0.652        3 b          7 vaginalis OA          400
 9 0.631        3 c          2 gasseri   OA          200
10 0.607        3 c          3 gasseri   OA          200
11 0.624        3 c          4 gasseri   OA          200
12 0.908        3 f          2 gasseri   VCA         400
13 0.924        3 f          3 gasseri   VCA         400
14 0.867        3 f          4 gasseri   VCA         400
15 0.73         3 g          2 gasseri   VCA         200
16 0.764        3 g          3 gasseri   VCA         200
17 0.725        3 g          4 gasseri   VCA         200
18 0.688        4 b          2 gasseri   VCA         100
19 0.636        4 b          3 gasseri   VCA         100
20 0.591        4 b          4 gasseri   VCA         100

Pick columns from a table: `select()`

Pick columns `plate_id`, and `od`

```{r}
joined_data %>%
  select(plate_id, od)
```

# A tibble: 384 × 2
   plate_id    od
      <dbl> <dbl>
 1        1 0.206
 2        1 0.171
 3        1 0.136
 4        1 0.131
 5        1 0.137
 6        1 0.14 
 7        1 0.144
 8        1 0.126
 9        1 0.13 
10        1 0.127
# ℹ 374 more rows

Rename columns and subset with select

```{r}
joined_data %>%
  select(plate_id, optical_density=od)
```

# A tibble: 384 × 2
   plate_id optical_density
      <dbl>           <dbl>
 1        1           0.206
 2        1           0.171
 3        1           0.136
 4        1           0.131
 5        1           0.137
 6        1           0.14 
 7        1           0.144
 8        1           0.126
 9        1           0.13 
10        1           0.127
# ℹ 374 more rows

Sort the rows in a table: `arrange()`

Sort samples by OD ascending

```{r}
joined_data %>%
  arrange(od)
```

# A tibble: 384 × 7
      od plate_id row   column bug      condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>
 1 0.113        1 h         11 CTRL     CTRL          0
 2 0.113        1 h         12 CTRL     CTRL          0
 3 0.113        2 h         11 CTRL     CTRL          0
 4 0.114        1 e         12 CTRL     OA           50
 5 0.114        1 h          1 CTRL     CTRL          0
 6 0.114        2 h          1 CTRL     CTRL          0
 7 0.114        2 h         12 CTRL     CTRL          0
 8 0.115        1 d         12 CTRL     OA          100
 9 0.115        2 a         11 CTRL     CTRL          0
10 0.115        2 h          5 jensenii CTRL          0
# ℹ 374 more rows

Sort samples by OD, descending

```{r}
joined_data %>%
  arrange(desc(od))
```

# A tibble: 384 × 7
      od plate_id row   column bug     condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>   <chr>     <dbl>
 1 0.924        3 f          3 gasseri VCA         400
 2 0.908        3 f          2 gasseri VCA         400
 3 0.867        3 f          4 gasseri VCA         400
 4 0.769        3 b          3 gasseri OA          400
 5 0.764        3 g          3 gasseri VCA         200
 6 0.75         3 b          4 gasseri OA          400
 7 0.747        3 b          2 gasseri OA          400
 8 0.73         3 g          2 gasseri VCA         200
 9 0.725        3 g          4 gasseri VCA         200
10 0.688        4 b          2 gasseri VCA         100
# ℹ 374 more rows

Counting things

To demonstrate counting, let’s switch to metadata

```{r}
joined_data
```

# A tibble: 384 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 374 more rows

Counting things

```{r}
joined_data %>%
  count(bug)
```

# A tibble: 7 × 2
  bug           n
  <chr>     <int>
1 CTRL         96
2 crispatus    48
3 gasseri      48
4 iners        48
5 jensenii     48
6 piotii       48
7 vaginalis    48

Counting things

```{r}
joined_data %>%
  count(condition, conc, bug)
```

# A tibble: 91 × 4
   condition  conc bug           n
   <chr>     <dbl> <chr>     <int>
 1 CTRL          0 CTRL         24
 2 CTRL          0 crispatus    12
 3 CTRL          0 gasseri      12
 4 CTRL          0 iners        12
 5 CTRL          0 jensenii     12
 6 CTRL          0 piotii       12
 7 CTRL          0 vaginalis    12
 8 LNA          50 CTRL          6
 9 LNA          50 crispatus     3
10 LNA          50 gasseri       3
# ℹ 81 more rows

Use the pipe to build analysis pipelines

```{r}
joined_data %>%
  filter(bug == "crispatus")
```

# A tibble: 48 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.171        1 a          2 crispatus CTRL          0
 2 0.136        1 a          3 crispatus CTRL          0
 3 0.131        1 a          4 crispatus CTRL          0
 4 0.453        1 b          2 crispatus OA          400
 5 0.478        1 b          3 crispatus OA          400
 6 0.416        1 b          4 crispatus OA          400
 7 0.383        1 c          2 crispatus OA          200
 8 0.401        1 c          3 crispatus OA          200
 9 0.37         1 c          4 crispatus OA          200
10 0.303        1 d          2 crispatus OA          100
# ℹ 38 more rows

Use the pipe to build analysis pipelines

```{r}
joined_data %>%
  filter(bug == "crispatus") %>%
  filter(conc > 50) %>%
  select(plate_id, bug, condition)
```

# A tibble: 27 × 3
   plate_id bug       condition
      <dbl> <chr>     <chr>    
 1        1 crispatus OA       
 2        1 crispatus OA       
 3        1 crispatus OA       
 4        1 crispatus OA       
 5        1 crispatus OA       
 6        1 crispatus OA       
 7        1 crispatus OA       
 8        1 crispatus OA       
 9        1 crispatus OA       
10        1 crispatus VCA      
# ℹ 17 more rows

Use the pipe to build analysis pipelines

```{r}
joined_data %>%
  filter(bug == "crispatus") %>%
  filter(conc > 50) %>%
  select(plate_id, bug, condition) %>%
  count(condition)
```

# A tibble: 3 × 2
  condition     n
  <chr>     <int>
1 LNA           9
2 OA            9
3 VCA           9

Adding new columns to a table

Make a new table column: `mutate()`

Example: concentration

The conc column is in units of uM. What if you needed it in mM? What’s the calculation?

```{r}
sample_n(metadata,5)
```

# A tibble: 5 × 6
  plate_id row   column bug      condition  conc
     <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>
1        3 f          4 gasseri  VCA         400
2        3 c          9 piotii   OA          200
3        4 f          3 gasseri  LNA         100
4        4 e          4 gasseri  LNA         200
5        2 b          5 jensenii VCA         100

Example: concentration

The conc column is in units of uM. What if you needed it in mM? What’s the calculation?

To get mM you would divide by 1000.

```{r}
sample_n(metadata,5)
```

# A tibble: 5 × 6
  plate_id row   column bug      condition  conc
     <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>
1        3 f          3 gasseri  VCA         400
2        4 g          8 piotii   LNA          50
3        2 g          7 jensenii LNA          50
4        2 f         10 iners    LNA         100
5        3 a          3 gasseri  CTRL          0

Example: concentration

The conc column is in units of uM. What if you needed it in mM? What’s the calculation?

To get mM you would divide by 1000.

```{r}
sample_n(metadata,5) %>%
  mutate(conc_mM = conc/1000)
```

# A tibble: 5 × 7
  plate_id row   column bug       condition  conc conc_mM
     <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>   <dbl>
1        4 h          3 gasseri   CTRL          0     0  
2        1 h          1 CTRL      CTRL          0     0  
3        3 h          5 vaginalis CTRL          0     0  
4        1 g          4 crispatus VCA         200     0.2
5        1 d          1 CTRL      OA          100     0.1

Make multiple columns at once

```{r}
sample_n(metadata,5) %>%
  mutate(conc_mM = conc/1000, conc_nM = conc*1000)
```

# A tibble: 5 × 8
  plate_id row   column bug       condition  conc conc_mM conc_nM
     <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>   <dbl>   <dbl>
1        2 h         12 CTRL      CTRL          0    0          0
2        2 h          1 CTRL      CTRL          0    0          0
3        2 b          9 iners     VCA         100    0.1   100000
4        3 e          6 vaginalis OA           50    0.05   50000
5        4 e          5 vaginalis LNA         200    0.2   200000

Your turn to do some data wrangling

15:00

Write code to answer the following questions?

How many different concentrations of LCFA are tested?
How many different LCFAs are tested on each plate?
What bug has the highest OD seen in all the plates?
What bug has the highest OD when no LCFA is added?
What control well with no bug and no LCFA has the highest OD?

Break

15:00

Aesthetics - the elements of data visualization

Plots map data onto graphical elements.

joined_data`
od	plate_id	row	column	bug	condition
0.206	1	a	1	CTRL	CTRL
0.171	1	a	2	crispatus	CTRL
0.136	1	a	3	crispatus	CTRL
0.131	1	a	4	crispatus	CTRL
0.137	1	a	5	jensenii	CTRL
0.140	1	a	6	jensenii	CTRL

od mapped to y position

conc mapped to color

Commonly used aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

The same data values can be mapped to different aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

We can use many different aesthetics at once

Creating aesthetic mappings in `ggplot`

We define the mapping with `aes()`

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter()
```

The geom determines how the data is shown

```{r}
ggplot(joined_data, mapping = aes(x = conc, y = od, color = bug)) +
  geom_point()
```

The geom determines how the data is shown

```{r}
ggplot(joined_data, mapping = aes(x = as.factor(conc), y = od, color = bug)) +
  geom_boxplot()
```

The geom determines how the data is shown

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter()
```

Different geoms have parameters for control

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3)
```

Different geoms have parameters for control

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3, width=10)
```

Aesthetics can also be used as parameters in geoms

```{r}
#| output-location: column
ggplot(joined_data, mapping = aes(x = as.factor(conc), y = od, color = bug)) +
  geom_boxplot(fill="orange")
```

Faceting makes subplots for comparisons.

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3, width=10) +
  facet_wrap(~condition)
```

Faceting makes subplots for comparisons.

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3, width=10) +
  facet_wrap(~condition+bug, ncol=6)
```

Visualizing distributions

Histograms and density plots

age	sex	class	survived
0.17	female	3rd	survived
0.33	male	3rd	died
0.80	male	2nd	survived
0.83	male	2nd	survived
0.83	male	3rd	survived
0.92	male	1st	survived
1.00	female	2nd	survived
1.00	female	3rd	survived
1.00	male	2nd	survived
1.00	male	2nd	survived
1.00	male	3rd	survived
1.50	female	3rd	died

age	sex	class	survived
1.5	female	3rd	died
2.0	female	1st	died
2.0	female	2nd	survived
2.0	female	3rd	died
2.0	female	3rd	died
2.0	male	2nd	survived
2.0	male	2nd	survived
2.0	male	2nd	survived
3.0	female	2nd	survived
3.0	female	3rd	survived
3.0	male	2nd	survived
3.0	male	2nd	survived

age	sex	class	survived
3	male	3rd	survived
3	male	3rd	survived
4	female	2nd	survived
4	female	2nd	survived
4	female	3rd	survived
4	female	3rd	survived
4	male	1st	survived
4	male	3rd	died
4	male	3rd	survived
5	female	3rd	survived
5	female	3rd	survived
5	male	3rd	died

Histograms depend on the chosen bin width

Histograms and density plots in ggplot2

Making histograms with ggplot: `geom_histogram()`

```{r}
ggplot(titanic, aes(age)) +
  geom_histogram()
```

Setting the bin width

```{r}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5)
```

Do you like where there bins are? What does the first bin say?

Always set the `center` as well, to half the `bin_width`

```{r}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center=2.5)
```

Setting center 2.5 makes the bars start 0-5, 5-10, etc. instead of 2.5-7.5, etc. You could instead use the argument boundary=5 to accomplish the same behavior.

Making density plots with ggplot: `geom_density()`

```{r}
ggplot(titanic, aes(age)) +
  geom_density(fill = "skyblue")
```

Making density plots with ggplot: `geom_density()`

```{r}
ggplot(titanic, aes(age)) +
  geom_density()
```

without fill

Boxplots: Showing values along y, conditions along x

A boxplot is a crude way of visualizing a distribution.

How to read a boxplot

If you like density plots, consider violins

A violin plot is a density plot rotated 90 degrees and then mirrored.

How to read a violin plot

Making boxplots, violins, etc. in `ggplot2`

Making boxplots, violins, etc. in ggplot2

Plot type	Geom	Notes
boxplot	`geom_boxplot()`
violin plot	`geom_violin()`
strip chart	`geom_point()`	Jittering requires `position_jitter()`
sina plot	`geom_sina()`	From package ggforce
scatter-density plot	`geom_quasirandom()`	From package ggbeeswarm
ridgeline	`geom_density_ridges()`	From package ggridges

Examples: Boxplot

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_boxplot(fill = "skyblue") 
```

Examples: Violins

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_violin(fill = "skyblue") 
```

Examples: Strip chart (no jitter)

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_point(color = "skyblue") 
```

Examples: ggbeeswarm’s `geom_quasirandom`

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  ggbeeswarm::geom_quasirandom(color = "skyblue") 
```

Group Pen and Paper exercise

10:00

Get with a group of 2-3 people. Go to the activity and pick an option to do together.

Analyze subsets: `group_by()` and `summarize()`

Example application of grouping: Counting

Previously we used count, now we group the data

```{r}
joined_data %>%
  group_by(bug, conc, condition)
```

# A tibble: 384 × 7
# Groups:   bug, conc, condition [91]
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 374 more rows

Example application of grouping: Counting

Previously we used count, now we group the data, and then summarise

```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    n = n() # n() returns the number of observations per group
    )
```

# A tibble: 91 × 4
# Groups:   bug, conc [35]
   bug    conc condition     n
   <chr> <dbl> <chr>     <int>
 1 CTRL      0 CTRL         24
 2 CTRL     50 LNA           6
 3 CTRL     50 OA            6
 4 CTRL     50 VCA           6
 5 CTRL    100 LNA           6
 6 CTRL    100 OA            6
 7 CTRL    100 VCA           6
 8 CTRL    200 LNA           6
 9 CTRL    200 OA            6
10 CTRL    200 VCA           6
# ℹ 81 more rows

Example application of grouping: Medians

```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    median_od = median(od) 
    )
```

# A tibble: 91 × 4
# Groups:   bug, conc [35]
   bug    conc condition median_od
   <chr> <dbl> <chr>         <dbl>
 1 CTRL      0 CTRL          0.117
 2 CTRL     50 LNA           0.118
 3 CTRL     50 OA            0.118
 4 CTRL     50 VCA           0.119
 5 CTRL    100 LNA           0.120
 6 CTRL    100 OA            0.118
 7 CTRL    100 VCA           0.119
 8 CTRL    200 LNA           0.122
 9 CTRL    200 OA            0.122
10 CTRL    200 VCA           0.122
# ℹ 81 more rows

You can make multiple summaries at once

```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    n = n(), 
    median_od = median(od) 
    )
```

# A tibble: 91 × 5
# Groups:   bug, conc [35]
   bug    conc condition     n median_od
   <chr> <dbl> <chr>     <int>     <dbl>
 1 CTRL      0 CTRL         24     0.117
 2 CTRL     50 LNA           6     0.118
 3 CTRL     50 OA            6     0.118
 4 CTRL     50 VCA           6     0.119
 5 CTRL    100 LNA           6     0.120
 6 CTRL    100 OA            6     0.118
 7 CTRL    100 VCA           6     0.119
 8 CTRL    200 LNA           6     0.122
 9 CTRL    200 OA            6     0.122
10 CTRL    200 VCA           6     0.122
# ℹ 81 more rows

Get background OD for each media on each plate

Make a code block and make a variable called media_background_medians that has one row for every combination of plate, lcfa, and conc that gives the median OD measured for those conditions.

Bonus: make a histogram of the media backgrounds (before summarising) for each condition and concentration. Try with a facet by plate and without.

Get no-LCFA OD for each bug on each plate

Now make a variable called bug_no_lcfa_control. Join the media_background_medians to the joined_data, and mutate a column that caluclate the od - media_background. Filter this table so it only has the no LCFA control conditions for each bug on each plate. Then group_by bug and plate_id and get the median of each background subtracted od.

Bonus: make a histogram before summarising of the ods with background subtracted for each bug on each plate.

Challenge: calculate relative growths

Make a table that has the relative growth compared to no LCFA of each bug for each concentration. (Hint: There should be three rows per condition+bug+concentration.) Make a plot showing the relative growths. (e.g. x axis is concentration, y axis relative growth, facet by bug+concentration, and pick a geom to use to show the data.)

Investigate why some relative growths are so high. Think about how you might tweak the analysis to handle that issue.

Reshape: `pivot_wider()` and `pivot_longer()`

Reshaping example: Making a wide summary table

```{r}
joined_data %>%
  count(plate_id, bug, conc, condition) %>%
  pivot_wider(names_from = plate_id, values_from = n)
```

# A tibble: 91 × 7
   bug        conc condition   `1`   `2`   `3`   `4`
   <chr>     <dbl> <chr>     <int> <int> <int> <int>
 1 CTRL          0 CTRL          6     6     6     6
 2 CTRL         50 OA            3    NA     3    NA
 3 CTRL        100 OA            3    NA     3    NA
 4 CTRL        200 OA            3    NA     3    NA
 5 CTRL        200 VCA           3    NA     3    NA
 6 CTRL        400 OA            3    NA     3    NA
 7 CTRL        400 VCA           3    NA     3    NA
 8 crispatus     0 CTRL          6     6    NA    NA
 9 crispatus    50 OA            3    NA    NA    NA
10 crispatus   100 OA            3    NA    NA    NA
# ℹ 81 more rows

Reshaping example: Making a wide summary table

```{r}
joined_data %>%
  count(plate_id, bug, conc, condition) %>%
  pivot_wider(names_from = bug, values_from = n)
```

# A tibble: 28 × 10
   plate_id  conc condition  CTRL crispatus iners jensenii gasseri piotii
      <dbl> <dbl> <chr>     <int>     <int> <int>    <int>   <int>  <int>
 1        1     0 CTRL          6         6     6        6      NA     NA
 2        1    50 OA            3         3     3        3      NA     NA
 3        1   100 OA            3         3     3        3      NA     NA
 4        1   200 OA            3         3     3        3      NA     NA
 5        1   200 VCA           3         3     3        3      NA     NA
 6        1   400 OA            3         3     3        3      NA     NA
 7        1   400 VCA           3         3     3        3      NA     NA
 8        2     0 CTRL          6         6     6        6      NA     NA
 9        2    50 LNA           3         3     3        3      NA     NA
10        2    50 VCA           3         3     3        3      NA     NA
# ℹ 18 more rows
# ℹ 1 more variable: vaginalis <int>

combining datasets: joins

We use joins to add columns from one table into another

There are different types of joins

The differences are all about how to handle when the two tables have different key values

left_join() - the resulting table always has the same key_values as the “left” table

right_join() - the resulting table always has the same key_values as the “right” table

inner_join() - the resulting table always only keeps the key_values that are in both tables

full_join() - the resulting table always has all key_values found in both tables

Left Join

left_join() - the resulting table always has the same key_values as the “left” table

inner_join

inner_join() - the resulting table always only keeps the key_values that are in both tables

table_a %>% inner_join(table_b)

Note, merging tables vertically is bind_rows(), not a join

table_a %>% bind_rows(table_b)

tidyverse for data analysis

Goals for this session

Create an R Project

Create an R Project

Create an R Project

Create an R Project

Create an R Project

Create an R Project

Create an R Project

Create an R Project

RStudio

What are the parts of RStudio?

The text editor

The console

The right panes

Installing and loading packages

Quarto’s Code Chunk

Writing code: assigning variables

Naming variables

Functions

Getting help with functions

Value types in R

Examples of coding

Test data for today

Get the data

Read in data

binding tables together

joining metadata to the data

tibbles (aka data frames)

Elementary data manipulations

The pipe %>% or |> feeds data into functions

The pipe %>% or |> feeds data into functions

The pipe %>% or |> feeds data into functions

Picking rows or columns, and sorting

Pick rows from a table: filter()

Filter only the first row of plates

Filter out samples with OD > 0.5

Pick columns from a table: select()

Pick columns plate_id, and od

Rename columns and subset with select

Sort the rows in a table: arrange()

Sort samples by OD ascending

Sort samples by OD, descending

Counting things

Counting things

Counting things

Use the pipe to build analysis pipelines

Use the pipe to build analysis pipelines

Use the pipe to build analysis pipelines

Adding new columns to a table

Make a new table column: mutate()

Example: concentration

Example: concentration

Example: concentration

Make multiple columns at once

Your turn to do some data wrangling

Break

Aesthetics - the elements of data visualization

Plots map data onto graphical elements.

od mapped to y position

conc mapped to color

Commonly used aesthetics

The same data values can be mapped to different aesthetics

We can use many different aesthetics at once

Creating aesthetic mappings in ggplot

We define the mapping with aes()

The geom determines how the data is shown

The geom determines how the data is shown

The geom determines how the data is shown

Different geoms have parameters for control

Different geoms have parameters for control

Aesthetics can also be used as parameters in geoms

Faceting makes subplots for comparisons.

Faceting makes subplots for comparisons.

Visualizing distributions

Histograms and density plots

Histograms depend on the chosen bin width

Histograms and density plots in ggplot2

Making histograms with ggplot: geom_histogram()

Setting the bin width

`tidyverse` for data analysis

`bind`ing tables together

`join`ing metadata to the data

`tibble`s (aka data frames)

The pipe `%>%` or `|>` feeds data into functions

The pipe `%>%` or `|>` feeds data into functions

The pipe `%>%` or `|>` feeds data into functions

Pick rows from a table: `filter()`

Pick columns from a table: `select()`

Pick columns `plate_id`, and `od`

Sort the rows in a table: `arrange()`

Make a new table column: `mutate()`

Creating aesthetic mappings in `ggplot`

We define the mapping with `aes()`

Making histograms with ggplot: `geom_histogram()`

Always set the `center` as well, to half the `bin_width`

Making density plots with ggplot: `geom_density()`

Making density plots with ggplot: `geom_density()`

Making boxplots, violins, etc. in `ggplot2`

Examples: ggbeeswarm’s `geom_quasirandom`

Analyze subsets: `group_by()` and `summarize()`

Reshape: `pivot_wider()` and `pivot_longer()`