tidyverse for data analysis

remixed from Claus O. Wilke’s SDS375 course and Andrew P. Bray’s quarto workshop




Workshop materials are here:

https://kwondry.github.io/documentation/



Goals for this session

  1. RStudio and the Quarto notebook

  2. Loading and writing tabular data

  3. Data wrangling and make plots with the tidyverse

  4. Tables and statistics

Create an R Project

Create a project for today’s workshop and download the data.

Create an R Project

  1. File -> New Project…

Create an R Project

  1. Click on New Directory

Create an R Project

  1. Navigate to the workshop_2 folder name your directory and click “Create Project”

Create an R Project

  1. You made a project! This creates a file for you with the .qmd extension

Create an R Project

  1. Switch from “visual” to “source” to see the plain-text version of this document.

Create an R Project

  1. Click on “Render” to ask Quarto to turn this plain-text document into an HTML page

Create an R Project

  1. Your default web-browser will open and show you the rendered document!

RStudio

What are the parts of RStudio?

The text editor

The console

The right panes

Installing and loading packages

Packages are a collection of functions and objects that are shared for free to use.

In the console, you can type e.g. install.packages("tidyverse") to install most R packages.

Sometimes R packages need to be installed a different way, and the documentation of the package will tell you how.

Then, to load a package, add library("tidyverse") in a code chunk (usually in the first code cell of your document)

Quarto’s Code Chunk

You can quickly insert chunks like these into your file with

  • the keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I)
  • the Add Chunk command in the editor toolbar
  • or by typing the chunk delimiters
    ```{r}
    ```

Example chunk:

```{r}
10 + 5
```

Try writing a math expression in a code chunk, and hitting the button or Ctrl + Alt + Enter (OS X: Cmd + Option + Enter) to run the chunk.

Writing code: assigning variables

You can use <- or = to assign values to variables

a <- 6
b = 8
c <- 5.44
d = TRUE
e = "hello world" 
e <- 'hello world' # same as double quote

We will use <- for all examples going forward.

Naming variables

A lot of R people use . inside variable names, but in most languages besides R this would be an error. It’s good practice these days to use the _ underscore if you want separation in your variable names.

r.people.sometimes.put.dots <- TRUE
dots.are.confusing <- "maybe"
just_use_underscores <- "please"

Functions

Functions are named bits of code that take parameters as input and return some output

library(tidyverse)
word_1 <- "hello"
word_2 <- "world"
str_c(word_1, word_2, sep = " ")
[1] "hello world"

str_c is a function that puts concatenates strings.

functions can have named parameters as well as positional parameters.
named parameters always take an = sign for assignment.

Getting help with functions

  • Type ?str_c in the console to get a help page. check out this guide on how to read the R help pages.

  • Google! Add “tidyverse” to search queries to get more relevant results.

  • phind.com and chat.deepseek.com are good free AI services for getting help with code.

Value types in R

The type of the value can be

# numeric
c(1,2,3,4) 

# character
c("a","b","c","d")

# boolean
c(TRUE, FALSE)

# factor
c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") %>% as_factor()

Examples of coding

Quick live demo of doing some work in R

  • Assigning variables
  • Vectors are magic
  • Functions and getting help
05:00

Test data for today

  • Bacterial growth measurements with different species

  • Measured as an optical density (OD) of the culture at the end of the experiment

  • Growth measured with different concentrations of different long-chain fatty acids added to the media

Get the data

Download the data from kwondry.github.io/documentation/r-tutorial. Put it in a folder called data inside your R project folder.

codebook:

plate_id – an identifier of which plate was measured. row – row of the plate1
column – column of the plate
bug – the isolate/species that was tested in this well
condition - which long chain fatty acid (LCFA) was added
conc - what was the concentration of the LCFA in this well?
row – row of the plate
column – column of the plate
od - the optical density (OD600) that was measured in this well. This is a measure of bacterial growth.

Read in data

Data is often in tables, and the easiest way to store tabular data is in csv or tsv format.

csv - comma separated values
tsv - tab separated values

to read in data stored this way use read_csv(filename) or read_tsv(filename)

metadata <- read_csv("data/metadata.csv")

binding tables together

We have data from 4 different plates in separate csv files. Use bind_rows to make a single table with all the data.

plate_1 <- read_csv("data/plate_1_data.csv")
plate_2 <- read_csv("data/plate_2_data.csv")
plate_3 <- read_csv("data/plate_3_data.csv")
plate_4 <- read_csv("data/plate_4_data.csv")

all_plate_data <- bind_rows(plate_1, plate_2, plate_3, plate_4)

joining metadata to the data

Connect the metadata to the plate reader data using left_join

joined_data <- all_plate_data %>%
  left_join(metadata)

tibbles (aka data frames)

tibbles are the big reason R is great for working with tabular data.

A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).

joined_data
# A tibble: 384 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 374 more rows

Elementary data manipulations

  • Pick rows: filter()

  • Pick columns: select()

  • Sort rows: arrange()

  • Count things: count()

  • Make new columns: mutate()

The pipe %>% or |> feeds data into functions

```{r}
head(joined_data) # show the first few rows
```
# A tibble: 6 × 7
     od plate_id row   column bug       condition  conc
  <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
1 0.206        1 a          1 CTRL      CTRL          0
2 0.171        1 a          2 crispatus CTRL          0
3 0.136        1 a          3 crispatus CTRL          0
4 0.131        1 a          4 crispatus CTRL          0
5 0.137        1 a          5 jensenii  CTRL          0
6 0.14         1 a          6 jensenii  CTRL          0

The pipe %>% or |> feeds data into functions

```{r}
joined_data %>%
  head()
```
# A tibble: 6 × 7
     od plate_id row   column bug       condition  conc
  <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
1 0.206        1 a          1 CTRL      CTRL          0
2 0.171        1 a          2 crispatus CTRL          0
3 0.136        1 a          3 crispatus CTRL          0
4 0.131        1 a          4 crispatus CTRL          0
5 0.137        1 a          5 jensenii  CTRL          0
6 0.14         1 a          6 jensenii  CTRL          0

The pipe %>% or |> feeds data into functions

```{r}
joined_data %>%
  head() %>%
  tail(1)
```
# A tibble: 1 × 7
     od plate_id row   column bug      condition  conc
  <dbl>    <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>
1  0.14        1 a          6 jensenii CTRL          0

Picking rows or columns, and sorting

Pick rows from a table: filter()

Filter only the first row of plates

```{r}
joined_data %>%
  filter(row == "a")
```
# A tibble: 48 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 38 more rows

Filter out samples with OD > 0.5

```{r}
joined_data %>%
  filter(od > 0.5)
```
# A tibble: 20 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.534        1 b          7 jensenii  OA          400
 2 0.61         1 f          5 jensenii  VCA         400
 3 0.626        1 f          6 jensenii  VCA         400
 4 0.627        1 f          7 jensenii  VCA         400
 5 0.747        3 b          2 gasseri   OA          400
 6 0.769        3 b          3 gasseri   OA          400
 7 0.75         3 b          4 gasseri   OA          400
 8 0.652        3 b          7 vaginalis OA          400
 9 0.631        3 c          2 gasseri   OA          200
10 0.607        3 c          3 gasseri   OA          200
11 0.624        3 c          4 gasseri   OA          200
12 0.908        3 f          2 gasseri   VCA         400
13 0.924        3 f          3 gasseri   VCA         400
14 0.867        3 f          4 gasseri   VCA         400
15 0.73         3 g          2 gasseri   VCA         200
16 0.764        3 g          3 gasseri   VCA         200
17 0.725        3 g          4 gasseri   VCA         200
18 0.688        4 b          2 gasseri   VCA         100
19 0.636        4 b          3 gasseri   VCA         100
20 0.591        4 b          4 gasseri   VCA         100

Pick columns from a table: select()

Pick columns plate_id, and od

```{r}
joined_data %>%
  select(plate_id, od)
```
# A tibble: 384 × 2
   plate_id    od
      <dbl> <dbl>
 1        1 0.206
 2        1 0.171
 3        1 0.136
 4        1 0.131
 5        1 0.137
 6        1 0.14 
 7        1 0.144
 8        1 0.126
 9        1 0.13 
10        1 0.127
# ℹ 374 more rows

Rename columns and subset with select

```{r}
joined_data %>%
  select(plate_id, optical_density=od)
```
# A tibble: 384 × 2
   plate_id optical_density
      <dbl>           <dbl>
 1        1           0.206
 2        1           0.171
 3        1           0.136
 4        1           0.131
 5        1           0.137
 6        1           0.14 
 7        1           0.144
 8        1           0.126
 9        1           0.13 
10        1           0.127
# ℹ 374 more rows

Sort the rows in a table: arrange()

Sort samples by OD ascending

```{r}
joined_data %>%
  arrange(od)
```
# A tibble: 384 × 7
      od plate_id row   column bug      condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>
 1 0.113        1 h         11 CTRL     CTRL          0
 2 0.113        1 h         12 CTRL     CTRL          0
 3 0.113        2 h         11 CTRL     CTRL          0
 4 0.114        1 e         12 CTRL     OA           50
 5 0.114        1 h          1 CTRL     CTRL          0
 6 0.114        2 h          1 CTRL     CTRL          0
 7 0.114        2 h         12 CTRL     CTRL          0
 8 0.115        1 d         12 CTRL     OA          100
 9 0.115        2 a         11 CTRL     CTRL          0
10 0.115        2 h          5 jensenii CTRL          0
# ℹ 374 more rows

Sort samples by OD, descending

```{r}
joined_data %>%
  arrange(desc(od))
```
# A tibble: 384 × 7
      od plate_id row   column bug     condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>   <chr>     <dbl>
 1 0.924        3 f          3 gasseri VCA         400
 2 0.908        3 f          2 gasseri VCA         400
 3 0.867        3 f          4 gasseri VCA         400
 4 0.769        3 b          3 gasseri OA          400
 5 0.764        3 g          3 gasseri VCA         200
 6 0.75         3 b          4 gasseri OA          400
 7 0.747        3 b          2 gasseri OA          400
 8 0.73         3 g          2 gasseri VCA         200
 9 0.725        3 g          4 gasseri VCA         200
10 0.688        4 b          2 gasseri VCA         100
# ℹ 374 more rows

Counting things

To demonstrate counting, let’s switch to metadata

```{r}
joined_data
```
# A tibble: 384 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 374 more rows

Counting things

```{r}
joined_data %>%
  count(bug)
```
# A tibble: 7 × 2
  bug           n
  <chr>     <int>
1 CTRL         96
2 crispatus    48
3 gasseri      48
4 iners        48
5 jensenii     48
6 piotii       48
7 vaginalis    48

Counting things

```{r}
joined_data %>%
  count(condition, conc, bug)
```
# A tibble: 91 × 4
   condition  conc bug           n
   <chr>     <dbl> <chr>     <int>
 1 CTRL          0 CTRL         24
 2 CTRL          0 crispatus    12
 3 CTRL          0 gasseri      12
 4 CTRL          0 iners        12
 5 CTRL          0 jensenii     12
 6 CTRL          0 piotii       12
 7 CTRL          0 vaginalis    12
 8 LNA          50 CTRL          6
 9 LNA          50 crispatus     3
10 LNA          50 gasseri       3
# ℹ 81 more rows

Use the pipe to build analysis pipelines

```{r}
joined_data %>%
  filter(bug == "crispatus")
```
# A tibble: 48 × 7
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.171        1 a          2 crispatus CTRL          0
 2 0.136        1 a          3 crispatus CTRL          0
 3 0.131        1 a          4 crispatus CTRL          0
 4 0.453        1 b          2 crispatus OA          400
 5 0.478        1 b          3 crispatus OA          400
 6 0.416        1 b          4 crispatus OA          400
 7 0.383        1 c          2 crispatus OA          200
 8 0.401        1 c          3 crispatus OA          200
 9 0.37         1 c          4 crispatus OA          200
10 0.303        1 d          2 crispatus OA          100
# ℹ 38 more rows

Use the pipe to build analysis pipelines

```{r}
joined_data %>%
  filter(bug == "crispatus") %>%
  filter(conc > 50) %>%
  select(plate_id, bug, condition)
```
# A tibble: 27 × 3
   plate_id bug       condition
      <dbl> <chr>     <chr>    
 1        1 crispatus OA       
 2        1 crispatus OA       
 3        1 crispatus OA       
 4        1 crispatus OA       
 5        1 crispatus OA       
 6        1 crispatus OA       
 7        1 crispatus OA       
 8        1 crispatus OA       
 9        1 crispatus OA       
10        1 crispatus VCA      
# ℹ 17 more rows

Use the pipe to build analysis pipelines

```{r}
joined_data %>%
  filter(bug == "crispatus") %>%
  filter(conc > 50) %>%
  select(plate_id, bug, condition) %>%
  count(condition)
```
# A tibble: 3 × 2
  condition     n
  <chr>     <int>
1 LNA           9
2 OA            9
3 VCA           9

Adding new columns to a table

Make a new table column: mutate()

Example: concentration

The conc column is in units of uM. What if you needed it in mM? What’s the calculation?

```{r}
sample_n(metadata,5)
```
# A tibble: 5 × 6
  plate_id row   column bug    condition  conc
     <dbl> <chr>  <dbl> <chr>  <chr>     <dbl>
1        4 c         11 CTRL   VCA          50
2        4 b          1 CTRL   VCA         100
3        3 g          9 piotii VCA         200
4        3 e         11 CTRL   OA           50
5        3 b          1 CTRL   OA          400

Example: concentration

The conc column is in units of uM. What if you needed it in mM? What’s the calculation?

To get mM you would divide by 1000.

```{r}
sample_n(metadata,5)
```
# A tibble: 5 × 6
  plate_id row   column bug       condition  conc
     <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
1        1 h          9 iners     CTRL          0
2        4 d         12 CTRL      LNA         400
3        1 d          2 crispatus OA          100
4        2 e          5 jensenii  LNA         200
5        2 c          5 jensenii  VCA          50

Example: concentration

The conc column is in units of uM. What if you needed it in mM? What’s the calculation?

To get mM you would divide by 1000.

```{r}
sample_n(metadata,5) %>%
  mutate(conc_mM = conc/1000)
```
# A tibble: 5 × 7
  plate_id row   column bug      condition  conc conc_mM
     <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>   <dbl>
1        2 c          6 jensenii VCA          50    0.05
2        1 g          8 iners    VCA         200    0.2 
3        1 b          9 iners    OA          400    0.4 
4        4 f          3 gasseri  LNA         100    0.1 
5        4 h          3 gasseri  CTRL          0    0   

Make multiple columns at once

```{r}
sample_n(metadata,5) %>%
  mutate(conc_mM = conc/1000, conc_nM = conc*1000)
```
# A tibble: 5 × 8
  plate_id row   column bug       condition  conc conc_mM conc_nM
     <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>   <dbl>   <dbl>
1        2 d         10 iners     LNA         400     0.4  400000
2        4 b          8 piotii    VCA         100     0.1  100000
3        2 a         12 CTRL      CTRL          0     0         0
4        3 h          6 vaginalis CTRL          0     0         0
5        2 f          4 crispatus LNA         100     0.1  100000

Your turn to do some data wrangling

15:00

Write code to answer the following questions?

  1. How many different concentrations of LCFA are tested?

  2. How many different LCFAs are tested on each plate?

  3. What bug has the highest OD seen in all the plates?

  4. What bug has the highest OD when no LCFA is added?

  5. What control well with no bug and no LCFA has the highest OD?

Break

15:00

Aesthetics - the elements of data visualization

Plots map data onto graphical elements.

joined_data`
od plate_id row column bug condition conc
0.206 1 a 1 CTRL CTRL 0
0.171 1 a 2 crispatus CTRL 0
0.136 1 a 3 crispatus CTRL 0
0.131 1 a 4 crispatus CTRL 0
0.137 1 a 5 jensenii CTRL 0
0.140 1 a 6 jensenii CTRL 0

od mapped to y position

conc mapped to color

Commonly used aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

The same data values can be mapped to different aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

We can use many different aesthetics at once

Creating aesthetic mappings in ggplot

We define the mapping with aes()

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter()
```

The geom determines how the data is shown

```{r}
ggplot(joined_data, mapping = aes(x = conc, y = od, color = bug)) +
  geom_point()
```

The geom determines how the data is shown

```{r}
ggplot(joined_data, mapping = aes(x = as.factor(conc), y = od, color = bug)) +
  geom_boxplot()
```

The geom determines how the data is shown

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter()
```

Different geoms have parameters for control

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3)
```

Different geoms have parameters for control

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3, width=10)
```

Aesthetics can also be used as parameters in geoms

```{r}
#| output-location: column
ggplot(joined_data, mapping = aes(x = as.factor(conc), y = od, color = bug)) +
  geom_boxplot(fill="orange")
```

Faceting makes subplots for comparisons.

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3, width=10) +
  facet_wrap(~condition)
```

Faceting makes subplots for comparisons.

```{r}
joined_data %>%
  ggplot(mapping = aes(x = conc, y = od, color = bug)) +
  geom_jitter(size=3, width=10) +
  facet_wrap(~condition+bug, ncol=6)
```

Visualizing distributions

Histograms and density plots

age sex class survived
0.17 female 3rd survived
0.33 male 3rd died
0.80 male 2nd survived
0.83 male 2nd survived
0.83 male 3rd survived
0.92 male 1st survived
1.00 female 2nd survived
1.00 female 3rd survived
1.00 male 2nd survived
1.00 male 2nd survived
1.00 male 3rd survived
1.50 female 3rd died
age sex class survived
1.5 female 3rd died
2.0 female 1st died
2.0 female 2nd survived
2.0 female 3rd died
2.0 female 3rd died
2.0 male 2nd survived
2.0 male 2nd survived
2.0 male 2nd survived
3.0 female 2nd survived
3.0 female 3rd survived
3.0 male 2nd survived
3.0 male 2nd survived
age sex class survived
3 male 3rd survived
3 male 3rd survived
4 female 2nd survived
4 female 2nd survived
4 female 3rd survived
4 female 3rd survived
4 male 1st survived
4 male 3rd died
4 male 3rd survived
5 female 3rd survived
5 female 3rd survived
5 male 3rd died

Histograms depend on the chosen bin width

Histograms and density plots in ggplot2

Making histograms with ggplot: geom_histogram()

```{r}
ggplot(titanic, aes(age)) +
  geom_histogram()
```

Setting the bin width

```{r}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5)
```

Do you like where there bins are? What does the first bin say?

Always set the center as well, to half the bin_width

```{r}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center=2.5)
```

Setting center 2.5 makes the bars start 0-5, 5-10, etc. instead of 2.5-7.5, etc. You could instead use the argument boundary=5 to accomplish the same behavior.

Making density plots with ggplot: geom_density()

```{r}
ggplot(titanic, aes(age)) +
  geom_density(fill = "skyblue")
```

Making density plots with ggplot: geom_density()

```{r}
ggplot(titanic, aes(age)) +
  geom_density()
```

without fill

Boxplots: Showing values along y, conditions along x

A boxplot is a crude way of visualizing a distribution.

How to read a boxplot

If you like density plots, consider violins

A violin plot is a density plot rotated 90 degrees and then mirrored.

How to read a violin plot

Making boxplots, violins, etc. in ggplot2

Making boxplots, violins, etc. in ggplot2

Plot type Geom Notes
boxplot geom_boxplot()
violin plot geom_violin()
strip chart geom_point() Jittering requires position_jitter()
sina plot geom_sina() From package ggforce
scatter-density plot geom_quasirandom() From package ggbeeswarm
ridgeline geom_density_ridges() From package ggridges

Examples: Boxplot

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_boxplot(fill = "skyblue") 
```

Examples: Violins

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_violin(fill = "skyblue") 
```

Examples: Strip chart (no jitter)

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_point(color = "skyblue") 
```

Examples: ggbeeswarm’s geom_quasirandom

```{r}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  ggbeeswarm::geom_quasirandom(color = "skyblue") 
```

Group Pen and Paper exercise

10:00

Get with a group of 2-3 people. Go to the activity and pick an option to do together.

Analyze subsets: group_by() and summarize()

Example application of grouping: Counting

Previously we used count, now we group the data

```{r}
joined_data %>%
  group_by(bug, conc, condition)
```
# A tibble: 384 × 7
# Groups:   bug, conc, condition [91]
      od plate_id row   column bug       condition  conc
   <dbl>    <dbl> <chr>  <dbl> <chr>     <chr>     <dbl>
 1 0.206        1 a          1 CTRL      CTRL          0
 2 0.171        1 a          2 crispatus CTRL          0
 3 0.136        1 a          3 crispatus CTRL          0
 4 0.131        1 a          4 crispatus CTRL          0
 5 0.137        1 a          5 jensenii  CTRL          0
 6 0.14         1 a          6 jensenii  CTRL          0
 7 0.144        1 a          7 jensenii  CTRL          0
 8 0.126        1 a          8 iners     CTRL          0
 9 0.13         1 a          9 iners     CTRL          0
10 0.127        1 a         10 iners     CTRL          0
# ℹ 374 more rows

Example application of grouping: Counting

Previously we used count, now we group the data, and then summarise

```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    n = n() # n() returns the number of observations per group
    )
```
# A tibble: 91 × 4
# Groups:   bug, conc [35]
   bug    conc condition     n
   <chr> <dbl> <chr>     <int>
 1 CTRL      0 CTRL         24
 2 CTRL     50 LNA           6
 3 CTRL     50 OA            6
 4 CTRL     50 VCA           6
 5 CTRL    100 LNA           6
 6 CTRL    100 OA            6
 7 CTRL    100 VCA           6
 8 CTRL    200 LNA           6
 9 CTRL    200 OA            6
10 CTRL    200 VCA           6
# ℹ 81 more rows

Example application of grouping: Medians

```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    median_od = median(od) 
    )
```
# A tibble: 91 × 4
# Groups:   bug, conc [35]
   bug    conc condition median_od
   <chr> <dbl> <chr>         <dbl>
 1 CTRL      0 CTRL          0.117
 2 CTRL     50 LNA           0.118
 3 CTRL     50 OA            0.118
 4 CTRL     50 VCA           0.119
 5 CTRL    100 LNA           0.120
 6 CTRL    100 OA            0.118
 7 CTRL    100 VCA           0.119
 8 CTRL    200 LNA           0.122
 9 CTRL    200 OA            0.122
10 CTRL    200 VCA           0.122
# ℹ 81 more rows

You can make multiple summaries at once

```{r}
joined_data %>%
  group_by(bug, conc, condition) %>%
  summarise(
    n = n(), 
    median_od = median(od) 
    )
```
# A tibble: 91 × 5
# Groups:   bug, conc [35]
   bug    conc condition     n median_od
   <chr> <dbl> <chr>     <int>     <dbl>
 1 CTRL      0 CTRL         24     0.117
 2 CTRL     50 LNA           6     0.118
 3 CTRL     50 OA            6     0.118
 4 CTRL     50 VCA           6     0.119
 5 CTRL    100 LNA           6     0.120
 6 CTRL    100 OA            6     0.118
 7 CTRL    100 VCA           6     0.119
 8 CTRL    200 LNA           6     0.122
 9 CTRL    200 OA            6     0.122
10 CTRL    200 VCA           6     0.122
# ℹ 81 more rows

Get background OD for each media on each plate

Make a code block and make a variable called media_background_medians that has one row for every combination of plate, lcfa, and conc that gives the median OD measured for those conditions.

Bonus: make a histogram of the media backgrounds (before summarising) for each condition and concentration. Try with a facet by plate and without.

Get no-LCFA OD for each bug on each plate

Now make a variable called bug_no_lcfa_control. Join the media_background_medians to the joined_data, and mutate a column that caluclate the od - media_background. Filter this table so it only has the no LCFA control conditions for each bug on each plate. Then group_by bug and plate_id and get the median of each background subtracted od.

Bonus: make a histogram before summarising of the ods with background subtracted for each bug on each plate.

Challenge: calculate relative growths

Make a table that has the relative growth compared to no LCFA of each bug for each concentration. (Hint: There should be three rows per condition+bug+concentration.) Make a plot showing the relative growths. (e.g. x axis is concentration, y axis relative growth, facet by bug+concentration, and pick a geom to use to show the data.)

Investigate why some relative growths are so high. Think about how you might tweak the analysis to handle that issue.

Reshape: pivot_wider() and pivot_longer()

Reshaping example: Making a wide summary table

```{r}
joined_data %>%
  count(plate_id, bug, conc, condition) %>%
  pivot_wider(names_from = plate_id, values_from = n)
```
# A tibble: 91 × 7
   bug        conc condition   `1`   `2`   `3`   `4`
   <chr>     <dbl> <chr>     <int> <int> <int> <int>
 1 CTRL          0 CTRL          6     6     6     6
 2 CTRL         50 OA            3    NA     3    NA
 3 CTRL        100 OA            3    NA     3    NA
 4 CTRL        200 OA            3    NA     3    NA
 5 CTRL        200 VCA           3    NA     3    NA
 6 CTRL        400 OA            3    NA     3    NA
 7 CTRL        400 VCA           3    NA     3    NA
 8 crispatus     0 CTRL          6     6    NA    NA
 9 crispatus    50 OA            3    NA    NA    NA
10 crispatus   100 OA            3    NA    NA    NA
# ℹ 81 more rows

Reshaping example: Making a wide summary table

```{r}
joined_data %>%
  count(plate_id, bug, conc, condition) %>%
  pivot_wider(names_from = bug, values_from = n)
```
# A tibble: 28 × 10
   plate_id  conc condition  CTRL crispatus iners jensenii gasseri piotii
      <dbl> <dbl> <chr>     <int>     <int> <int>    <int>   <int>  <int>
 1        1     0 CTRL          6         6     6        6      NA     NA
 2        1    50 OA            3         3     3        3      NA     NA
 3        1   100 OA            3         3     3        3      NA     NA
 4        1   200 OA            3         3     3        3      NA     NA
 5        1   200 VCA           3         3     3        3      NA     NA
 6        1   400 OA            3         3     3        3      NA     NA
 7        1   400 VCA           3         3     3        3      NA     NA
 8        2     0 CTRL          6         6     6        6      NA     NA
 9        2    50 LNA           3         3     3        3      NA     NA
10        2    50 VCA           3         3     3        3      NA     NA
# ℹ 18 more rows
# ℹ 1 more variable: vaginalis <int>

combining datasets: joins

We use joins to add columns from one table into another

There are different types of joins

The differences are all about how to handle when the two tables have different key values

left_join() - the resulting table always has the same key_values as the “left” table

right_join() - the resulting table always has the same key_values as the “right” table

inner_join() - the resulting table always only keeps the key_values that are in both tables

full_join() - the resulting table always has all key_values found in both tables

Left Join

left_join() - the resulting table always has the same key_values as the “left” table

inner_join

inner_join() - the resulting table always only keeps the key_values that are in both tables

table_a %>% inner_join(table_b)

Note, merging tables vertically is bind_rows(), not a join

table_a %>% bind_rows(table_b)