class: center, middle, inverse, title-slide

.title[
# Design and Analysis of Experiments Using R
]
.subtitle[
## Completely Randomised Design (Continued)
]
.author[
### Olga Lyashevska
]
.date[
### 2022-10-06
]

---

# This week

<img src="figs/roadmap-crd.png" style="width: 90%" />

---

# CRD

.font150[
- one treatment factor;
- `\(n\)` experimental units;
- `\(g\)` levels of the treatment factor;
- each group is subject to one of the unique levels of the treatment factor;
- the experimental units are homogeneous.
]

---

# Linear model

.font140[
Models for experimental data have two basic parts:

- *structure for the means* - the average or expected values for the data;
- *structure for the errors* - how the data vary around the treatment means.

The model is linear because the response variable is a linear function of the parameters.
]

---

# Models for the mean response

.font140[
Most of our inference is about the treatment means (the first part):

- Any evidence that the means are not all the same?
- Which ones differ?
- Any pattern in the differences?
- How can the differences be described?
- Estimates/confidence intervals of means and differences.
- Sometimes variances are more interesting, or quantiles such as the median, or other things like the number of modes.
- Sometimes it is the extremes that are of interest.
]

---

# Model specification: separate means

.font140[
A simple statistical model can be written as:

`$$\large y_{ij}=\mu_{i}+\epsilon_{ij}$$`

Where `\(y_{ij}\)` is the response of the `\(j\)`th experimental unit at the `\(i\)`th level of the treatment factor; `\(\mu_i\)` is a different mean for each level of the treatment factor `\(i\)`; `\(\epsilon_{ij}\)` is an error.

This is the full model.
]

---

# Example

.font140[

```r
dat<-read.csv("outputs/CholesterolExperimentFilled.csv")
summary(dat)
```

```
#     subject          dose          chol      
#  Min.   : 1.0   Min.   :  0   Min.   :180.0  
#  1st Qu.: 4.5   1st Qu.:  0   1st Qu.:195.0  
#  Median : 8.0   Median : 50   Median :200.0  
#  Mean   : 8.0   Mean   : 50   Mean   :211.3  
#  3rd Qu.:11.5   3rd Qu.:100   3rd Qu.:230.0  
#  Max.   :15.0   Max.   :100   Max.   :250.0  
```

```r
tapply(X=dat$chol,INDEX=dat$dose,FUN=mean)
```

```
#   0  50 100 
# 210 216 208 
```
]

---

# Model specification: single mean

.font140[
Alternative mean structures model the response as a function of the dose `\(x_i\)`:

`$$y_{ij}=\beta_{0} + \beta_{1}x_{i}+\epsilon_{ij}$$`

`$$y_{ij}=\beta_{0} + \beta_{1}x_{i}+\beta_{2}x_{i}^2+\epsilon_{ij}$$`

The single mean model, `\(y_{ij}=\mu+\epsilon_{ij}\)`, is a special case of the separate means model, a reduced or restricted version.
]

---

# Example

.font140[

```r
mod <- lm(chol~as.factor(dose), data=dat)
summary(mod)
```

```
# 
# Call:
# lm(formula = chol ~ as.factor(dose), data = dat)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#    -36    -14     -8     20     34 
# 
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)    
# (Intercept)          210.00      11.55  18.187 4.22e-10 ***
# as.factor(dose)50      6.00      16.33   0.367    0.720    
# as.factor(dose)100    -2.00      16.33  -0.122    0.905    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 25.82 on 12 degrees of freedom
# Multiple R-squared:  0.02121, Adjusted R-squared:  -0.1419 
# F-statistic: 0.13 on 2 and 12 DF,  p-value: 0.8793
```
]

---

# Assumptions

.font140[
The full model assumes that every treatment group has its own mean response `\(\mu_i\)`.

Combined with the error structure, the separate means model implies that all the data are independent and normally distributed with constant variance:

`$$y_{ij} \sim N(\mu_{i}, \sigma^2)$$`

`\(\sim N(\mu_{i},\sigma^2)\)` denotes "has a normal distribution with mean `\(\mu_{i}\)` and variance `\(\sigma^2\)`". Here each treatment group may have its own mean.
]
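
.font130[
One way to check these assumptions (a minimal sketch, reusing the `mod` object fitted in the earlier example) is to look at the standard residual diagnostic plots:

```r
# residuals vs fitted values: looks for non-zero mean and non-constant variance
plot(mod, which = 1)
# normal Q-Q plot of the residuals: looks for departures from Normality
plot(mod, which = 2)
```
]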
---

# Assumptions

.font150[
We assume that the errors `\(\epsilon_{ij}\)` (deviations from the treatment means):

- are independent for different data values;
- have mean zero;
- have the same variance;
- follow a Normal distribution.

`$$\epsilon_{ij}\stackrel{iid}{\sim}N(0, \sigma^{2})$$`

(iid: independent and identically distributed)

The Normal distribution of the errors is needed for inference.
]

---

# Group means and Overall mean

.font130[
The group means `\(\mu_{i}\)` can be expressed as the sum of an overall mean `\(\mu^*\)` and a treatment effect `\(\alpha_{i}\)`, hence

`$$\mu_{i}=\mu^*+\alpha_{i}$$`

One common choice is to define `\(\mu^*\)` as the mean of the `\(g\)` treatment means:

`$$\mu^*=\sum_{i=1}^{g}\mu_{i}/g$$`

With this choice the sum of the treatment effects is zero:

`$$\sum_{i=1}^{g}\alpha_{i}=0$$`

This will be useful later for factorial designs.
]

---

# Weighted mean

.font130[
Alternatively, we can define `\(\mu^*\)` as the weighted average of the treatment means, with the sample sizes `\(n_{i}\)` used as weights:

`$$\mu^*=\sum_{i=1}^{g}n_{i}\mu_{i}/N$$`

The weighted sum of the treatment effects is then zero:

`$$\sum_{i=1}^{g}n_{i}\alpha_{i}=0$$`

When the sample sizes are equal, these two choices coincide.

The restriction that the treatment effects `\(\alpha_i\)` add to zero (either weighted or not) implies that the treatment effects are not completely free to vary.
]

---

# Estimation of Parameters

.font140[
Total Sum of Squares

We usually choose model parameters so as to minimize the sum of squared residuals.

The total sum of squares in the data `\(SS_T\)` is the sum of the model or explained sum of squares `\(SS_M\)` plus the error or residual sum of squares `\(SS_E\)`:

`$$SS_T = SS_M + SS_E$$`

For a fixed set of data, if you change the model and make one `\(SS\)` bigger, then the other must get smaller.

`\(SS_M\)` is the variability explained by allowing the mean to differ between dose groups; `\(SS_E\)` is the variability which is left over, or not explained.
]

---

# Total Sum of Squares

.font130[
The total sum of squares is:

`$$SS_{T} = \sum_{i=1}^{g}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{..})^2$$`

where `\(\bar{y}_{..}\)`, the grand mean of all observations, is

`$$\bar{y}_{..} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_{i}}y_{ij}$$`

`\(SS_{T}\)` represents the total variability in the responses about this overall common mean `\(\bar{y}_{..}\)`.
]

---

# Model and Error Sum of Squares

.font140[
For `\(\large y_{ij}=\mu_{i}+\epsilon_{ij}\)` the Model Sum of Squares is the variability explained by allowing each group its own mean:

`$$SS_{M} = \sum_{i=1}^{g}\sum_{j=1}^{n_{i}}(\bar{y}_{i.}-\bar{y}_{..})^2$$`

And the Error Sum of Squares is the variability that is left over, i.e. the variability within groups:

`$$SS_{E} = \sum_{i=1}^{g}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i.})^2$$`
]

---

# Which model to use?

.font140[
- Test whether the means are all the same or whether some of them differ.
- Check whether the data can be adequately described by the single mean model, or whether the model with separate treatment group means is needed.
- Recall that the single mean model is a special case of the group means model.

Analysis of Variance (ANOVA) is a method for comparing the fit of two models, one a reduced version of the other (see the sketch below).
]
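
.font130[
As a concrete illustration (a sketch, not in the original slides; `mod_reduced` and `mod_full` are names chosen here), the single mean (reduced) model and the separate means (full) model can be fitted and compared directly with `anova()`:

```r
# reduced model: one common mean for all dose groups
mod_reduced <- lm(chol ~ 1, data = dat)
# full model: a separate mean for each dose group
mod_full <- lm(chol ~ as.factor(dose), data = dat)
# F-test comparing the two nested models
anova(mod_reduced, mod_full)
```
]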
---

# Hypothesis test of no treatment effect

.font140[
`\(H_{0}: \mu_{1}=\mu_{2}=\dots=\mu_{g}\)`

The alternative `\(H_{A}\)`: at least one `\(\mu_{i}\)` is different.

If `\(H_{0}\)` is true, the model `\(\large y_{ij}=\mu_{i}+\epsilon_{ij}\)` simplifies to

`$$\large y_{ij}=\mu+\epsilon_{ij}$$`

which can be represented as a single normal distribution with

`$$y_{ij} \sim N(\mu, \sigma^2)$$`
]

---

# Analysis of Variance table

.font140[
Under `\(H_{0}: \mu_{1}=\mu_{2}=\dots=\mu_{g}\)`, both `\(SS_{M}\)` and `\(SS_{E}\)` (divided by `\(\sigma^2\)`) follow a `\(\chi^{2}\)` [distribution](https://en.wikipedia.org/wiki/Chi-square_distribution).

These Sums of Squares, and the corresponding mean squares formed by dividing each Sum of Squares by its degrees of freedom, are usually presented in ANOVA tables.

<img src="figs/anova-Sum-Squares.png" style="width: 90%" />

Under the null hypothesis, the F-ratio follows the [F-distribution](https://en.wikipedia.org/wiki/F-distribution) with `\(g-1\)` and `\(N-g\)` [degrees of freedom](https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-degrees-of-freedom-in-statistics).
]

---

# Example

.font130[

```r
mod<-aov(chol~as.factor(dose), data=dat)
summary(mod)
```

```
#                 Df Sum Sq Mean Sq F value Pr(>F)
# as.factor(dose)  2    173    86.7    0.13  0.879
# Residuals       12   8000   666.7               
```

`\(SS_{M}\)` is on the line labeled `as.factor(dose)` and `\(SS_{E}\)` is on the line labeled `Residuals`. We can compute `\(SS_{T}\)` by adding them.
]

--

.font130[
Confirm that the degrees of freedom and F-value are correct.

The last column, labeled `Pr(>F)`, is the probability of exceeding the calculated F-value if the null hypothesis is true. This is called the p-value.
]

---

# Accept or reject?

.font140[
- Choose the significance level `\(\alpha\)` for the hypothesis test.
- Reject the hypothesis `\(H_{0}\)` if the `Pr(>F)` value in the `aov` output is less than the chosen value of `\(\alpha\)`.
]

--

.font140[
For the cholesterol experiment there are no significant differences among the mean cholesterol levels of the dose groups at the significance level `\(\alpha = 0.05\)`, since 0.879 > 0.05.

We accept `\(H_{0}\)` and conclude that the means `\(\mu_{1}, \mu_{2}, \dots, \mu_{g}\)` are equal.
]

---

# Occam’s razor

.font140[
We seek the simplest model consistent with the data.

*"All treatments have the same mean"* is simpler than *"Each treatment has its own mean."*

If we cannot show that the more complicated model is needed, we take the simple model.
]

---

# Model is an approximation

.font140[
A model is a gross approximation to reality.

*All models are wrong; some models are useful.* — George Box

We do not believe any model is really true, but if the data are consistent with it, we use it. That is why checking the assumptions is essential.
]

---

# Lab

.font130[
Read a built-in dataset:

```r
# list built-in datasets
data()
# get help on a specific dataset
?PlantGrowth
# load the dataset into the workspace
data(PlantGrowth)
# list objects in the environment
ls()
# remove all objects (commented out)
# rm(list=ls())
# see the structure
str(PlantGrowth)
# see the dimensions
dim(PlantGrowth)
# see summary statistics
summary(PlantGrowth)
```
]
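
.font130[
As a possible follow-up (a sketch, not part of the original lab; `mod_pg` is a name chosen here), the same CRD analysis used for the cholesterol data can be applied to `PlantGrowth`, which records plant `weight` for a control and two treatment groups:

```r
# fit the separate means (group means) model
mod_pg <- aov(weight ~ group, data = PlantGrowth)
# ANOVA table: F-test of no treatment effect
summary(mod_pg)
```
]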