class: center, middle, inverse, title-slide

.title[
# Design and Analysis of Experiments Using R
]
.subtitle[
## Completely Randomised Design (Continued)
]
.author[
### Olga Lyashevska
]
.date[
### 2022-10-06
]

---

# This week

<img src="figs/roadmap-crd.png" style="width: 90%" />

---

# CRD

.font150[
- one treatment factor;
- `\(n\)` experimental units;
- `\(g\)` levels of the treatment factor;
- each group is subject to one of the unique levels of the treatment factor;
- the experimental units are homogeneous.
]

---

# Linear model

.font140[
Models for experimental data have two basic parts:

- *structure for the means* - the average or expected values for the data;
- *structure for the errors* - how the data vary around the treatment means.

The model is linear because the response variable is a linear function of the parameters.
]

---

# Models for the mean response

.font140[
Most of our inference is about the treatment means (the first part):

- Any evidence that the means are not all the same?
- Which ones differ?
- Any pattern in the differences?
- How can the differences be described?
- Estimates/confidence intervals of means and differences.
- Sometimes variances are more interesting, or quantiles such as the median, or other things like the number of modes.
- Sometimes it is the extremes that are of interest.
]

---

# Model specification: separate means

.font140[
A simple statistical model can be written as:

`$$\large y_{ij}=\mu_{i}+\epsilon_{ij}$$`

Where `\(y_{ij}\)` is the response of the `\(j\)`th experimental unit at the `\(i\)`th level of the treatment factor; `\(\mu_i\)` is a different mean for each level of the treatment factor `\(i\)`; `\(\epsilon_{ij}\)` is an error.

This is the full model.
]

---

# Example

.font140[

```r
dat<-read.csv("outputs/CholesterolExperimentFilled.csv")
summary(dat)
```

```
#     subject          dose          chol      
#  Min.   : 1.0   Min.   :  0   Min.   :180.0  
#  1st Qu.: 4.5   1st Qu.:  0   1st Qu.:195.0  
#  Median : 8.0   Median : 50   Median :200.0  
#  Mean   : 8.0   Mean   : 50   Mean   :211.3  
#  3rd Qu.:11.5   3rd Qu.:100   3rd Qu.:230.0  
#  Max.   :15.0   Max.   :100   Max.   :250.0  
```

```r
tapply(X=dat$chol,INDEX=dat$dose,FUN=mean)
```

```
#   0  50 100 
# 210 216 208 
```
]

---

# Model specification: single mean

.font140[
Alternative mean structures model the response as a function of the dose `\(x_i\)`:

`$$y_{ij}=\beta_{0} + \beta_{1}x_{i}+\epsilon_{ij}$$`

`$$y_{ij}=\beta_{0} + \beta_{1}x_{i}+\beta_{2}x_{i}^2+\epsilon_{ij}$$`

The single mean model, `\(y_{ij}=\mu+\epsilon_{ij}\)`, is a special case of the separate means model, a reduced or restricted version.
]

---

# Example

.font140[

```r
mod <- lm(chol~as.factor(dose), data=dat)
summary(mod)
```

```
# 
# Call:
# lm(formula = chol ~ as.factor(dose), data = dat)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#    -36    -14     -8     20     34 
# 
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)    
# (Intercept)          210.00      11.55  18.187 4.22e-10 ***
# as.factor(dose)50      6.00      16.33   0.367    0.720    
# as.factor(dose)100    -2.00      16.33  -0.122    0.905    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 25.82 on 12 degrees of freedom
# Multiple R-squared:  0.02121, Adjusted R-squared:  -0.1419 
# F-statistic: 0.13 on 2 and 12 DF,  p-value: 0.8793
```
]

---

# Assumptions

.font140[
The full model assumes that every treatment group has its own mean response `\(\mu_i\)`.

Combined with the error structure, the separate means model implies that all the data are independent and normally distributed with constant variance:

`$$y_{ij} \sim N(\mu_{i}, \sigma^2)$$`

`\(\sim N(\mu_{i},\sigma^2)\)` denotes "has a normal distribution with mean `\(\mu_{i}\)` and variance `\(\sigma^2\)`". Here each treatment group may have its own mean.
]
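
.font130[
One way to check these assumptions (a minimal sketch, reusing the `mod` object fitted in the earlier example) is to look at the standard residual diagnostic plots:

```r
# residuals vs fitted values: looks for non-zero mean and non-constant variance
plot(mod, which = 1)
# normal Q-Q plot of the residuals: looks for departures from Normality
plot(mod, which = 2)
```
]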
---

# Assumptions

.font150[
We assume that the errors `\(\epsilon_{ij}\)` (deviations from the treatment means):

- are independent for different data values;
- have mean zero;
- have the same variance;
- follow a Normal distribution.

`$$\epsilon_{ij}\stackrel{iid}{\sim}N(0, \sigma^{2})$$`

(iid: independent and identically distributed)

The Normal distribution of the errors is needed for inference.
]

---

# Group means and Overall mean

.font130[
The group means `\(\mu_{i}\)` can be expressed as the sum of an overall mean `\(\mu^*\)` and a treatment effect `\(\alpha_{i}\)`, hence

`$$\mu_{i}=\mu^*+\alpha_{i}$$`

One common choice is to define `\(\mu^*\)` as the mean of the `\(g\)` treatment means:

`$$\mu^*=\sum_{i=1}^{g}\mu_{i}/g$$`

With this choice the sum of the treatment effects is zero:

`$$\sum_{i=1}^{g}\alpha_{i}=0$$`

This will be useful later for factorial designs.
]

---

# Weighted mean

.font130[
Alternatively, we can define `\(\mu^*\)` as the weighted average of the treatment means, with the sample sizes `\(n_{i}\)` used as weights:

`$$\mu^*=\sum_{i=1}^{g}n_{i}\mu_{i}/N$$`

The weighted sum of the treatment effects is then zero:

`$$\sum_{i=1}^{g}n_{i}\alpha_{i}=0$$`

When the sample sizes are equal, these two choices coincide.

The restriction that the treatment effects `\(\alpha_i\)` add to zero (either weighted or not) implies that the treatment effects are not completely free to vary.
]

---

# Estimation of Parameters

.font140[
Total Sum of Squares

We usually choose model parameters so as to minimize the sum of squared residuals.

The total sum of squares in the data `\(SS_T\)` is the sum of the model or explained sum of squares `\(SS_M\)` plus the error or residual sum of squares `\(SS_E\)`:

`$$SS_T = SS_M + SS_E$$`

For a fixed set of data, if you change the model and make one `\(SS\)` bigger, then the other must get smaller.

`\(SS_M\)` is the variability explained by allowing the mean to differ between dose groups; `\(SS_E\)` is the variability which is left over, or not explained.
]

---

# Total Sum of Squares

.font130[
The total sum of squares is:

`$$SS_{T} = \sum_{i=1}^{g}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{..})^2$$`

where `\(\bar{y}_{..}\)`, the grand mean of all observations, is

`$$\bar{y}_{..} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_{i}}y_{ij}$$`

`\(SS_{T}\)` represents the total variability in the responses about this overall common mean `\(\bar{y}_{..}\)`.
]

---

# Model and Error Sum of Squares

.font140[
For `\(\large y_{ij}=\mu_{i}+\epsilon_{ij}\)` the Model Sum of Squares is the variability explained by allowing each group its own mean:

`$$SS_{M} = \sum_{i=1}^{g}\sum_{j=1}^{n_{i}}(\bar{y}_{i.}-\bar{y}_{..})^2$$`

And the Error Sum of Squares is the variability that is left over, i.e. the variability within groups:

`$$SS_{E} = \sum_{i=1}^{g}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i.})^2$$`
]

---

# Which model to use?

.font140[
- Test whether the means are all the same or whether some of them differ.
- Check whether the data can be adequately described by the single mean model, or whether the model with separate treatment group means is needed.
- Recall that the single mean model is a special case of the group means model.

Analysis of Variance (ANOVA) is a method for comparing the fit of two models, one a reduced version of the other (see the sketch below).
]
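
.font130[
As a concrete illustration (a sketch, not in the original slides; `mod_reduced` and `mod_full` are names chosen here), the single mean (reduced) model and the separate means (full) model can be fitted and compared directly with `anova()`:

```r
# reduced model: one common mean for all dose groups
mod_reduced <- lm(chol ~ 1, data = dat)
# full model: a separate mean for each dose group
mod_full <- lm(chol ~ as.factor(dose), data = dat)
# F-test comparing the two nested models
anova(mod_reduced, mod_full)
```
]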
---

# Hypothesis test of no treatment effect

.font140[
`\(H_{0}: \mu_{1}=\mu_{2}=\dots=\mu_{g}\)`

The alternative `\(H_{A}\)`: at least one `\(\mu_{i}\)` is different.

If `\(H_{0}\)` is true, the model `\(\large y_{ij}=\mu_{i}+\epsilon_{ij}\)` simplifies to

`$$\large y_{ij}=\mu+\epsilon_{ij}$$`

which can be represented as a single normal distribution with

`$$y_{ij} \sim N(\mu, \sigma^2)$$`
]

---

# Analysis of Variance table

.font140[
Under `\(H_{0}: \mu_{1}=\mu_{2}=\dots=\mu_{g}\)`, both `\(SS_{M}\)` and `\(SS_{E}\)` (divided by `\(\sigma^2\)`) follow a `\(\chi^{2}\)` [distribution](https://en.wikipedia.org/wiki/Chi-square_distribution).

These Sums of Squares, and the corresponding mean squares formed by dividing each Sum of Squares by its degrees of freedom, are usually presented in ANOVA tables.

<img src="figs/anova-Sum-Squares.png" style="width: 90%" />

Under the null hypothesis, the F-ratio follows the [F-distribution](https://en.wikipedia.org/wiki/F-distribution) with `\(g-1\)` and `\(N-g\)` [degrees of freedom](https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-degrees-of-freedom-in-statistics).
]

---

# Example

.font130[

```r
mod<-aov(chol~as.factor(dose), data=dat)
summary(mod)
```

```
#                 Df Sum Sq Mean Sq F value Pr(>F)
# as.factor(dose)  2    173    86.7    0.13  0.879
# Residuals       12   8000   666.7               
```

`\(SS_{M}\)` is on the line labeled `as.factor(dose)` and `\(SS_{E}\)` is on the line labeled `Residuals`. We can compute `\(SS_{T}\)` by adding them.
]

--

.font130[
Confirm that the degrees of freedom and F-value are correct.

The last column, labeled `Pr(>F)`, is the probability of exceeding the calculated F-value if the null hypothesis is true. This is called the p-value.
]

---

# Accept or reject?

.font140[
- Choose the significance level `\(\alpha\)` for the hypothesis test.
- Reject the hypothesis `\(H_{0}\)` if the `Pr(>F)` value in the `aov` output is less than the chosen value of `\(\alpha\)`.
]

--

.font140[
For the cholesterol experiment there are no significant differences among the mean cholesterol levels of the dose groups at the significance level `\(\alpha = 0.05\)`, since 0.879 > 0.05.

We accept `\(H_{0}\)` and conclude that the means `\(\mu_{1}, \mu_{2}, \dots, \mu_{g}\)` are equal.
]

---

# Occam’s razor

.font140[
We seek the simplest model consistent with the data.

*"All treatments have the same mean"* is simpler than *"Each treatment has its own mean."*

If we cannot show that the more complicated model is needed, we take the simple model.
]

---

# Model is an approximation

.font140[
A model is a gross approximation to reality.

*All models are wrong; some models are useful.* — George Box

We do not believe any model is really true, but if the data are consistent with it, we use it. That is why checking the assumptions is essential.
]

---

# Lab

.font130[
Read a built-in dataset:

```r
# list built-in datasets
data()
# get help on a specific dataset
?PlantGrowth
# load the dataset into the workspace
data(PlantGrowth)
# list objects in the environment
ls()
# remove all objects (commented out)
# rm(list=ls())
# see the structure
str(PlantGrowth)
# see the dimensions
dim(PlantGrowth)
# see summary statistics
summary(PlantGrowth)
```
]
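
.font130[
As a possible follow-up (a sketch, not part of the original lab; `mod_pg` is a name chosen here), the same CRD analysis used for the cholesterol data can be applied to `PlantGrowth`, which records plant `weight` for a control and two treatment groups:

```r
# fit the separate means (group means) model
mod_pg <- aov(weight ~ group, data = PlantGrowth)
# ANOVA table: F-test of no treatment effect
summary(mod_pg)
```
]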