class: center, middle, inverse, title-slide .title[ # Design and Analysis of Experiments Using R ] .subtitle[ ## Completely Randomised Design ] .author[ ### Olga Lyashevska ] .date[ ### 2022-10-02 ] --- # This week <img src=figs/roadmap-crd.png style="width: 90%" /> --- # Good place to start This is the basic experimental design; everything else is a modification. The CRD is - Easiest to do. - Easiest to analyse. - Most resilient when things go wrong. - Often sufficient. Consider a CRD first when designing. --- # CRD In a completely randomised design with one treatment factor, experimental units ( `\(n\)` ) are divided randomly into groups ( `\(g\)` ). Each group is then subject to one of the unique levels or values of the treatment factor. This design is used when there is only one factor under study and the experimental units are *homogeneous*. --- # Definition Completely randomised design (CRD) recipe: 1. Fix sample sizes `\(n_{1} , n_{2} , . . . , n_{g}\)` with `\(n_{1} + n_{2} + · · · + n_{g} = N\)`. 2. Randomly assign `\(n_{1}\)` units to treatment 1, `\(n_{2}\)` units to treatment 2, etc. All possible arrangements of the `\(N\)` units into `\(G\)` groups with sizes `\(n_{1}\)` though `\(n_{g}\)` *equally likely*. Selection of treatments, experimental units, and responses also important. --- # Example A pharmaceutical company conducts an experiment to test the effect of a new cholesterol medication. The company selects 15 subjects randomly from a larger population. Each subject is randomly assigned to one of three treatment groups. Within each treament group, subjects receive a different dose of the new medication. Group 1 --> 0 mg/day; Group 2 --> 50 mg/day; Group 3 --> 100 mg/day --- # Define - The treatment factor/Independent variable -- ``` Answer: Medication dose ``` -- - The experimental units -- ``` Answer: Subject ``` -- - The response/Dependent variable -- ``` Answer: Level of cholesterol ``` --- # Replication - Replicates -- ``` Answer: 5 subjects per Group ``` -- Why do we want to replicate? -- To calculate the variance of the experimental error from the data, this variance will be compared to the treatment effects. --- # Replication If the variability among the treatment means is less than the experimental error variance, the treatment differences are probably due to differences of the experimental units assigned to each treatment. Without replication it is impossible to tell if treatment differences are real or just a random manifestation of the particular experimental units used in the study. --- # Randomisation The random division of experimental units into groups. Why do we want to randomise? -- To guarantee the validity of the experiment against biases caused by other hidden variables. --- # Exercise Constructing randomised data: ```r set.seed(345) f <- factor(rep(c(0,50,100), each = 5)) # f <- factor(rep(seq(0, 100, by=50), each=5)) levels(f) ``` ``` # [1] "0" "50" "100" ``` ```r fac <- sample(f, 15) eu <- c(1:15) design <- data.frame(subject=eu, dose = fac) write.csv(design, file="outputs/CholesterolExperiment.csv", row.names=FALSE) ``` --- # Exercise ```r print(design, row.names=FALSE) ``` ``` # subject dose # 1 100 # 2 50 # 3 0 # 4 0 # 5 0 # 6 0 # 7 50 # 8 0 # 9 50 # 10 100 # 11 100 # 12 50 # 13 100 # 14 50 # 15 100 ``` --- # Exercise Open `CholesterolExperiment.csv` file and create column `chol` and manually populate it with values. - less than 200 mg/deciliter --> normal - 200-239 --> borderline high - more than 240 --> high Save it as `CholesterolExperimentFilled.csv` -- We can also populate data programmatically. --- # Exploratory Data Analysis Experimental design is a combination of domain knowledge and statistical methods. It is always a good idea to simply look at your data before modeling. For a CRD we have data by group; therefore looking at side-by-side boxplots, and summary statistics by group is a good idea. -- - Simple descriptive statistics such as means, medians, standard errors, interquartile ranges; - Plots, such as Quantile-Quantile plots (QQ-plot), boxplots, and scatterplots; Ask yourself: are data approximately normal? Are the variances roughly constant? --- # Read in file ```r data<-read.csv("outputs/CholesterolExperimentFilled.csv") summary(data) ``` ``` # subject dose chol # Min. : 1.0 Min. : 0 Min. :180.0 # 1st Qu.: 4.5 1st Qu.: 0 1st Qu.:195.0 # Median : 8.0 Median : 50 Median :200.0 # Mean : 8.0 Mean : 50 Mean :211.3 # 3rd Qu.:11.5 3rd Qu.:100 3rd Qu.:230.0 # Max. :15.0 Max. :100 Max. :250.0 ``` --- # Cross tabulation ```r table(data$chol) ``` ``` # # 180 190 200 210 220 240 250 # 2 2 4 1 2 2 2 ``` ```r # table(data$dose) table(data$chol, data$dose) ``` ``` # # 0 50 100 # 180 0 2 0 # 190 1 0 1 # 200 2 0 2 # 210 0 0 1 # 220 1 1 0 # 240 1 0 1 # 250 0 2 0 ``` --- # Measures of central tendency Measures of central tendency describe the middle of the data, where most observations lie. ```r mean(data$chol) ``` ``` # [1] 211.3333 ``` ```r median(data$chol) ``` ``` # [1] 200 ``` ```r quantile(data$chol, prob=0.5) ``` ``` # 50% # 200 ``` --- # Measures of central tendency The mean is commonly used, but it is overly influenced by extreme outliers. Example: Does the mean of this numbers reflect the ‘middle’ of the data? `1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1000` -- ```r mean(c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1000)) ``` ``` # [1] 101.35 ``` ```r median(c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1000)) ``` ``` # [1] 1.55 ``` --- # Measures of spread Measures of spread describe how far the data are scattered around the central tendency. ```r sd(data$chol) ``` ``` # [1] 24.16215 ``` ```r var(data$chol) ``` ``` # [1] 583.8095 ``` ```r IQR(data$chol) ``` ``` # [1] 35 ``` --- # Standard deviation The standard deviation quantifies spread about the mean. Like the mean it is influenced by outliers. The square of the standard deviation is called the variance. ```r var(data$chol) ``` ``` # [1] 583.8095 ``` ```r # same as sd(data$chol)^2 ``` ``` # [1] 583.8095 ``` --- # IQR The inter-quartile range quantify spread about the median and is more robust to outliers than the standard deviation. The inter-quartile range is defined as the difference between the 25% and 75% quantiles. ```r IQR(data$chol) ``` ``` # [1] 35 ``` ```r # same as quantile(data$chol,0.75)-quantile(data$chol, 0.25) ``` ``` # 75% # 35 ``` --- # Summary statistics per group ```r tapply(data$chol,data$dose,summary) ``` ``` # $`0` # Min. 1st Qu. Median Mean 3rd Qu. Max. # 190 200 200 210 220 240 # # $`50` # Min. 1st Qu. Median Mean 3rd Qu. Max. # 180 180 220 216 250 250 # # $`100` # Min. 1st Qu. Median Mean 3rd Qu. Max. # 190 200 200 208 210 240 ``` ```r # tapply(data$chol,data$dose,mean) ``` --- # Plotting ```r plot(data) ``` <!-- --> ```r # here same as # pairs(data) ``` --- # QQ-plot A QQ plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight. ```r qqplot(rnorm(100), rnorm(100)) ``` <!-- --> ```r # qqplot(runif(100), runif(100)) ``` --- # Scatterplot `pairs(data)` produces a pairwise scatterplot matrix of the variables defined by the columns of `data`, that is, every column of `data` is plotted against every other column of `data`. ```r plot(data$chol, data$dose, pch=20, cex=2) # add labels identify(data$chol, data$dose) ``` --- # Histogram ```r hist(data$chol, col="blue") ``` <!-- --> ```r # hist(data$chol, prob=TRUE) ``` --- # Barplot ```r barplot(data$chol, col="blue") ``` <!-- --> --- # Boxplot ```r boxplot(data$chol ~ data$dose, col="blue") ``` <!-- --> ```r # boxplot(chol ~ dose, data=data, col="blue") # ``` --- # Boxplot <img src=https://miro.medium.com/max/702/1*NRlqiZGQdsIyAu0KzP7LaQ.png style="width: 70%" /> --- # Useful R function .font150[ - `?function` or `help(function)` or `?help` - `??function` - `example(function)` - `demo()` ] --- # Saving a plot ```r png("outputs/Plot_name.png", width = 500, height = 500, res = 72) boxplot(data$chol ~ data$dose, col="blue", main = "Cholesterol level") dev.off() ``` ``` # png # 2 ```