Terms

p-value

The probability of obtaining the observed statistic, or a more extreme one, when the null hypothesis is true.

frequentist vs Bayesian approaches

Frequentist inference is based on frequentist probability, which treats the probability of an event as its long-run frequency. Only the data from the current experiment are used. Some common frequentist tools:
- p-value
- confidence interval
- hypothesis testing

In the Bayesian approach, probability expresses a degree of belief in an event. The prior, which can come from previous experiments or personal beliefs, is combined with the data in Bayesian inference.

Bayes theorem

\(P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}\)

reproducibility crisis

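Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigators. The reproducibility crisis is the observation that many published results cannot be reproduced.
Solutions: pre-registration, learning how to do statistics properly.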

statistic

Statistics are no substitute for common sense.
Ex. of what goes wrong: pressure to publish, p-hacking (discarding data, faking data), unethical researchers, small sample sizes, the file drawer problem.

parameter

Population quantities that we are trying to estimate, such as the population mean.

continuous variable

Quantitative or measurement variables. Ex: height, weight

discrete variable

Discrete variables take countable or categorical values. Ex: race

experimental vs observational studies

Experimental studies are ones where researchers introduce an intervention and study the effects. Experimental studies are usually randomized, meaning the subjects are grouped by chance.

Observational studies are ones where researchers observe the effect of a risk factor, diagnostic test, treatment or other intervention without trying to change who is or isn’t exposed to it.

𝜇

Population mean

𝜎

Population standard deviation

\(\bar{Y}\)

Sample mean

𝑠

Sample standard deviation

blinding

Blind: the subject does not know whether it is an experimental or a control subject.
Double blind: neither the researcher nor the subject knows which subjects are experimental and which are control.

pseudoreplication

It happens when the apparent sample size is larger than the true sample size, because observations that are not independent are treated as independent replicates.

biological vs technical replicates

A biological replicate involves a new, independent test subject.
A technical replicate involves repeating the same procedure on a new sample from the same subject.

outliers

In statistics, an outlier is a data point that differs significantly from other observations.

confounding variables

A confounding variable is a third variable that influences both the independent and dependent variables.

common faults in plots

sample

A subset of individuals drawn from a population

population

The entire group of individuals or items that we want to draw conclusions about.

transformation

Transforming your data (for example with a log or square-root transformation) so that they are closer to normally distributed and meet the assumptions of a statistical test.
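As a rough illustration of a transformation, the sketch below simulates right-skewed data and log-transforms it. The data are hypothetical (drawn with `rlnorm`), not from the course, and the Shapiro-Wilk test is just one possible way to check normality.

```r
set.seed(33)

# Hypothetical right-skewed measurements (log-normal), not course data
x <- rlnorm(200, meanlog = 2, sdlog = 0.8)
log_x <- log(x)

# Compare the shape before and after the transformation
op <- par(mfrow = c(1, 2))
hist(x, main = "Raw", xlab = "x")
hist(log_x, main = "Log-transformed", xlab = "log(x)")
par(op)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
shapiro.test(x)$p.value
shapiro.test(log_x)$p.value
```

Remember that any test run on `log_x` is a statement about the log scale, so results need to be back-transformed with `exp()` for reporting.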

parametric

Parametric statistics are based on assumptions about the distribution of population from which the sample was taken.

non-parametric

Nonparametric statistics do not assume a specific distribution for the population, that is, the data can come from a sample that does not follow a specific distribution.

MCMC

Example problems

I set the seed to 33 so that you get the same results wherever a sampling function is used.

set.seed(33)

R skills you should have

Create matrices, vectors, dataframes, and lists

Creating a matrix and adding data.

ma <- matrix(1:16, 4,4)
ma
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
Creating an empty vector and adding data.
a <- c()
a <- sample(10, 5, replace = T)
a
## [1] 10  8  6  9  2
a <- c(a, c(sample(100:200, 5, replace = T)))
a
##  [1]  10   8   6   9   2 191 152 200 147 128
dataframe

adding data in dataframe

f <- data.frame(sample(100:200, 5, replace = T), c('a','b','c','d','e'))
colnames(f) <- c("mpg", 'brand')
f
##   mpg brand
## 1 138     a
## 2 141     b
## 3 195     c
## 4 185     d
## 5 131     e
creating a list

A list is special because its elements can be of different types: characters, logicals, or whole vectors.

ls <- list(c('a','b','c'), c(T,F,T,T,F), c(sample(10,7, replace = T)))
Subset each of these objects

Choosing a data range, or subsetting.
For a vector, use single brackets to choose elements; a negative index drops elements. For a list, single brackets return a sub-list and double brackets extract a single component.

a <- a[-c(1:5)]
a
## [1] 191 152 200 147 128

If you want to choose an element inside a list component, use double brackets to extract the component, then single brackets to index into it.

ls[[1]][2]
## [1] "b"
ls[[3]][1]
## [1] 2

On the other hand, for the dataframe and matrix, you need to specify row and column. Row first, then column.

f[4,1]
## [1] 185
ma[3,3]
## [1] 11

You can also edit the data.

f[4,1] <- 50
ma[3,3] <- 111
f[4,1]
## [1] 50
ma[3,3]
## [1] 111
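Besides row and column positions, data frames can be subset by column name and by logical conditions. The sketch below uses a small hypothetical data frame `cars_df` with the same columns as `f` above.

```r
# Hypothetical data frame with the same columns as `f` (values typed in by hand)
cars_df <- data.frame(mpg = c(138, 141, 195, 185, 131),
                      brand = c("a", "b", "c", "d", "e"))

cars_df$mpg                             # one column, returned as a vector
cars_df[cars_df$mpg > 140, ]            # rows where mpg exceeds 140
cars_df[cars_df$brand == "c", "mpg"]    # mpg value for brand "c"
subset(cars_df, mpg < 140, select = brand)  # same idea with subset()
```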
Read a csv file to import data
read.csv("your data path")
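A self-contained way to try `read.csv` without a real file is to write a small table to a temporary file first. The file and the `beetles` table below are made up for the example.

```r
# Write a small hypothetical table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
beetles <- data.frame(species = c("a", "b", "c"),
                      weight = c(15.2, 35.1, 50.4))
write.csv(beetles, tmp, row.names = FALSE)

# Read it back and check what was imported
dat_in <- read.csv(tmp)
str(dat_in)    # column types
head(dat_in)   # first rows
```

Checking `str()` right after import is a cheap way to catch numbers that were read as text.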
Make a basic plot of 1, 2, or 3 variables that have a mix of continuous and discrete values

Simulate a data set about 3 different species of stag beetles with their L3 larval body weight and adult mandible size.

aw <- round(rnorm(20, 15, sd = 5), digits = 2)
bw <- round(rnorm(20, 35, sd = 10), digits = 2)
cw <- round(rnorm(20, 50, sd = 15), digits = 2)
am <- round(rnorm(20, 20, sd = 3), digits = 2)
bm <- round(rnorm(20, 30, sd = 5), digits = 2)
cm <- round(rnorm(20,45, sd = 10),digits = 2)
dat <- data.frame(c(aw, bw,cw),c(am,bm,cm),c(rep('a',20),rep('b',20), rep('c',20)))
colnames(dat) <- c('Body weight', 'Mandible length', 'Species')
plot(dat$`Mandible length`~dat$`Body weight`, pch = 16,
     col = c(1:3)[as.factor(dat$Species)],
     xlab = 'Body weight',ylab = 'Mandible length')
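When one variable is continuous and the other is discrete, a boxplot is a common choice; for a single discrete variable, a barplot of counts works. The sketch below simulates its own hypothetical `beetle` data so it runs on its own.

```r
set.seed(33)

# Hypothetical data: body weight (continuous) for three species (discrete)
beetle <- data.frame(
  weight  = c(rnorm(20, 15, 5), rnorm(20, 35, 10), rnorm(20, 50, 15)),
  species = rep(c("a", "b", "c"), each = 20)
)

# Continuous vs discrete: one box per species
boxplot(weight ~ species, data = beetle,
        xlab = "Species", ylab = "Body weight")

# One discrete variable: number of individuals per species
barplot(table(beetle$species), xlab = "Species", ylab = "Count")
```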

Perform a permutation or Monte Carlo test

Perform and correctly interpret the statistical tests mentioned below

Evaluate an MCMC log file

Check the example problems.

R functions you should handle with ease

binom.test

Two-tailed binomial test. You are rolling a die and it lands on 6 in 8 of 24 trials. You want to know whether the probability of landing on 6 is 1/6, as expected for a fair die.

binom.test(8,24,1/6)
## 
##  Exact binomial test
## 
## data:  8 and 24
## number of successes = 8, number of trials = 24, p-value = 0.04802
## alternative hypothesis: true probability of success is not equal to 0.1666667
## 95 percent confidence interval:
##  0.1563023 0.5532196
## sample estimates:
## probability of success 
##              0.3333333

The p-value is 0.048, which is below 0.05, so we reject the null hypothesis that the probability of rolling a 6 is 1/6. The die appears to be biased toward 6.
You toss a coin 10 times and observe 7 heads. The null hypothesis is that the probability of heads is 0.5; the alternative is that it is not 0.5.

binom.test(7,10, .5)
## 
##  Exact binomial test
## 
## data:  7 and 10
## number of successes = 7, number of trials = 10, p-value = 0.3438
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.3475471 0.9332605
## sample estimates:
## probability of success 
##                    0.7

The p-value is 0.34, so we fail to reject the null hypothesis. These data give no evidence that the coin is unfair.
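The die example (8 sixes in 24 rolls) can also be approached with a Monte Carlo test. The sketch below simulates many sets of 24 rolls of a fair die and asks how often 8 or more sixes appear. This is a one-sided version, so it is compared with the one-sided exact test rather than with the two-sided p-value; the number of simulations is an arbitrary choice.

```r
set.seed(33)

n_sim <- 100000   # number of simulated experiments (arbitrary choice)

# Number of sixes in 24 rolls of a fair die, repeated n_sim times
sixes <- rbinom(n_sim, size = 24, prob = 1/6)

# Monte Carlo p-value: share of simulations with 8 or more sixes
p_mc <- mean(sixes >= 8)

# Exact one-sided p-value for comparison
p_exact <- binom.test(8, 24, p = 1/6, alternative = "greater")$p.value

c(monte_carlo = p_mc, exact = p_exact)
```

The two values should agree closely, and the agreement improves as `n_sim` grows.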

chisq.test

The chi-square test of independence is used to examine whether two categorical variables are associated in the same population. Example: you want to know whether males and females tend to drive different brands of vehicles. Simulate a dataset and run a chi-square test.

dat <- matrix(c(160, 74, 30,75,42,27), 3,2,
              dimnames = list(c('a','b','c'), c('male',"female")))
chisq.test(dat)
## 
##  Pearson's Chi-squared test
## 
## data:  dat
## X-squared = 4.8561, df = 2, p-value = 0.08821

The p-value is 0.088, so we fail to reject the null hypothesis that brand and sex are independent.

t.test(single sample, two sample, paired)
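
The three variants of `t.test` can be sketched as follows. All data here are simulated and hypothetical (`before`, `after`, and `other` are made-up names), and `mu = 48` is an arbitrary reference value.

```r
set.seed(33)

# Hypothetical measurements on 15 subjects
before <- rnorm(15, mean = 50, sd = 5)
after  <- before + rnorm(15, mean = 2, sd = 3)   # same subjects, measured again
other  <- rnorm(15, mean = 55, sd = 5)           # an independent group

t.test(before, mu = 48)                # single sample: is the mean 48?
t.test(before, other)                  # two sample (Welch by default)
t.test(after, before, paired = TRUE)   # paired: same subjects twice
```

A paired t-test is the same as a single-sample t-test on the differences `after - before`.
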
aov
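
A minimal one-way ANOVA sketch, assuming simulated body weights for three hypothetical species. `TukeyHSD` is one possible follow-up to see which pairs of groups differ.

```r
set.seed(33)

# Hypothetical data: body weight for three species, 20 individuals each
beetle <- data.frame(
  weight  = c(rnorm(20, 15, 5), rnorm(20, 35, 10), rnorm(20, 50, 15)),
  species = factor(rep(c("a", "b", "c"), each = 20))
)

fit_aov <- aov(weight ~ species, data = beetle)
summary(fit_aov)    # F test: do the group means differ?
TukeyHSD(fit_aov)   # pairwise comparisons between species
```
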
lm
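
A minimal linear regression sketch. The data are simulated with a known slope of 0.7, an arbitrary value chosen for the example, so you can see that `lm` recovers it approximately.

```r
set.seed(33)

# Hypothetical data: mandible length increases linearly with body weight
weight   <- runif(40, min = 10, max = 60)
mandible <- 5 + 0.7 * weight + rnorm(40, sd = 4)

fit_lm <- lm(mandible ~ weight)
summary(fit_lm)   # slope, intercept, R-squared, p-values
coef(fit_lm)      # just the estimated coefficients

# Scatter plot with the fitted line
plot(weight, mandible, pch = 16)
abline(fit_lm, col = "red")
```
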
glm
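
A minimal `glm` sketch using logistic regression, which is one common use of `glm` for a 0/1 response. The dose and survival data are simulated and hypothetical.

```r
set.seed(33)

# Hypothetical data: survival (0/1) becomes more likely as dose increases
dose      <- runif(100, min = 0, max = 10)
p_survive <- plogis(-3 + 0.6 * dose)            # true probabilities
survived  <- rbinom(100, size = 1, prob = p_survive)

fit_glm <- glm(survived ~ dose, family = binomial)
summary(fit_glm)   # coefficients are on the log-odds scale

# Predicted probability of survival at dose = 5
p5 <- predict(fit_glm, newdata = data.frame(dose = 5), type = "response")
p5
```
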
prcomp
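
A minimal PCA sketch with `prcomp`, using the built-in `iris` data set as a stand-in since the notes do not give a data set for this function.

```r
# PCA on the four numeric columns of the built-in iris data
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)    # proportion of variance explained by each component
pca$rotation    # loadings: how each variable contributes to each PC

# Plot the first two principal components, coloured by species
plot(pca$x[, 1], pca$x[, 2], pch = 16,
     col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2")
```

Setting `scale. = TRUE` puts all variables on the same scale before the analysis, which matters when they are measured in different units.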

Example Problems

1. Suppose you are studying a pair of cryptic species. In your area 5% of individuals are species A and 95% of individuals are species B. There is currently no genetic assay capable of telling them apart. They differ however in the frequency of a rare color pattern. Species A has the rare color pattern 50% of the time while species B has the rare color pattern only 2% of the time. Assume these numbers are known with certainty, from many years of field research. Now suppose you find one of these species with the rare color pattern. Use Bayes theorem to compute the probability that it is from species A.
Species   P(species in area)   P(rare pattern | species)
A         0.05                 0.5
B         0.95                 0.02

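\(P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}\)
Here A = the individual is species A, and B = the individual has the rare pattern. P(B) is the overall probability of the rare pattern, summed over both species.

\(P(A \mid B)=\frac{0.5 \times 0.05}{0.95 \times 0.02 + 0.05 \times 0.5}\)

\(P(A \mid B)=\frac{0.025}{0.044}\approx 0.5682\)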

The probability that an individual with the rare pattern is from species A is 56.8%.
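The same calculation can be checked in R. This is a small sketch; the vector names `prior` and `likelihood` are just labels chosen for this example.

```r
# Prior: proportion of each species in the area
prior <- c(A = 0.05, B = 0.95)

# Likelihood: probability of the rare pattern given the species
likelihood <- c(A = 0.5, B = 0.02)

# Bayes theorem: posterior is prior times likelihood, divided by the total
posterior <- prior * likelihood / sum(prior * likelihood)
round(posterior, 4)
##      A      B 
## 0.5682 0.4318
```

Even though the rare pattern is 25 times more common in species A, the posterior is only a little above one half because species A itself is rare.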

2. Download the two MCMC log files from the course website. Choose the MCMC that represents a “good” run. Provide a description of the rate parameter for codon2 and codon3.
datmcmc1 <- read.csv("mcmc.log.csv")
datmcmc2 <- read.csv('mcmc.log2.csv')
plot(datmcmc1$likelihood, type = 'l', ylim = c(-5000, -4800))
lines(datmcmc2$likelihood, col = 'red')

From the plot we can see that the mcmc1 likelihood increases and stabilizes very early, so I would keep the last half of the samples. mcmc2 is very similar, but it is not stable at the end.

If you plot both runs in one plot, you will notice that neither one is really good; mcmc1 is the better of the two, so it is used below.

mean(datmcmc1$codon2[5000:10000])
## [1] 0.5281915
mean(datmcmc1$codon3[5000:10000])
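## [1] 9.449001

Using samples 5000 to 10000 of mcmc1, the mean rate is about 0.53 for codon2 and about 9.45 for codon3, so the codon3 rate is roughly 18 times the codon2 rate.
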
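A mean alone is a thin description of a parameter. The sketch below shows one way to discard burn-in and report a mean with a 95% interval. It uses a simulated stand-in `chain` rather than the course log files, and the burn-in length of 5000 is an assumption chosen to match the indexing above.

```r
set.seed(33)

# Stand-in for one column of an MCMC log (not the course data):
# a short climb during burn-in, then a stable plateau
chain <- c(seq(-5000, -4850, length.out = 500),
           rnorm(9500, mean = -4850, sd = 5))

burnin <- 1:5000          # samples to discard (assumed burn-in)
post   <- chain[-burnin]  # samples kept for summaries

# Trace plot with the burn-in boundary marked
plot(chain, type = "l")
abline(v = max(burnin), lty = 2)

mean(post)                        # point estimate
quantile(post, c(0.025, 0.975))   # central 95% interval

# Rough stability check: the two halves of the kept samples should agree
mean(post[1:2500]) - mean(post[2501:5000])
```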
3. Grasshoppers recover movement of a leg after nerve damage. Download the grasshopper dataset from the course website. It has four columns that describe range of motion before injury, directly after injury, after a 2-week recovery, and then after crushing the primary nerve a second time. The grasshoppers could recover movement by repairing the crushed nerve; if so, crushing it a second time should cause them to lose range of motion. However, if they recover range of motion by using other nerves serving the legs, then re-crushing should have no impact on range of motion. Determine whether the grasshoppers are repairing the damaged nerve or using alternate pathways to recover range of motion.
datgrasshoppers <- read.csv("Downloads/grasshopper.csv")
boxplot(datgrasshoppers)

As I understand it, we need to compare the re-crush and post-crush measurements.

If grasshoppers repair the damaged nerve, the second damage will cause loss of motion

If grasshoppers use alternate pathways to recover, the second damage will not cause loss of motion

Looking at the data, it seems quite clear that the second crush causes the grasshoppers to lose range of motion. To test whether post-crush and re-crush differ, a t-test might be a good option. However, one of the assumptions of the t-test is that the data are normally distributed.

hist(datgrasshoppers$postcrush)

hist(datgrasshoppers$recrush)

These are not normally distributed, so we need another way to test.

So, I’d go with a permutation test. My null hypothesis is that there is no difference between post-crush and re-crush.

emmean <- mean(datgrasshoppers$postcrush)-mean(datgrasshoppers$recrush)
emmean
## [1] 0.08
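# pool both groups; under the null the group labels are exchangeable
mo <- c(datgrasshoppers$postcrush, datgrasshoppers$recrush)
n_perm <- 100000
null <- numeric(n_perm)  # pre-allocate instead of growing the vector
for (i in 1:n_perm){
  r <- sample(mo)  # shuffle the pooled values
  # difference in means between the two shuffled groups of 30
  null[i] <- mean(r[1:30]) - mean(r[31:60])
}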
plot(density(null))
abline(v=abs(emmean), col = 'red')

pval<-sum(abs(null)>=abs(emmean))/100000
pval
## [1] 0.9403

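The p-value is 0.94, which means that we fail to reject the null hypothesis: there is no detectable difference between post-crush and re-crush. Range of motion after the second crush is back at the level seen directly after the first crush, which is consistent with the grasshoppers repairing the damaged nerve.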
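If each row of the data set is the same grasshopper measured at every stage, the measurements are paired, and a paired permutation test is an alternative to the pooled shuffle used above. The sketch below is not the approach used in the answer; it runs on simulated, hypothetical data and randomly flips the sign of each within-individual difference.

```r
set.seed(33)

# Hypothetical paired measurements for 30 individuals (not the course data)
post <- rnorm(30, mean = 20, sd = 5)
re   <- post + rnorm(30, mean = 0, sd = 2)

d   <- post - re   # within-individual differences
obs <- mean(d)     # observed mean difference

# Under the null, each difference is equally likely to be positive or negative
n_perm <- 10000
null <- replicate(n_perm,
                  mean(d * sample(c(-1, 1), length(d), replace = TRUE)))

# Two-sided p-value
p_val <- mean(abs(null) >= abs(obs))
p_val
```

Keeping the pairs together removes variation between individuals from the comparison, which usually makes the test more sensitive.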