the probability of finding the observed, or a more extreme, statistic when the null hypothesis is true.

##### Frequentist vs. Bayesian approaches
Frequentist inference is based on frequentist probability: it treats probability as long-run frequency, and only the data from the current experiment are used. Some common frameworks (see the sketch below):

- p-values
- confidence intervals
- hypothesis testing
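All three show up in the output of a single t.test call; a minimal sketch on simulated data (the numbers are hypothetical):

x <- rnorm(20, mean = 10, sd = 2)  # hypothetical sample
t.test(x, mu = 10)  # returns a p-value and a 95% confidence interval for the hypothesis test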
In the Bayesian approach, probability expresses a degree of belief in an event. A prior, which can come from previous experiments or personal beliefs, is incorporated into the Bayesian inference.
\(P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}\)
Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as the original investigators. Solutions: pre-registration and learning to do statistics properly.
Statistics are no substitute for common sense. Things that go wrong: pressure to publish, p-hacking (e.g., selectively discarding data), data faking, unethical researchers, small sample sizes, and the file drawer problem.
Parameters are the population quantities that we are trying to estimate.
Continuous variables are quantitative or measurement variables, e.g., height and weight.
Discrete variables take countable values; categorical variables, such as race, take values that are labels rather than numbers.
Experimental studies are ones where researchers introduce an intervention and study its effects. They are usually randomized, meaning subjects are assigned to groups by chance.
Observational studies are ones where researchers observe the effect of a risk factor, diagnostic test, treatment, or other intervention without trying to change who is or isn't exposed to it.
- Population mean: \(\mu\)
- Population standard deviation: \(\sigma\)
- Sample mean: \(\bar{x}\)
- Sample standard deviation: \(s\)
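In R, the sample versions are simply mean() and sd(); a small sketch with hypothetical data:

x <- rnorm(25, mean = 50, sd = 10)  # hypothetical sample
mean(x)  # sample mean, estimating the population mean
sd(x)    # sample standard deviation, estimating the population SD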
Blind: the subject does not know whether they are an experimental or a control subject. Double-blind: neither the researcher nor the subject knows which subjects are experimental and which are control.
Pseudoreplication happens when the apparent sample size is larger than the true sample size.
A biological replicate involves a new, independent test subject. A technical replicate involves repeating the same procedure on another sample from the same subject.
In statistics, an outlier is a data point that differs significantly from other observations.
A confounding variable is a third variable that influences both the independent and dependent variables.
A sample is a subset of individuals drawn from a population.
A population is the entire group of individuals we want to draw conclusions about.
Transformation: transform your data to be closer to normally distributed in order to run parametric statistical tests (a sketch follows below).
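A minimal sketch, assuming right-skewed (hypothetical) data; a log transform often brings such data closer to normal:

x <- rlnorm(100)  # hypothetical right-skewed data
hist(x)  # skewed
hist(log(x))  # much closer to normal
shapiro.test(log(x))  # formal normality check on the transformed data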
Parametric statistics are based on assumptions about the distribution of the population from which the sample was taken.
Nonparametric statistics are not based on such assumptions; the data can come from a sample that does not follow a specific distribution.
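For example, when normality is doubtful, a nonparametric test such as the Wilcoxon rank-sum test can stand in for the two-sample t-test (the data here are hypothetical):

x <- rexp(15)  # hypothetical non-normal sample
y <- rexp(15, rate = 0.5)  # second hypothetical sample
wilcox.test(x, y)  # nonparametric alternative to t.test(x, y)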
I set the seed to 33, so you can get the same results wherever a sampling function is involved.
set.seed(33)
Creating a matrix and filling it with data.
ma <- matrix(1:16, 4,4)
ma
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
a <- sample(10, 5, replace = T)  # five draws from 1:10, with replacement
a
## [1] 10 8 6 9 2
a <- c(a, sample(100:200, 5, replace = T))  # append five draws from 100:200
a
## [1] 10 8 6 9 2 191 152 200 147 128
Adding data to a data frame.
f <- data.frame(sample(100:200, 5, replace = T), c('a','b','c','d','e'))
colnames(f) <- c("mpg", 'brand')
f
## mpg brand
## 1 138 a
## 2 141 b
## 3 195 c
## 4 185 d
## 5 131 e
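To append another row later, rbind a one-row data frame with matching column names (the values here are hypothetical):

f <- rbind(f, data.frame(mpg = 120, brand = 'f'))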
A list is special because it can hold characters, logicals, or whole vectors as elements.
ls <- list(c('a','b','c'), c(T,F,T,T,F), sample(10, 7, replace = T))  # character, logical, and numeric elements
Choosing a data range, or subsetting: if it is a list or vector, use single or double brackets to choose the data.
a <- a[-c(1:5)]  # drop the first five elements
a
## [1] 191 152 200 147 128
To choose an element inside a list, use double brackets to pick the list component, then single brackets to index within it.
ls[[1]][2]
## [1] "b"
ls[[3]][1]
## [1] 2
On the other hand, for a data frame or matrix, you need to specify row and column: row first, then column.
f[4,1]
## [1] 185
ma[3,3]
## [1] 11
And you can also edit the data:
f[4,1] <- 50
ma[3,3] <- 111
f[4,1]
## [1] 50
ma[3,3]
## [1] 111
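You can also edit many cells at once with a logical index; a small sketch using a hypothetical cutoff:

f$mpg[f$mpg > 150] <- NA  # blank out values above a hypothetical cutoff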
read.csv("your data path")  # replace with the path to your own CSV file
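For example, assigning the result so you can work with it (the file name here is hypothetical):

dat <- read.csv("data/beetles.csv")  # hypothetical path
head(dat)  # inspect the first few rows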
Simulate a data set of three stag beetle species, with L3 larval body weight and adult mandible size.
aw <- round(rnorm(20, 15, sd = 5), digits = 2)
bw <- round(rnorm(20, 35, sd = 10), digits = 2)
cw <- round(rnorm(20, 50, sd = 15), digits = 2)
am <- round(rnorm(20, 20, sd = 3), digits = 2)
bm <- round(rnorm(20, 30, sd = 5), digits = 2)
cm <- round(rnorm(20, 45, sd = 10), digits = 2)
dat <- data.frame(c(aw, bw, cw), c(am, bm, cm), c(rep('a', 20), rep('b', 20), rep('c', 20)))
colnames(dat) <- c('Body weight', 'Mandible length', 'Species')
plot(dat$`Mandible length`~dat$`Body weight`, pch = 16,
col = c(1:3)[as.factor(dat$Species)],
xlab = 'Body weight',ylab = 'Mandible length')
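Optionally, add a legend so the colors map back to species (this reuses the same factor coding as the plot call above):

legend('topleft', legend = levels(as.factor(dat$Species)), col = 1:3, pch = 16)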
##### Example problems
Two-tailed binomial test: you are rolling a die and it lands on 6 eight times in 24 trials. You want to know whether the probability of landing on each number is 1/6.
binom.test(8,24,1/6)
##
## Exact binomial test
##
## data: 8 and 24
## number of successes = 8, number of trials = 24, p-value = 0.04802
## alternative hypothesis: true probability of success is not equal to 0.1666667
## 95 percent confidence interval:
## 0.1563023 0.5532196
## sample estimates:
## probability of success
## 0.3333333
The p-value is 0.048, which is below 0.05, so we reject the null hypothesis that the probability of rolling each number is 1/6.
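As a cross-check, the exact two-sided p-value can be rebuilt from the binomial density by summing the probabilities of all outcomes at most as likely as the observed one (this mirrors what binom.test does, up to a small numerical tolerance):

d <- dbinom(0:24, 24, 1/6)  # probability of every possible count of sixes
sum(d[d <= dbinom(8, 24, 1/6)])  # ~0.048, matching binom.test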
Tossing a coin 10 times, you observed 7 heads. The null hypothesis is that the probability of heads is 0.5.
binom.test(7,10, .5)
##
## Exact binomial test
##
## data: 7 and 10
## number of successes = 7, number of trials = 10, p-value = 0.3438
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3475471 0.9332605
## sample estimates:
## probability of success
## 0.7
The p-value is 0.34, so we fail to reject the null hypothesis.
The chi-square test is a statistical procedure used to examine differences between categorical variables in the same population. Example: you want to know whether males and females tend to drive different brands of vehicles. Simulate a data set and run a chi-square test:
dat <- matrix(c(160, 74, 30,75,42,27), 3,2,
dimnames = list(c('a','b','c'), c('male',"female")))
chisq.test(dat)
##
## Pearson's Chi-squared test
##
## data: dat
## X-squared = 4.8561, df = 2, p-value = 0.08821

The p-value is 0.088, which is above 0.05, so we fail to reject the null hypothesis that vehicle brand is independent of sex.
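To see what the test compares against, the expected counts under independence can be pulled from the fitted object:

chisq.test(dat)$expected  # expected counts if brand and sex were independent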
Bayes' theorem example: two species share a habitat; the table gives each species' share of the area and its probability of showing a rare pattern.

| species | prob. in area | rare pattern |
|---|---|---|
| A | 0.05 | 0.5 |
| B | 0.95 | 0.02 |
\(P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}\)

where \(P(A)\) is the probability that an individual is from species A and \(P(B)\) is the probability of observing the rare pattern.
\(P(A \mid B)=\frac{0.5 \times 0.05}{0.95 \times 0.02 + 0.05 \times 0.5}\)

\(P(A \mid B) \approx 0.568\)
Given that an individual shows the rare pattern, the probability that it belongs to species A is about 56.8%.
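The same arithmetic in R, plugging in the numbers from the table above:

p_A <- 0.05  # prior: species A's share of the area
p_B_given_A <- 0.5  # P(rare pattern | species A)
p_B_given_B <- 0.02  # P(rare pattern | species B)
p_B <- p_B_given_A * p_A + p_B_given_B * 0.95  # total probability of the rare pattern
p_B_given_A * p_A / p_B  # posterior P(species A | rare pattern), ~0.568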
datmcmc1 <- read.csv("mcmc.log.csv")
datmcmc2 <- read.csv('mcmc.log2.csv')
plot(datmcmc1$likelihood, type = 'l', ylim = c(-5000, -4800))  # trace of chain 1
lines(datmcmc2$likelihood, col = 'red')  # overlay chain 2
From the plot we can see that the mcmc1 likelihood increases and stabilizes very early, so I would keep only the last half of the samples. mcmc2 is very similar, but it is not stable at the end.
If you plot them together, you will notice that neither one is perfect.
# posterior means after discarding the first ~5000 samples as burn-in
mean(datmcmc1$codon2[5000:10000])
## [1] 0.5281915
mean(datmcmc1$codon3[5000:10000])
## [1] 9.449001
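If the coda package is installed (an assumption here, not part of the original analysis), the effective sample size is a quick extra convergence check on the retained samples:

library(coda)  # assumes coda is installed
effectiveSize(as.mcmc(datmcmc1$codon2[5000:10000]))  # how many independent draws these samples are worth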
datgrasshoppers <- read.csv("Downloads/grasshopper.csv")
boxplot(datgrasshoppers)
As I understand it, we need to compare the post-crush and re-crush measurements:

- If grasshoppers repair the damaged nerve, the second damage will again cause loss of motion.
- If grasshoppers use alternate pathways to recover, the second damage will not cause loss of motion.

Taking a look at the data, it seems clear that the second crush causes loss of motion. To test whether post-crush and re-crush differ, a t-test might be a good option; however, one of the t-test's assumptions is that the data are normally distributed.
hist(datgrasshoppers$postcrush)
hist(datgrasshoppers$recrush)
The data are not normally distributed, so we need another approach. I'd go with a permutation test. My null hypothesis is that there is no difference between post-crush and re-crush.
emmean <- mean(datgrasshoppers$postcrush) - mean(datgrasshoppers$recrush)  # observed mean difference
emmean
## [1] 0.08
mo <- c(datgrasshoppers$postcrush, datgrasshoppers$recrush)  # pool both groups
null <- numeric(100000)  # preallocate the null distribution
for (i in 1:100000){
  r <- sample(mo)  # shuffle the pooled values
  null[i] <- mean(r[1:30]) - mean(r[31:60])  # assumes 30 observations per group
}
plot(density(null))  # null distribution of the mean difference
abline(v = abs(emmean), col = 'red')  # observed difference
pval <- sum(abs(null) >= abs(emmean))/100000  # two-tailed permutation p-value
pval
## [1] 0.9403
The p-value is 0.94, which means we fail to reject the null hypothesis: there is no detectable difference between post-crush and re-crush. This supports the hypothesis that the grasshoppers repair the damaged nerve.
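As a quick cross-check (not part of the original analysis), a nonparametric Wilcoxon test on the same two columns should point the same way:

wilcox.test(datgrasshoppers$postcrush, datgrasshoppers$recrush)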