5 Probability distributions
5.1 What are they? Full info
We have talked about several distributions and tests so far in the lab. To see the help files for most of them, see ?Distributions.
For the normal distribution see dnorm.
For the Poisson distribution see dpois.
For the Student’s t distribution see dt.
For the uniform distribution see dunif.
For the beta distribution see dbeta.
For the binomial (including Bernoulli) distribution see dbinom.
For the Cauchy distribution see dcauchy.
For the chi-squared distribution see dchisq.
For the exponential distribution see dexp.
For the F distribution see df.
For the gamma distribution see dgamma.
For the geometric distribution see dgeom. (This is also a special case of the negative binomial.)
For the hypergeometric distribution see dhyper.
For the log-normal distribution see dlnorm.
For the multinomial distribution see dmultinom.
For the negative binomial distribution see dnbinom.
For the Weibull distribution see dweibull.
For less common distributions of test statistics see pbirthday, dsignrank, ptukey and dwilcox (and see the ‘See Also’ section of cor.test).
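Each d function listed here has companions that share the same suffix: the p prefix gives cumulative probabilities, q gives quantiles, and r draws random values. A minimal sketch using the Poisson distribution (the rate of 3 is an arbitrary value chosen for illustration, not something from the lab data):

```r
# d = density/mass, p = cumulative probability, q = quantile, r = random draws
lambda <- 3                                # illustrative rate (assumption)

prob.equal  <- dpois(2, lambda = lambda)   # P(X = 2)
prob.atmost <- ppois(2, lambda = lambda)   # P(X <= 2)
median.x    <- qpois(0.5, lambda = lambda) # smallest x with P(X <= x) >= 0.5
draws       <- rpois(5, lambda = lambda)   # five random counts

prob.equal
prob.atmost
median.x
draws
```

The same pattern applies to the other families, for example dnorm, pnorm, qnorm and rnorm for the normal distribution.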
5.2 The normal distribution
Remember from lectures that a variable modelled by a normal distribution is usually written as:
X ∼ N(μ, σ²)
In this expression:
X is the random variable.
∼ means “is distributed as.”
N represents the normal distribution.
μ is the mean of the distribution.
σ² is the VARIANCE of the distribution.
In R commands, you need the standard deviation instead, which is the square root of the variance.
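For example, a variable described as X ∼ N(4, 9) has variance 9, so the R functions need sd = 3. A minimal sketch with those illustrative values (they are assumptions for the example, not lab data):

```r
# X ~ N(mean = 4, variance = 9): R functions take the standard deviation
variance <- 9
std.dev  <- sqrt(variance)      # 3

# Probability of a value below the mean, which is 0.5 by symmetry
p.below.mean <- pnorm(4, mean = 4, sd = std.dev)
p.below.mean
```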
5.2.2 Generate a random sample
To generate a random sample from a normal distribution we use rnorm:
# random sample of size 100
sample.normal <- rnorm(n=100,mean=4,sd=2)
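Because rnorm draws new random values on every run, the sample changes each time the code is executed. A minimal sketch of making the sample reproducible with set.seed and checking it against the requested parameters (the seed value 123 is arbitrary):

```r
set.seed(123)                                    # arbitrary seed, fixes the random draws
sample.normal <- rnorm(n = 100, mean = 4, sd = 2)

sample.size <- length(sample.normal)             # 100 values
sample.mean <- mean(sample.normal)               # close to 4, but not exactly 4
sample.sd   <- sd(sample.normal)                 # close to 2, but not exactly 2

sample.size
sample.mean
sample.sd
```

The sample mean and standard deviation are estimates, so they will only approximate the values passed to rnorm.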
5.2.3 Calculate probability when given a z-score
To calculate a z-score from your sample/population, you can use R as a calculator.
To calculate the probability of being greater/less than a value in a given normal distribution, use pnorm (e.g. you can use this as an interactive z-table):
# probability of less than 1.7 in a normal distribution with mean 4 and standard deviation = 2
pnorm(1.7,mean=4,sd=2,lower.tail = TRUE)
## [1] 0.1250719
# probability of GREATER than 1 in a normal distribution with mean 4 and VARIANCE = 9 (so sd = 3)
1 - pnorm(1,mean=4,sd=3,lower.tail = TRUE)
## [1] 0.8413447
# or, equivalently
pnorm(1,mean=4,sd=3,lower.tail = FALSE)
## [1] 0.8413447
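The z-score is the distance from the mean in units of standard deviations, z = (x - μ)/σ. A sketch showing that looking up the z-score in the standard normal distribution gives the same probability as passing the mean and sd to pnorm directly (the values 1.7, 4 and 2 are reused from the first example in this section):

```r
x     <- 1.7
mu    <- 4
sigma <- 2

z <- (x - mu) / sigma                         # z-score: -1.15

p.from.z <- pnorm(z)                          # standard normal: mean 0, sd 1 by default
p.direct <- pnorm(x, mean = mu, sd = sigma)   # same probability without standardising

all.equal(p.from.z, p.direct)                 # TRUE
```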
5.2.4 Calculate z-score when given a probability
Inversely, to calculate the value (quantile) for a given probability:
# what value has 60% of the data below it?
qnorm(0.6,mean=4,sd=2,lower.tail = TRUE)
## [1] 4.506694
# what value has 80% of the data ABOVE it?
qnorm(0.8,mean=4,sd=2,lower.tail = FALSE)
## [1] 2.316758
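qnorm and pnorm are inverses of each other, which gives a quick way to check a result. A minimal sketch using the same distribution (mean 4, sd 2):

```r
# Value with 60% of the distribution below it
q60 <- qnorm(0.6, mean = 4, sd = 2)

# Feeding that value back into pnorm returns the probability 0.6
p.back <- pnorm(q60, mean = 4, sd = 2)
p.back

# With lower.tail = FALSE the probability is the area ABOVE the value,
# so the value with 80% above it is also the value with 20% below it
q.upper <- qnorm(0.8, mean = 4, sd = 2, lower.tail = FALSE)
q.lower <- qnorm(0.2, mean = 4, sd = 2, lower.tail = TRUE)
all.equal(q.upper, q.lower)      # TRUE
```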
5.2.5 Testing normality
5.2.5.1 Shapiro-Wilk test for normality
To test for normality:
First, have a look at the histogram! Here is the code for the Shapiro-Wilk test.
shapiro.test(HousesNY$Price)
##
## Shapiro-Wilk normality test
##
## data: HousesNY$Price
## W = 0.96341, p-value = 0.1038
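The null hypothesis of this test is that the data are normally distributed. Here the p-value (0.1038) is above 0.05, so there is no evidence against normality. There are many online tutorials with more detail on interpretation.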
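A sketch of running the test and using its p-value in code, with simulated data standing in for a real column (nothing here uses HousesNY). A small p-value is evidence against normality; the 0.05 cutoff is a common convention and an assumption of this example.

```r
set.seed(1)                               # arbitrary seed for reproducibility
skewed.data <- rexp(100, rate = 1)        # simulated, strongly right-skewed data

result <- shapiro.test(skewed.data)
result$p.value                            # the p-value as a plain number

# Compare against the chosen cutoff (0.05 is an assumed convention)
if (result$p.value < 0.05) {
  verdict <- "Evidence against normality"
} else {
  verdict <- "No evidence against normality"
}
verdict
```

Swapping skewed.data for a sample from rnorm should usually give a much larger p-value.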
5.2.5.2 QQ-Norm plot
You can also make a QQ-Norm plot. Install the ggpubr package, add it to your library code chunk and run.
You can interpret it here: https://www.learnbyexample.org/r-quantile-quantile-qq-plot-base-graph/
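As a sketch that needs no extra packages, base R can also draw a QQ-Norm plot with qqnorm and qqline. The data below is simulated for illustration; replace it with your own column, such as HousesNY$Price. With ggpubr loaded, ggqqplot plays a similar role.

```r
set.seed(42)                                     # arbitrary seed
example.data <- rnorm(60, mean = 110, sd = 40)   # simulated stand-in data

# Sample quantiles plotted against theoretical normal quantiles
qq <- qqnorm(example.data)

# Reference line: points lying close to it suggest the data are roughly normal
qqline(example.data)
```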
5.3 T-distribution
What even is this? See this nice resource: https://365datascience.com/tutorials/statistics-tutorials/students-t-distribution/
To see the help file for all these:
?TDist
5.3.1 Calculate a probability given a T-Statistic
To calculate a t-statistic from your sample/population, you can use R as a calculator. To calculate the probability of being greater/less than a value in a given t-distribution, use pt (e.g. you can use this as an interactive t-table):
# probability of seeing less than 1.55 in a t-distribution
# with 20 degrees of freedom
pt(1.55,df=20,lower.tail = TRUE)
## [1] 0.9315892
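The same function gives upper-tail and two-sided probabilities, which is what a printed t-table is usually used for. A minimal sketch reusing the t-statistic 1.55 with 20 degrees of freedom:

```r
t.stat   <- 1.55
deg.free <- 20

p.lower <- pt(t.stat, df = deg.free, lower.tail = TRUE)    # P(T < 1.55)
p.upper <- pt(t.stat, df = deg.free, lower.tail = FALSE)   # P(T > 1.55)

# Two-sided p-value: a value at least this far from 0 in either direction
p.two.sided <- 2 * pt(abs(t.stat), df = deg.free, lower.tail = FALSE)

p.lower + p.upper    # the two tails always sum to 1
p.two.sided
```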
5.3.2 Calculate a T-Statistic for a given probability
To calculate the value for a given probability
# what value is greater than 90% of the data in a t-distribution with df=25
qt(0.9,df=25,lower.tail = TRUE)
## [1] 1.316345
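A common use of qt is finding the critical value for a confidence interval. A sketch for a 95% interval with df = 25 (the confidence level is chosen for illustration): 2.5% is left in each tail, so the quantile needed is 0.975.

```r
conf.level <- 0.95                    # illustrative confidence level (assumption)
alpha      <- 1 - conf.level

t.crit.upper <- qt(1 - alpha / 2, df = 25)   # upper critical value
t.crit.lower <- qt(alpha / 2, df = 25)       # lower critical value

t.crit.upper
# The t-distribution is symmetric about 0, so the lower value is the negative of the upper
t.crit.lower
```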
5.3.3 One-sample T-test
To conduct a full t-test on some data:
# Conduct a two-sided one-sample t-test of whether the population mean is 100 (H0: mean = 100).
t.test(HousesNY$Price,mu=100,alternative="two.sided")
##
## One Sample t-test
##
## data: HousesNY$Price
## t = 2.3954, df = 52, p-value = 0.02024
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
## 102.2125 125.0516
## sample estimates:
## mean of x
## 113.6321
For more detail, see this tutorial on the one-sample t-test: http://www.sthda.com/english/wiki/one-sample-t-test-in-r
and this one for comparing two samples: http://www.sthda.com/english/wiki/unpaired-two-samples-t-test-in-r
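To connect the t.test output to pt from the earlier section, here is a sketch that computes the one-sample t-statistic and two-sided p-value by hand and compares them with t.test. The data is simulated for illustration (the sample size, mean and sd are assumptions), so the numbers will differ from the HousesNY output.

```r
set.seed(10)                                    # arbitrary seed
x   <- rnorm(53, mean = 110, sd = 40)           # simulated sample, not the real prices
mu0 <- 100                                      # hypothesised mean

n       <- length(x)
t.stat  <- (mean(x) - mu0) / (sd(x) / sqrt(n))  # one-sample t-statistic
p.value <- 2 * pt(abs(t.stat), df = n - 1, lower.tail = FALSE)

result <- t.test(x, mu = mu0, alternative = "two.sided")

all.equal(t.stat, unname(result$statistic))     # TRUE
all.equal(p.value, result$p.value)              # TRUE
```

For a one-sided test, t.test also accepts alternative = "greater" or alternative = "less".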