R: Basic statistics
Back to Local tips for R
The tests covered here are primarily those dealt with in the Cambridge NST 1B Psychology undergraduate statistics course (see 2004-5 handout PDF).
Basic maths covered in the handout:
sum(x) # adds up a vector x^y sqrt(x) log(x) # default is natural log log(x, base=y) # any sort of log log2(x) # short cut log10(x) # short cut exp(x) # e^x factorial(x)
R provides a large number of functions to deal with probability distributions. They come in sets. A typical set has a density function, a cumulative probability distribution function, a quantile function, and a random number generating function. For example, the dnorm/pnorm/qnorm/rnorm set deals with the normal distribution:
dnorm(x, mean=0, sd=1) # Density function. # For this example, it is 0 at x=-infinity, reaches a peak of 0.39 at x=0, and descends symmetrically to zero at x=infinity. As for all probability density functions, the total area under the curve is 1. pnorm(q, mean=0, sd=1) # Cumulative probability distribution function. # Returns P(X <= q), where X is a random variable of the specified normal distribution. Default mean=0; default SD=1. For example, pnorm(0) = 0.5; pnorm(1.644854) = 0.95. qnorm(p, mean=0, sd=1) # Quantile function. # Returns q such that P(X <= q) = p. For example, qnorm(0) = -Inf; qnorm(0.5) = 0; qnorm(0.95) = 1.644854; qnorm(0.975) = 1.959964. rnorm(n, mean=0, sd=1) # Random number generator. # Generate n random numbers from the specified normal distribution.
By the way, to see a function plotted, use curve:
curve(dnorm(x), from=-5, to=5)
Just a few other distribution functions (they work in the same general way):
[d|p|q|r]norm() # Normal distribution [d|p|q|r]t() # Student t distribution [d|p|q|r]f() # F distribution [d|p|q|r]chisq() # Chi-square distribution
Basic descriptive statistics:
mean() median() fivenum() # Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) summary() # Summarizes model fits, and simple data (for which it gives minimum, first quartile, median, mean, third quartile, maximum). min() max() quantile() # calculates any quantile (not just the default of 0%, 25%, 50%, 75%, 100%) var() # sample variance sd() # sample standard deviation # Now to summarize things by factor(s): install.packages("Hmisc") # if not yet done library(Hmisc) # get the Hmisc library, for its "summary.formula" function summary(depvar ~ factor(s), data=mydataframe) # because you've passed a formula to summary, it calls summary.formula
Counts, boxplots, histograms, and so forth:
table(x) # shows counts of each value in x (producing a unidimensional list of frequencies with the values as dimnames) table(x)/length(x) # shows relative frequencies barplot(table(x)) # by virtue of the way table() produces its results, this produces a frequency histogram hist(x) # a more general way of doing a histogram (also deals with continuous data). # Use "prob=TRUE" to plot relative frequencies (so the area under the graph is 1). # The "breaks" option can be used for manual control of the number of bins. stem(x) # Stem-and-leaf plot boxplot(x) # Box-and-whisker plot plot(ecdf(x)) # Plot the empirical cumulative distribution function for x qqnorm(x) # Creates a normal Q-Q plot (plotting the quantiles of x against the quantiles of a normal distribution). qqline(x) # Adds a reference line for the Q-Q plot.
plot(x, y) # Basic scatterplot pairs(d) # Creates scatterplots for every pair of variables in a dataframe d
cov(x, y) # Also possible to use cov(x), where x is a matrix/dataframe, to produce all possible covariances between variables (columns).
Pearson product-moment correlation coefficient, r:
cor(x, y) # Also possible to use cor(x), where x is a matrix/dataframe, to produce all possible correlations between variables (columns).
Correlation with t test of significance, plus confidence intervals:
To perform significance testing on a matrix of correlations, see http://tolstoy.newcastle.edu.au/R/help/01c/2272.html.
Spearman's rho (and significance test), a nonparametric test:
cor.test(x, y, method="spearman")
Proceed as follows. First, let's perform the regression. Use lm (for linear model):
fit <- lm(y ~ x) # y is predicted by x
Now, let's see what that did:
$names  "coefficients" "residuals" "effects" "rank"  "fitted.values" "assign" "qr" "df.residual"  "xlevels" "call" "terms" "model" $class  "lm"
Show the intercept and slope:
fit # this is the object we just stored the results in
Summary, with intercept/slope and significance testing by t test (also gives r2 and adjusted r2):
Significance testing by analysis of variance (regression ANOVA):
Confidence intervals (default: 95% confidence intervals) on the intercept/slope:
Scatterplot with regression line:
plot(x, y) abline(fit)
plot(x, fit$residuals) abline(0,0) # add a reference line at 0
Following on from the above...
Confidence intervals on predictions (prediction intervals) for new x values (example values of 1 and 5 used here):
predict.lm(fit, newdata=data.frame(x=c(1,5)), interval="prediction") # there's also interval="confidence", but that's something else (with much tighter bounds); I'm not sure exactly what.
# Quite a few ways to do partial correlations. A simple one is to put them all as variables in a data matrix, # and use partial.cor to get partial correlations between every pair, controlling for all other variables. library(Rcmdr) # partial.cor is part of the Rcmdr library library(car) # example data data(DavisThin) # example data partial.cor(DavisThin)
Point-biserial correlation and phi: see handout, and use cor() as normal!
Basic t-tests, equal variances not assumed, and with the Welsh df correction:
# Independent two-group t-test t.test(y ~ x) # where y is numeric and x is a binary factor # Independent two-group t-test t.test(y1, y2) # where y1 and y2 are numeric # Paired t-test t.test(y1, y2, paired=TRUE) # where y1 and y2 are numeric # One sample t-test t.test(y, mu=3) # null hypothesis: mean (mu) = 3 # t-test, within a data frame, optionally specifying a subset of the data t.test(y ~ x, data=mydataframe, subset=sex=="M") # where y is numeric, x is a binary factor, and you want to look at cases where "sex" is "M"
Confidence intervals are given by default.
To specify equal variances and a pooled variance estimate:
# T-test with equal variances assumed (one example): t.test(y1, y2, var.equal=TRUE)
To specify a one-tailed t-test, use this sort of modification:
# One-tailed t test: is y1 less than y2 (one example)? t.test(y1, y2, alternative="less") # ... or the other way round (don't do both!) t.test(y1, y2, alternative="greater")
F test to compare two variances:
# F test to compare two variances. var.test(y1, y2)
y <- c(1,2,3,4,100) median(y)
Mann-Whitney U test (or, Wilcoxon rank sum test - logically equivalent; see handout PDF above) for two independent samples:
# Mann-Whitney U test (or, Wilcoxon rank sum test) for two independent groups. Two ways of doing the same thing: y1 <- c(1.68, 3.83, 3.11, 2.76, 1.70, 2.79, 3.05, 2.66, 1.40, 2.775) # example data y2 <- c(2.94, 3.38, 4.90, 2.81, 2.80, 3.21, 3.08, 2.95) # example data wilcox.test(y1, y2, paired=FALSE) # the "paired=FALSE" is the default, so can be omitted # Can also use alternative="two.sided"/"greater"/"less" to specify null hypothesis. # The output gives W, the Wilcoxon rank sum statistic (use [d|p|q|r]wilcox to use distribution functions by hand, if you want). # The way R calculates W is the same method as given for the Mann-Whitney U in the PDF handout linked to above. y3 <- c(1.68, 3.83, 3.11, 2.76, 1.70, 2.79, 3.05, 2.66, 1.40, 2.775, 2.94, 3.38, 4.90, 2.81, 2.80, 3.21, 3.08, 2.95) # example data a <- factor(c(rep(1,10),rep(2,8))) wilcox.test(y3 ~ a)
Wilcoxon matched-pairs signed-rank test (for two related samples):
y1 <- c(130, 148, 170, 125, 170, 130, 130, 145, 119, 160) # example data y2 <- c(120, 148, 163, 120, 135, 143, 136, 144, 119, 120) # example data wilcox.test(y1, y2, paired=TRUE) # Options to specify one- or two-tailed tests are as above. # The output gives V, the Wilcoxon signed rank statistic (use [d|p|q|r]signrank to use distribution functions by hand, if you want).
Specify the contingency table in a matrix. We'll call it C. Note that R fills matrices down columns, then across rows, by default (though you can use "byrow=TRUE" to alter that).
C<-matrix(c(30,60,70,30), nrow=2) C # display the matrix
[,1] [,2] [1,] 30 70 [2,] 60 30
Then run the test, as shown below. If the matrix is a vector, a goodness-of-fit test is performed (with the null hypothesis that all categories are equally likely. If the matrix has at least two rows and columns, then a contingency test is performed - by default, the expected values are calculated as (row total x column total)/grand total, as is typical in such tests. Optionally, a second matrix can be given to specify the expected values explicitly. The correct parameter, true by default, specifies whether Yates' continuity correction is applied.
# Chi-square test chisq.test(C)
To see the observed and expected values, store the output and interrogate it:
result<-chisq.test(C) result$observed result$expected
Note that if some expected values are <5, it is preferable to use Fisher's exact test:
# Fisher's Exact Test for count data fisher.test(C)
Is my die true? Another chi-square example:
counts <- c(41, 50, 67, 47, 25, 50) # rolls of a die: number of ones, twos, threes... probs <- rep(1/6, 6) chisq.test(counts, p = probs) # actually, specifying these particular probabilities is optional, since the default is to test against equiprobable categories anyway