This lecture covers the following section in the An Introduction to R:
The basic idea in R function writing can be demonstrated by the following example which generates the mean and variance of n random normal variates.
mixsiml <- function (n) {
x <- c(1:n);
for(i in c(1:n)) {x[i] <- 0.5*(rnorm(1)+3*rnorm(1))}
mx <- mean (x); vx <- var (x)
cat(" mean of x is ",mx, "\n")
cat(" var of x is ",vx,"\n") } # "\n" for a new line
mixsiml(20)
mean of x is 1.083920 var of x is 2.487287To output values as a data set, use "list".:
mixsim2 <- function (n){
x <- c(1:n)
y <- c(1:n)
for(i in c(1:n)){
x[i] <- 0.5* (rnorm(1)+3*rnorm(1))
y[i] <- x[i]*2.0
}
list (x1=x, yl=y)
}
a <- mixsim2(4)
a
$x1: [1] -1.6163232 -1.4401603 -1.2824844 0.3978134 $yl: [1] -3.2326465 -2.8803205 -2.5649688 0.7956267
> a$x1
[1] -1.6163232 -1.4401603 -1.2824844 0.3978134
In other words, the outputs are divided into sub-categories a$x1 and a$y1.
It is my experience, the R is not a good for writing long programs. The error messages are just too vague. One way that helps debugging is to use cat() inside function to get detailed intermediate values. Fortunately, R has thousands of canned programs. Thus, a short program is usually enough to do many statistical jobs. A list of functions can be found in Functions & Datasets . Here are examples.> A<-scan() > 79.98 80.04 80.02 80.04 80.03 80.03 80.004 79.97 > 80.05 80.03 80.02 80.00 80.02 > B<-scan() > 80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97 > library(ctest) > t.test(A,B) # Welch Two sample t-test
Welch Two Sample t-test data: A and B t = 3.0467, df = 11.899, p-value = 0.01024 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.01115442 0.06734558 sample estimates: mean of x mean of y 80.01800 79.97875
> var.test(A,B) #F test to compare two variances
F test to compare two variances
data: A and B
F = 0.5678, num df = 12, denom df = 7, p-value = 0.3713
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1216915 2.0477498
sample estimates:
ratio of variances
0.5677919
> t.test(A,B,var.equal=TRUE) #Two Sample t-test assuming equal variance
Two Sample t-test data: A and B t = 3.2658, df = 19, p-value = 0.004067 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.01409501 0.06440499 sample estimates: mean of x mean of y 80.01800 79.97875
wilcox.test(A,B) #Wilcoxon rank sum test with continuity correction
Wilcoxon rank sum test with continuity correction data: A and B W = 87, p-value = 0.01157 alternative hypothesis: true mu is not equal to 0
Descriptionlm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).
Usage lm(formula, data, subset, weights, ...) Arguments formula a symbolic description of the model to be fit. data an optional data frame containing the variables in the model. weights an optional vector of weights to be used in the fitting process. If specified, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used. An example: Morley's measurements on the speed of light (last three digits in 299***Km/sec.) There are 5 experiments (Expt) each with 20 replicates (Run). > data(morley) > Morley
Expt Run Speed
1 1 1 850
2 1 2 740
3 1 3 900
4 1 4 1070
5 1 5 930
6 1 6 850
7 1 7 950
..................
95 5 15 810
96 5 16 940
97 5 17 950
98 5 18 800
99 5 19 810
100 5 20 870
> morley$Expt<-factor(morley$Expt) # Define Expt as a factor (discrete) > attach(morley) > fm<-lm(Speed~Run+Expt, data=morley) # Run is assumed to be continuous. > summary(fm) #print the outputs
Residuals:
Min 1Q Median 3Q Max
-257.768 -43.329 1.926 42.250 158.713
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 912.6947 21.5119 42.427 < 2e-16 ***
Run -0.3519 1.2937 -0.272 0.786222
Expt2 -53.0000 23.5900 -2.247 0.026999 *
Expt3 -64.0000 23.5900 -2.713 0.007930 **
Expt4 -88.5000 23.5900 -3.752 0.000304 ***
Expt5 -77.5000 23.5900 -3.285 0.001432 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 74.6 on 94 degrees of freedom
Multiple R-Squared: 0.1536, Adjusted R-squared: 0.1086
F-statistic: 3.412 on 5 and 94 DF, p-value: 0.007075
Note Exp1 is used as a reference for Expt2-Expt5. The Run has only one coefficient, because we did not put the Expt*Run interaction term in. There are other outputs such as residuals, but analysis of variance table is not in the output. To get the table, you have to use
> am<-aov(Speed~Run+Expt, data=morley) > summary(am)
Df Sum Sq Mean Sq F value Pr(>F)
Run 1 412 412 0.074 0.786222
Expt 4 94514 23629 4.246 0.003336 **
Residuals 94 523098 5565
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
# Numerical integration.
area<-function(f,a,b,eps=1.0e-06,lim=10)
{
fun1<-function(f,a,b,fa,fb, a0,eps, lim, fun)
{
d<-(a+b)/2
h<-(b-a)/4
fd<-f(d)
a1<-h*(fa+fd)
a2<-h*(fd+fb)
if(abs(a0-a1-a2) < eps||lim==0)
return(a1+a2)
else {
return(fun(f,a,d, fa,fd,a1,eps,lim-1,fun)+
fun(f,d,b,fd,fb ,a2,eps,lim-1,fun))
}
}
fa<-f(a)
fb<-f(b)
a0<-((fa+fb)*(b-a))/2
fun1(f,a,b,fa,fb,a0, eps, lim, fun1)
}
f0<-function(x) {
fun00<-(1/sqrt(2*3.14))*exp(-x^2/2)
}
areanorm<-area(f0,-6,1.96)
areanorm
[1] 0.97525
You may also use integrate() (see Functions & Datasets ). integrate(functn,lower_bounds, upper_bounds,minpts,maxpts,eps,passing parameters) minpts, maxpts = minimumn (max) number of function evaluation An example to integrate a beta function
> fbeta<-function(x,alpha,beta){x^(alpha-1)*(1-x)^(beta-1)}
> byIn<-integrate(fbeta,0,1,100,500,0.01,alpha=3.5,beta=1.5)
> byIn
0.1227233 with absolute error < 0.078> byGm<-exp(lgamma(3.5)+lgamma(1.5)-lgamma(5)) # By gamma functions > byGm
[1] 0.1227185> d=byIn-byGm # To print the difference
An error message will show. What happens is that the value byIn assigned to the integral is a string "0.1227233 with absolute error < 0.078". To get the value, you need to look into the detail in help(integrate). It will tell you that the integral value is in value (another in abs,error). Now
> byIn$value-byGm # byIn$val will also work
[1] 4.79645e-06
The description of this procedure is on page 70 of in An Introduction to R. It is to minimize a function (or to maximize the negative of a function).
# Example for MLE (logistic regression) Dobson (pp. 108-110)
x<-c(1.69,1.72,1.76,1.78,1.81,1.84,1.86,1.88)
y<-c( 6, 13, 18, 18, 52, 53, 61, 60)
n<-c( 59, 60, 62, 56, 63, 59, 62, 60)
p<-c(0,0) # define the dimension of p
fn<-function(p){sum(-(y*(p[1]+p[2]*x)-n*log(1+exp(p[1]+p[2]*x))
+log(choose(n,y))))}
out<-nlm(fn,p=c(-50,20),hessian=T)
out
$minimum
[1] 25.6519
$estimate
[1] -60.33438 33.96939
$gradient
[1] -2.987290e-06 -5.480607e-06
$hessian
[,1] [,2]
[1,] 59.74214 106.4543
[2,] 106.45430 189.7951
$code
[1] 2
$iterations
[1] 15
> solve(out$hessian) # inverse of the information matrix
[,1] [,2]
[1,] 30.38361 -17.041884
[2,] -17.04188 9.563904