R: Basics of R objects; entering and manipulating data

http://www.psychol.cam.ac.uk/statistics/R/enteringdata.html

Very basic things:
x <- 1              # enters a single number into x (<- and = are alternative/synonymous assignment operators)
v <- c(1,2,3,4,5)   # c for combine/concatenate; this creates a one-dimensional vector
temp <- v>3         # will make temp equal to the logical vector c(FALSE, FALSE, FALSE, TRUE, TRUE) by performing the comparison on each element of v
z <- c(1:3,NA)      # the fourth element of z is the "missing" value, NA (not available)
is.na(z)            # returns c(FALSE, FALSE, FALSE, TRUE)
m <- matrix( c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=FALSE)
# byrow=FALSE is the default, meaning that data go into the matrix filling up one column top to bottom before starting the next.
# This makes m the following 2x3 matrix:
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Using sequences and repetition:
1:9                        # same as c(1,2,3,4,5,6,7,8,9)
1.5:10                     # same as c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5)
seq(1,5)                   # same as 1:5
seq(1,5,by=0.5)            # gives 1.0, 1.5, 2.0, ... 5.0
rep(5,10)                  # repeats the value 5 ten times
rep(c("A","B","C"),2)      # same as c("A", "B", "C", "A", "B", "C")
matrix(rep(0,16),nrow=4)   # gives a 4x4 matrix of zeroes
matrix()                   # gives a blank matrix (1x1 containing the missing value NA)
# To concatenate strings, use paste():
paste("A", "B", "C", sep="")
# This gives the single item "ABC" (with the default separator, a space, you would get "A B C"),
# whereas c("A","B","C") gives a vector of three items.
# You can also get fancy:
labels <- paste(c("X","Y"), 1:10, sep="")
# same as labels <- c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")
# Note that the short vector of two items is recycled until it's as long as the longest vector (of 10 items).
Addressing vectors and matrices:
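v[4]            # 4th element of vector v (note: first element is element 1)
v[c(2,4)]       # second and fourth element of v
# In the matrix examples, m stands for any matrix with at least 4 rows and 4 columns
# (the 2x3 matrix m created above is too small for these indices).
m[,4]           # 4th column of matrix m
m[4,]           # 4th row of matrix m
m[2:4,1:3]      # rows 2-4 of columns 1-3
v[!is.na(v)]    # all elements of v that are not NA, in the same order as before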
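The addressing forms above can be tried on objects that are large enough for every index. This is a minimal sketch; the names v2 and m4 and their contents are made up for illustration.

```r
# Hypothetical example objects (not from the original page)
v2 <- c(10, NA, 30, 40, NA)
m4 <- matrix(1:16, nrow=4)    # 4x4 matrix, filled column by column

fourth  <- v2[4]              # a single element
picked  <- v2[c(2, 4)]        # second and fourth elements
col4    <- m4[, 4]            # 4th column, returned as a plain vector
row4    <- m4[4, ]            # 4th row, returned as a plain vector
block   <- m4[2:4, 1:3]       # rows 2-4 of columns 1-3 (a 3x3 matrix)
nonmiss <- v2[!is.na(v2)]     # the non-NA elements, in their original order
```

Note that selecting a single row or column drops the matrix structure; m4[, 4, drop=FALSE] would keep the result as a one-column matrix.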
Factors:
grouplist <- c("sham", "lesion", "sham", "sham", "lesion", "sham", "lesion")
groupfactor <- factor(grouplist)   # makes a factor (the same length as grouplist, but marked as a factor)
levels(groupfactor)                # will produce "lesion", "sham" (levels are sorted alphabetically by default)
tapply(var, fac, func)             # applies function "func" to all elements of "var", grouped by "fac"
# example:
#   a <- c(1,2,3,4,5,6,7,8,9)
#   f <- factor(c("a","a","a","b","b","b","c","c","c"))
#   tapply(a, f, mean)
# Factors can be given names (labels) as well as their original numeric values:
# ... Suppose variable v1 is coded 1, 2 or 3, and we want to attach value labels 1=red, 2=blue, 3=green:
mydata$v1 <- factor(mydata$v1, levels = c(1,2,3), labels = c("red", "blue", "green"))
# ... Or, to make the factor ordered (i.e. R knows there's an order to the levels):
mydata$v1 <- ordered(mydata$y, levels = c(1,3, 5), labels = c("Low", "Medium", "High"))
# Use factors for nominal data, and ordered factors for ordinal data.
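To see how level order and ordered factors behave in practice, here is a small sketch. It reuses grouplist from above; the variable size and its codes are invented for illustration.

```r
grouplist <- c("sham", "lesion", "sham", "sham", "lesion", "sham", "lesion")

# Specify levels explicitly to control their order (the default is alphabetical)
groupfactor <- factor(grouplist, levels=c("sham", "lesion"))
counts <- table(groupfactor)       # number of observations at each level

# An ordered factor supports comparisons between its levels
size <- ordered(c(1, 5, 3), levels=c(1, 3, 5), labels=c("Low", "Medium", "High"))
first_is_smaller <- size[1] < size[2]   # compares "Low" with "High"
```

Level order matters for more than display: many modelling functions treat the first level as the reference category.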
Lists:
mylist <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
# can now use mylist[[1]] to get "Fred"
# ... or mylist$name to get "Fred"
# ... or mylist$child.ages[2] to get 7
Other:
attributes(object)   # views the attributes of "object"
attr(object, name)   # gets/sets a specific attribute (by name) of object "object"
Data frames:
mydata <- data.frame(x = c(20,35,45,55,70), n = rep(50,5), y = c(6,17,26,37,44))
# mydata is a data frame. It is like a matrix with three column variables (mydata$x, mydata$n, mydata$y).
# A data frame is like a matrix in which the columns may be of different types (e.g. numerical variable, factor, text).
# Lots of R tests use data frames.
# Let's look at it:
mydata
   x  n  y
1 20 50  6
2 35 50 17
3 45 50 26
4 55 50 37
5 70 50 44
attributes(mydata)
$names
[1] "x" "n" "y"

$row.names
[1] 1 2 3 4 5

$class
[1] "data.frame"
We can make the names prettier:
attr(mydata, "row.names") <- c("pointone","pointtwo","pointthree","pointfour","pointfive")
mydata
            x  n  y
pointone   20 50  6
pointtwo   35 50 17
pointthree 45 50 26
pointfour  55 50 37
pointfive  70 50 44
A few more handy functions:
names(mydata)                   # Lists the names of variables within mydata (also visible with attributes(), as above).
str(mydata)                     # Shows the structure ("5 obs[ervations] of 3 variables... x is ... f is a factor with 2 levels..." etc.)
levels(factor)                  # Shows the levels of a factor
dim(object)                     # Shows an object's dimensions
class(object)                   # Shows an object's class (e.g. numeric, matrix, data.frame)
head(mydata, n=3)               # Shows the first 3 rows
tail(mydata, n=3)               # Shows the last 3 rows
subset(mydata, x>=45)           # Pick a subset (rows where x is at least 45)
subset(mydata, x!=45, c(x,y))   # ... or another subset (rows where x is not 45, keeping only columns x and y)
# There are lots of things you can do with this command; see ?subset.
It's easy to sort data frames and to create new variables based on existing ones. See Quick-R / Data Management.
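As a minimal sketch of both operations, the following rebuilds the mydata frame from above so that it runs on its own. The column name prop is a made-up example.

```r
mydata <- data.frame(x = c(20,35,45,55,70), n = rep(50,5), y = c(6,17,26,37,44))

# Sorting: order() returns the row indices that put y in descending order
sorted <- mydata[order(mydata$y, decreasing=TRUE), ]

# New variable from existing ones (here, the proportion y/n)
mydata$prop <- mydata$y / mydata$n
```

order() accepts several keys, e.g. order(mydata$n, mydata$x), to break ties in the first key using the second.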
Making a data set visible on the main search path:
attach(my.dataset)   # we now don't need to use my.dataset$X, my.dataset$Y; we can just use X and Y directly
# Note that to change variables in the dataset, you still need to assign to my.dataset$var
# (otherwise a new variable called var is created that simply "overlies" the dataset).
# By the way, get used to the R convention: my.dataset is just a variable name; the dot doesn't mean anything special.
search()             # shows the current search path (will now include my.dataset)
detach(my.dataset)   # when we've finished with it
# Another way, which has no residual effects:
with(my.dataset, {
    # do stuff...
})
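A small sketch of the with() approach, assuming a made-up my.dataset with columns X and Y (the data and the new column Z are illustrative only):

```r
my.dataset <- data.frame(X = c(1, 2, 3), Y = c(10, 20, 30))   # hypothetical data

# with() evaluates the expression using the columns of my.dataset,
# without adding anything to the search path
total <- with(my.dataset, sum(X * Y))

# To change or add a column, still assign to my.dataset$var explicitly
my.dataset$Z <- with(my.dataset, X + Y)
```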
Other important object manipulation functions:
ls()    # list all objects (if you know UNIX, this will be familiar)
rm(x)   # removes object "x" (if you know UNIX, this will be familiar)
Typing stuff in; note also that filenames and URLs are often interchangeable:
x <- scan()                    # type in numbers, separated by spaces or newlines; hit Enter twice to finish
x <- scan(filename)            # do the same but reading from a file on disk
x <- scan("http://www...")     # the same, but from a URL (live)
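For a non-interactive try-out, scan() can also read from a character string through its text argument. A minimal sketch with made-up input:

```r
# quiet=TRUE suppresses the "Read N items" message
nums  <- scan(text="3 1 4 1 5", quiet=TRUE)                   # numeric by default
words <- scan(text="sham lesion sham", what="", quiet=TRUE)   # what="" reads character strings
```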
Editing a variable, matrix, or data frame:
y <- edit(x)
fix(x)         # equivalent to x <- edit(x)
In the R Commander, you can click the Data set button to select a data set, and then click the Edit data set button.
For more advanced data manipulation in R Commander, explore the Data menu, particularly the Data / Active data set and Data / Manage variables in active data set menus.
Often, you need to transform data between "wide" format (e.g. one row per subject; multiple observations/column per subject) and "long" format (one observation per row). R uses "long" format for most analyses. There are several methods; reshape is powerful.
See ?reshape for full details. But let's glance at long-to-wide transformation:
Indometh   # one of the built-in R datasets. It's in "long" format.
   Subject time conc
1        1 0.25 1.50
2        1 0.50 0.94
3        1 0.75 0.78
4        1 1.00 0.48
5        1 1.25 0.37
6        1 2.00 0.19
7        1 3.00 0.12
8        1 4.00 0.11
9        1 5.00 0.08
10       1 6.00 0.07
11       1 8.00 0.05
12       2 0.25 2.03
13       2 0.50 1.63
14       2 0.75 0.71
...
# Let's reshape it as follows. Key things:
# We start with one observation per row. We want to group them together by some variable that identifies an individual (group of observations).
# 1. Keep SUBJECT as the identifying variable, one per row (idvar)
# 2. Columns are labelled with TIME (timevar)
# 3. VALUES of CONC are spread "wide" (v.names).
# If v.names are not specified, all variables apart from idvar and timevar are assumed to vary, and are spread wide.
# Any gaps will be filled by "NA" values.
wide <- reshape(Indometh, v.names="conc", idvar="Subject", timevar="time", direction="wide")
wide
   Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3 conc.4 conc.5 conc.6 conc.8
1        1      1.50     0.94      0.78   0.48      0.37   0.19   0.12   0.11   0.08   0.07   0.05
12       2      2.03     1.63      0.71   0.70      0.64   0.36   0.32   0.20   0.25   0.12   0.08
23       3      2.72     1.49      1.16   0.80      0.80   0.39   0.22   0.12   0.11   0.08   0.08
34       4      1.85     1.39      1.02   0.89      0.59   0.40   0.16   0.11   0.10   0.07   0.07
45       5      2.05     1.04      0.81   0.39      0.30   0.23   0.13   0.11   0.08   0.10   0.06
56       6      2.31     1.44      1.03   0.84      0.64   0.42   0.24   0.17   0.13   0.10   0.09
long <- reshape(wide, direction="long")   # reverses the effect completely (by using information stored within wide about the original reshaping)
And an example of a more complex long-to-wide transformation: creating a fictional data frame with one between-subject factor (A) and two within-subject factors (U, V), in long format, and then reshaping it whilst controlling the resulting column names carefully:
# First, make up a fictional dataset:
s = 10; levels_U = 3; levels_V = 2; levels_A = 2;
mean_U1V1A1 = 5; mean_U2V1A1 = 6; mean_U3V1A1 = 5.5;
mean_U1V2A1 = 7; mean_U2V2A1 = 8; mean_U3V2A1 = 7.5;
mean_U1V1A2 = 5; mean_U2V1A2 = 6; mean_U3V1A2 = 10.5;
mean_U1V2A2 = 7; mean_U2V2A2 = 8; mean_U3V2A2 = 10.5;
noise_sd = 1;
data9 = data.frame(
    S = paste("S", rep(1:(s*levels_A), each=levels_U*levels_V, times=1), sep=""),
    U = rep(paste("U", 1:levels_U, sep=""), each=1, times=s*levels_V*levels_A),
    V = rep(paste("V", 1:levels_V, sep=""), each=levels_U, times=levels_A*s),
    A = rep(paste("A", 1:levels_A, sep=""), each=levels_U*levels_V*s, times=1),
    depvar = c(
        rep( c(mean_U1V1A1, mean_U2V1A1, mean_U3V1A1, mean_U1V2A1, mean_U2V2A1, mean_U3V2A1), each=1, times=s),
        rep( c(mean_U1V1A2, mean_U2V1A2, mean_U3V1A2, mean_U1V2A2, mean_U2V2A2, mean_U3V2A2), each=1, times=s)
    ) + rnorm(s*levels_U*levels_A*levels_V, mean=0, sd=noise_sd)   # one noise value per row (120 rows)
)
head(data9)
   S  U  V  A   depvar
1 S1 U1 V1 A1 4.262872
2 S1 U2 V1 A1 6.201466
3 S1 U3 V1 A1 6.593957
4 S1 U1 V2 A1 6.482112
5 S1 U2 V2 A1 6.894665
6 S1 U3 V2 A1 7.686345
# Handy function to strip prefixes from the column names (of cosmetic value only!) since the reshape process will add a prefix:
colnames_removing_prefix <- function(df, prefix) {
    names <- colnames(df)
    indices <- (substr(names,1,nchar(prefix))==prefix)   # TRUE for columns whose name starts with the prefix
    names[indices] <- substr(names[indices], nchar(prefix)+1, nchar(names[indices]))
    return(names)
}
# Now, reshape our data frame:
data9$compositewsvar = factor(paste(data9$U, data9$V, sep=""))
data9wide <- reshape(data9, v.names="depvar", idvar="S", timevar="compositewsvar", drop=c("U","V"), direction="wide")   # reshape to wide format
colnames(data9wide) <- colnames_removing_prefix(data9wide, "depvar.")
head(data9wide)
    S  A     U1V1     U2V1     U3V1     U1V2     U2V2     U3V2
1  S1 A1 4.262872 6.201466 6.593957 6.482112 6.894665 7.686345
7  S2 A1 4.429626 6.334341 5.483475 6.945418 8.714208 7.750707
13 S3 A1 5.634549 5.350344 6.694130 5.411150 8.461070 8.738355
19 S4 A1 6.573824 6.171209 3.898931 6.334141 7.658620 7.547592
25 S5 A1 4.377890 6.585668 5.870754 6.541264 8.790811 5.967742
31 S6 A1 4.578090 4.735252 5.940648 5.163605 6.803640 8.000925
And wide-to-long transformation:
dfwide <- data.frame(id=1:4, age=c(40,50,60,50), dose1=c(1,2,1,2), dose2=c(2,1,2,1), dose4=c(3,3,3,3))
dfwide
  id age dose1 dose2 dose4
1  1  40     1     2     3
2  2  50     2     1     3
3  3  60     1     2     3
4  4  50     2     1     3
# Key things:
# We start with one individual per row. Some variables represent observations that we want to regroup into a single variable
# holding the observation, plus other variable(s) describing what sort of observation is on that row.
# 1. We say which columns need to be regrouped (varying; in this case columns 3:5).
# 2. By default, the "label" column that's created is called time.
# 3. By default, the program assumes that the current column labels take the form "x.1", "x.2", "y.1", "y.2".
#    In this case (separator as "."), columns labelled "x" and "y" will be created, with "time" values of 1, 2, etc.
#    In this example, we use sep="" instead to show that the number follows the alphanumeric part directly.
long <- reshape(dfwide, direction="long", varying=3:5, sep="")
long
    id age time dose
1.1  1  40    1    1
2.1  2  50    1    2
3.1  3  60    1    1
4.1  4  50    1    2
1.2  1  40    2    2
2.2  2  50    2    1
3.2  3  60    2    2
4.2  4  50    2    1
1.4  1  40    4    3
2.4  2  50    4    3
3.4  3  60    4    3
4.4  4  50    4    3
See ?reshape for further ways to control the process.
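The default naming convention described in the comments above ("x.1", "x.2", "y.1", "y.2" with "." as the separator) can be illustrated with a small sketch. The data frame dfwide2 and its values are invented for this example.

```r
# Hypothetical wide data: two variables (x, y), each measured at times 1 and 2
dfwide2 <- data.frame(id=1:2, x.1=c(5,6), x.2=c(7,8), y.1=c(1,2), y.2=c(3,4))

# With the default sep=".", reshape() splits each column name into a
# variable name ("x" or "y") and a time value (1 or 2)
long2 <- reshape(dfwide2, direction="long", varying=2:5)
```

The result has one row per id and time, with a single x column and a single y column alongside the generated time column.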