Day 1: data input, manipulation and output

_______________________________________________________________________________________________

1 Introduction to R

R is an open-source statistical programming language. It is used to manipulate data, to perform statistical analyses, and to present graphical and other results. R consists of a core language, additional ‘packages’ distributed with the R language, and a very large number of packages contributed by the broader community. Packages add specific functionality to an R installation. R has become the primary language of academic statistical analyses, and is widely used in diverse areas of research, government, and industry.

R has several unique features. It has a surprisingly ‘old school’ interface: users type commands into a console; scripts in plain text represent workflows; tools other than R are used for editing and other tasks. R is a flexible programming language, so while one person might use functions provided by R to accomplish advanced analytic tasks, another might implement their own functions for novel data types.

As a programming language, R adopts syntax and grammar that differ from many other languages: objects in R are ‘vectors’, and functions are ‘vectorized’ to operate on all elements of the object; R objects have ‘copy on change’ and ‘pass by value’ semantics, reducing unexpected consequences for users at the expense of less efficient memory use; common paradigms in other languages, such as the ‘for’ loop, are encountered much less commonly in R.
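
The following is a minimal sketch of two of these ideas, vectorized functions and ‘copy on change’. The variable names and values are illustrative assumptions, not part of the course data.

```r
# Vectorized arithmetic: the division applies to every element at once
heights_cm = c(150, 165, 180)       # hypothetical values for illustration
heights_m = heights_cm / 100        # 1.50 1.65 1.80

# The equivalent 'for' loop, which is rarely needed in R
heights_loop = numeric(length(heights_cm))
for (i in seq_along(heights_cm)) {
    heights_loop[i] = heights_cm[i] / 100
}

# 'Copy on change': modifying the copy does not alter the original
original = c(1, 2, 3)
modified = original
modified[1] = 100                   # original is still 1 2 3
```

The vectorized form is shorter and is the style most R code adopts; the loop is shown only for comparison.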

Many authors contribute to R, so there can be a frustrating inconsistency of documentation and interface. R grew up in the academic community, so authors have not shied away from trying new approaches. Common statistical analyses are very well-developed.

1.1 R’s capabilities

1.2 Resources for beginners

_______________________________________________________________________________________________

2 Using RStudio

The RStudio application provides a convenient and flexible environment for your work with R. Figure 1 presents a view of a typical RStudio session.

See http://www.rstudio.com/ide for documentation and downloads for offline work on Linux, Windows and Mac systems.

2.1 Help

R comes with extensive help. In RStudio, click the ‘help’ menu and choose ‘R help’. Important sections include

When not using RStudio, start the help system by typing the command help.start(). The R web site provides many useful links. Once you are comfortable with R, the R-help mailing list can be a very useful source of information, as can general-purpose forums like StackOverflow.
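
Help is also available directly from the console. The sketch below lists a few standard entry points; the search term and the pattern are arbitrary choices made for illustration.

```r
# Help pages open in a viewer, so these calls are shown as comments:
# ?read.delim                  # help page for one function
# help("read.delim")           # the same page, using the function form
# help.search("logarithm")     # search installed help pages by keyword

# apropos() returns the names of objects matching a pattern;
# here, everything on the search path whose name starts with 'read.'
readers = apropos("^read\\.")
head(readers)
```

apropos() is useful when you remember part of a function name but not the whole of it.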

_______________________________________________________________________________________________

3 Simple expressions

R is quite a sophisticated system for data analysis, but that doesn’t mean it is incomprehensible to beginners. Let’s start using it:

  > x = c(2, 4, 3)
  > y = 2 + 2
  > x/y
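
Evaluated line by line, these expressions behave as sketched below; the comments give the values R prints. The calls to length(), sum() and mean() are additions for illustration and are not part of the original example.

```r
x = c(2, 4, 3)   # c() combines values into a numeric vector of length 3
y = 2 + 2        # a single number, 4 (itself a vector of length 1)
x / y            # element-wise division: 0.50 1.00 0.75

# A few functions that summarize a whole vector
length(x)        # 3
sum(x)           # 9
mean(x)          # 3
```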

_______________________________________________________________________________________________

4 Input

Spreadsheet applications and R complement each other in that the former can provide nicely formatted columns, colored headers and convenient scrolling while R provides functional flexibility. You can access Excel worksheets and other table-like data files with R, and use R to write files readable by spreadsheets – R reads and writes tables written in comma- or tab-separated values formats.
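
As a small, self-contained sketch of reading tab-separated data, the example below parses a table supplied as a character string rather than as a file. The column names and values are hypothetical, and the text argument is used only to avoid depending on a file on disk.

```r
# A small tab-separated table, written inline so the example is self-contained.
# The column names and values are hypothetical.
tsv = "id\tsex\tage\n1\tM\t53\n2\tF\t19\n3\tM\tNA"

# read.delim() expects a header row and tab separators by default
patients = read.delim(text = tsv)

dim(patients)      # 3 rows, 3 columns
names(patients)    # "id" "sex" "age"
patients$age       # 53 19 NA
```

For comma-separated files, read.csv() works the same way with a comma as the separator.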

We’ll first read the table ALLannotationFromExcel.txt that contains ALL (acute lymphoblastic leukemia) patient information:

  > filename = file.choose() # Go to the data directory to get the file
  > info = read.delim(filename)
  > ?read.delim

  > names(info)
   [1] "id"             "diagnosis"      "sex"            "age"            "BT"
   [6] "remission"      "CR"             "date.cr"        "t.4.11."        "t.9.22."
  [11] "cyto.normal"    "citog"          "mol.biol"       "fusion.protein" "mdr"
  [16] "kinet"          "ccr"            "relapse"        "transplant"     "f.u"
  [21] "date.last.seen"

  > head(info)
      id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22. cyto.normal
  1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE       FALSE
  2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE       FALSE
  3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA          NA
  4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE       FALSE
  5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE       FALSE
  6 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE       FALSE
           citog mol.biol fusion.protein mdr   kinet   ccr relapse transplant
  1      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE   FALSE       TRUE
  2  simple alt.      NEG           <NA> POS dyploid FALSE    TRUE      FALSE
  3         <NA>  BCR/ABL           p190 NEG dyploid FALSE    TRUE      FALSE
  4      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE    TRUE      FALSE
  5      del(6q)      NEG           <NA> NEG dyploid FALSE    TRUE      FALSE
  6 complex alt.      NEG           <NA> NEG hyperd. FALSE    TRUE      FALSE
                  f.u date.last.seen
  1 BMT / DEATH IN CR           <NA>
  2               REL      8/28/2000
  3               REL     10/15/1999
  4               REL      1/23/1998
  5               REL      11/4/1997
  6               REL     12/15/1997

> summary(info$cyto.normal)

   Mode   FALSE    TRUE    NA's
logical      69      24      34
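
The three counts add up to the 127 rows of the table, so summary() accounts for missing values as well as TRUE and FALSE. The same quantities can be computed directly, as in this sketch on a hypothetical logical vector:

```r
# A hypothetical logical vector with missing values
flag = c(TRUE, FALSE, NA, FALSE, NA, TRUE, FALSE)

sum(is.na(flag))            # number of missing values: 2
sum(flag, na.rm = TRUE)     # number of TRUE values: 2
mean(flag, na.rm = TRUE)    # proportion TRUE among known values: 0.4
```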

∙ Exercise: Read file ALLmetadata.txt from the same directory as before, assigning the name ‘doc’ to the data frame created. How does ‘doc’ relate to ‘info’?

_______________________________________________________________________________________________

5 Manipulation

R doesn’t provide scrollbars like spreadsheet applications do, but you can examine subsets of large objects like the 127x21 info data frame – for example by explicitly giving the rows and columns you want to see:

> info[1:10, 3:4]
     sex age
  1    M  53
  2    M  19
  3    F  52
  4    M  38
  5    M  57
  6    M  17
  7    F  18
  8    M  16
  9    M  15
  10   M  40

> info[1:10, ] # What do these do?
> info[, 3:4]

> head(info[, 3:5])
    sex age BT
  1   M  53 B2
  2   M  19 B2
  3   F  52 B4
  4   M  38 B1
  5   M  57 B2
  6   M  17 B1

> tail(info[, 3:5])
      sex age BT
  122   M  32 T3
  123   M  24 T3
  124   M  37 T3
  125   M  19 T2
  126   M  30 T3
  127   M  29 T2
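
The indexing patterns used so far can be tried on a small table first. This sketch uses a toy data frame whose names and values are hypothetical:

```r
# A toy data frame; names and values are hypothetical
df = data.frame(id = 1:5,
                group = c("a", "b", "a", "c", "b"),
                score = c(10, 20, 30, 40, 50))

df[2:3, ]                   # rows 2 and 3, all columns
df[, 2:3]                   # all rows, columns 2 and 3
df[1:2, c("id", "score")]   # columns can also be selected by name
head(df, 2)                 # first 2 rows
tail(df, 2)                 # last 2 rows
```

Leaving the row or column position empty selects everything along that dimension.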

∙ Exercise: List the first 10 rows of columns ‘sex’, ‘remission’ and ‘date.last.seen’ in data frame info.

> head(info$age)
[1] 53 19 52 38 57 17

> head(info$sex)
[1] M M F M M M
Levels: F M

> info$age[info$age > 21]

   [1] 53 52 38 57 40 33 55 41 27 27 46 37 36 53 39 53 44 28 58 43 48 58 26 32 45 51 57 29
  [29] 32 NA 49 38 26 48 22 47 54 26 47 52 27 52 23 NA 54 25 31 24 23 NA 41 37 54 43 53 50
  [57] 54 53 49 26 22 36 27 50 NA 31 48 40 22 30 22 50 41 40 28 25 31 24 37 23 30 48 22 41
  [85] 52 32 24 37 30 29

> info$sex[info$sex == 'M']

   [1] M    M    M    M    M    M    M    M    M    M    M    M    M    M    M    M    M
  [18] M    M    M    M    M    M    M    M    M    M    <NA> M    M    M    M    M    M
  [35] M    M    M    M    M    M    M    M    M    M    M    M    M    M    M    M    M
  [52] M    M    M    M    M    M    M    M    M    <NA> M    M    M    M    M    M    M
  [69] M    M    M    M    M    M    M    M    M    M    M    M    M    M    M    M    M
  Levels: F M

> info$sex[info$sex == 'M' & !is.na(info$sex)]

   [1] M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
  [44] M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
  Levels: F M
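
The NA entries in the first two results arise because a comparison involving a missing value is itself missing, and indexing with NA returns NA. A minimal sketch on a hypothetical vector shows the effect, along with which() as one alternative to the explicit !is.na() test:

```r
age = c(53, 19, NA, 38)       # hypothetical ages, one missing

age > 21                      # TRUE FALSE NA TRUE
age[age > 21]                 # 53 NA 38: the NA comparison yields an NA element
age[age > 21 & !is.na(age)]   # 53 38
age[which(age > 21)]          # 53 38: which() keeps only the TRUE positions
```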

_______________________________________________________________________________________________

6 Data types

R provides several data types that follow the rules you might expect: numeric values can be used in arithmetic operations, logical values result from TRUE-or-FALSE questions, character data handle text, and factors represent categorical variables in statistical analyses such as ANOVA.
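
The class() function reports the type of any object. The values in this sketch are arbitrary examples chosen for illustration:

```r
class(3.14)                  # "numeric"
class(5 > 3)                 # "logical"
class("leukemia")            # "character"

# A factor stores categorical data as a fixed set of levels
sex = factor(c("M", "F", "M"))
class(sex)                   # "factor"
levels(sex)                  # "F" "M" (sorted alphabetically by default)

# Arithmetic works on numeric data but not on text
1 + 2                        # 3
# "1" + 2                    # error: non-numeric argument to binary operator
as.numeric("1") + 2          # 3, after explicit conversion
```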

> log2(x) # A commonly used transformation: log base 2

> log(x) # What is the base of this logarithm?

> sqrt(x) # What is this transform?

> z = substr('Hi there!', 1, 5)
> z

∙ Exercise: Use the table function to count the number of ‘M’ and ‘F’ patients identified in the info data frame. Does that account for all patients?

_______________________________________________________________________________________________

7 Output

Select a subset of patients with normal cytogenetics and no translocations to write to a separate file and then read with a spreadsheet application:

  > ?write.table
  > idx = with(info, cyto.normal==TRUE & !is.na(cyto.normal))
  > write.table(info[idx,], file='cytoNormal.txt', sep='\t',
  +             row.names=FALSE, quote=FALSE)
  > write.table(info[idx,], file='cytoNormal.csv', sep=',',
  +             row.names=FALSE, quote=FALSE)
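
Writing with quote=FALSE works as long as no field contains the separator character. As a hedged sketch of one alternative, write.csv() quotes character fields by default, so a value containing a comma survives a round trip. The data frame below is a hypothetical stand-in for the selected patients, and tempfile() is used only to keep the example from writing into your working directory.

```r
# A toy data frame standing in for the selected patients (values are hypothetical)
subset_df = data.frame(id = c(1005, 4006),
                       citog = c("t(9;22)", "t(4;11), other"),
                       stringsAsFactors = FALSE)

# tempfile() gives a scratch path so the example does not clutter the working directory
csv_path = tempfile(fileext = ".csv")

# write.csv() quotes character fields by default, so the embedded comma is safe
write.csv(subset_df, file = csv_path, row.names = FALSE)

# Read the file back and confirm that the table survived the round trip
round_trip = read.csv(csv_path, stringsAsFactors = FALSE)
identical(dim(round_trip), dim(subset_df))   # TRUE
```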