Introduction to R
R Basics - Data Types
R is a dynamically typed language. In some languages, like C++ or Java you have to tell the program if the variable is intended to be an integer, (1), a floating point value (3.14), string (“a string”) etc. Languages like R infer the type for you. This is both good and bad. You don’t have to worry about getting it wrong. However, sometimes it leads to weird problems, one of which we’ll see below. Also, there is a slight reduction in performance. In practice, the weird problems are rare enough and the reduction is slight enough that you probably won’t notice them.
Despite not having to tell R what the data type is, you still need to know about data types. Different data types can be worked with in different ways, have different uses and limitations. Core R has a fair number of data types, a couple are unique to R. We won’t cover them here, but there are even more that have been developed and available as R packages.
Scalars
Scalars are the most basic type. Essentially a fancy name for a number of some sort. R also converts from an integer to a floating point value as needed.
> a <- 1 # an integer
> b <- 2.0 # a float
> c <- 2 # another integer
> print(a/b) # Dividing integer by an integer
[1] 0.5 # Returns a float
Vectors
Vectors are just what they sound like. A one dimensional array of the same data type. There are lots of ways to create vectors. Below are some examples of creating and working with vectors. Vectors are very common, so we’ll spend a little time on them.
# The c() function is very straight forward
> vector_one <- c(0, 1, 1, 2)
> print(vector_one)
[1] 0 1 1 2
> vector_two <- c(3, 5, 8)
> print(vector_two)
[1] 3 5 8
# Combining vectors
> vector_three <- c(vector)
> print(vector_three)
[1] 0 1 1 2 3 5 8
# More vectors
> vector_char <- c('one', 'of', 'these', 'days')
> print(vector_char)
[1] "one" "of" "these" "days"
> vector_string <- c('be', 'careful', 'with', 'that', 'axe', 'eugene')
> print(vector_string)
[1] "be" "careful" "with" "that" "axe" "eugene"
> vector_four <- c(vector_char, vector_string)
> print(vector_four)
[1] "one" "of" "these" "days" "be" "careful" "with" "that" "axe" "eugene"
# You cant mix types, weird things can happen.
> bad_vector <- c(vector_one, vector_string)
# Numbers become strings
> print(bad_vector)
[1] "0" "1" "1" "2" "be" "careful" "with" "that" "axe" "eugene"
Creating a vector one element at the time is tedious. Often you will use a series of values shown below.
# Creating a series with step one
> series_one <- c(1:20)
> print(series_one)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Creating series with different step sizes
> sequence_half_step <- seq(1,10, 0.5)
> print(sequence_half_step)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
> sequence_two_step <- seq(1,10, 2)
> print(sequence_two_step)
[1] 1 3 5 7 9
Vectors are really useful because many of R’s functions work directly on the vector. Here are some examples. Don’t worry too much about the details of the functions.
# Sum the numbers
> sum(vector_three)
[1] 20
> mean(vector_three)
[1] 2.857143
> paste(vector_four, collapse=", ")
[1] "one, of, these, days, be, careful, with, that, axe, eugene"
Members of a vector can be accessed using the [index]
index notation. If you have
worked with vector like data types in other languages, you may be used to
accessing the first item using the 0th index. R uses 1 instead.
> vector_string[1]
[1] "be"
> vector_string[3]
[1] "with"
You can get multiple indexes by supplying a vector as the index.
> vector_string[c(1,3,5)]
[1] "be" "with" "axe"
Matrixes
Matrices are are kind of like two dimensional vectors. Like vectors, all of the members must be of the same type. The work mostly like vectors so we’ll just touch on it briefly here.
Creating a matrix. By default matrixes are filled from the top left, working down the column then right to the next column. That can be changed as shown. Also, if the vector is shorter than the matrix has elements, it just starts over from the beginning of the matrix.
> matrix_1 <- matrix(1:20, nrow=5,ncol=4)
> print(matrix_1)
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
# By rows in instead
> matrix_2 <- matrix(1:20, nrow=5, ncol=4, byrow = TRUE)
> print(matrix_2)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
# You get a warning, but this still works.
> matrix_3 <- matrix(1:7, nrow=5, ncol=4)
Warning message:
In matrix(1:7, nrow = 5, ncol = 4) :
data length [7] is not a sub-multiple or multiple of the number of rows [5]
> print(matrix_3)
[,1] [,2] [,3] [,4]
[1,] 1 6 4 2
[2,] 2 7 5 3
[3,] 3 1 6 4
[4,] 4 2 7 5
[5,] 5 3 1 6
Accessing parts of a matrix is very similar to a vector.
# Getting one member
> matrix_1[1,3]
[1] 11
# Getting a column
> matrix_1[,3]
[1] 11 12 13 14 15
# Getting a row
> matrix_1[2,]
[1] 2 7 12 17
# Getting a sub matrix. This is also a matrix.
> matrix_1[2:4,1:3]
[,1] [,2] [,3]
[1,] 2 7 12
[2,] 3 8 13
[3,] 4 9 14
Arrays
Arrays work pretty much like matrixes, but they can be multidimensional.
# Array creation is pretty much like matrix creation.
> array_one <- array(c(1:60), dim=c(4,3,5))
> array_one
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
, , 2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24
# Print out continues...
# ... its pretty long sometimes, one slice at a time.
# This is a three dimensional array 4x3x5 so we can access it as follows
> array_one[2,3,4]
[1] 46
> array_one[2,,]
[,1] [,2] [,3] [,4] [,5]
[1,] 2 14 26 38 50
[2,] 6 18 30 42 54
[3,] 10 22 34 46 58
> array_one[2,,4]
[1] 38 42 46
Lists
A collection of things that are all of the same type is useful, but sometimes you want to collect things of different types. For this, R gives us lists.
# Creating a list, notice we have lots of types, even a vector.
# could be another list even.
> my_list <- list("Random", "Citizen", 70, c("Stochastic", "Chance"))
> my_list
[[1]]
[1] "Random"
[[2]]
[1] "Citizen"
[[3]]
[1] 70
[[4]]
[1] "Stochastic" "Chance"
You can name the items in a list as well. This can greatly improve the transparency of your code (along with renaming the variable).
> person_data <- list(first_name="Random", last_name="Citizen",
weight_kg="70", kids=c("Stochastic", "Chance"))
> person_data
$first_name
[1] "Random"
$last_name
[1] "Citizen"
$weight_kg
[1] "70"
$kids
[1] "Stochastic" "Chance"
Accessing parts of a list is a little different than vectors, matrixes or arrays.
We have to double square bracket it [[]]
.
> person_data[[2]]
[1] "Citizen"
# The power of naming again...more readable
> person_data[["weight_kg"]]
[1] "70"
# Even more readable using this notation.
> person_data$first_name
[1] "Random"
Data Frames
Data frames are probably the most commonly used data type used in R. They are essentially the “spreadsheet” of R, equivalent to datasets in other statistical packages like SAS or SPSS. When you import a CSV or Excel file it will most like be as a data frame. The “columns” of a data frame are essentially vectors, so a data frame can be thought of as a list of vectors. The only constraint is that all the vectors (columns) are the same length.
# creating a data frame
> a <- c("Random", "General", "Person", "Robert", "Sideshow")
> b <- c("Citizen", "Disarray", "Of Interest", "Terwilliger", "Bob")
> c <- c(35, 20, 32, 55, 55)
> d <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
> my_data <- data.frame(a,b,c,d)
> my_data
a b c d
1 Random Citizen 35 FALSE
2 General Disarray 20 TRUE
3 Person Of Interest 32 FALSE
4 Robert Terwilliger 55 TRUE
5 Sideshow Bob 55 TRUE
# Again naming is important. Notice the original vector names are used.
> persons <- my_data
> names(persons) <- c("first_name", "last_name", "age", "villain")
> persons
first_name last_name age villain
1 Random Citizen 35 FALSE
2 General Disarray 20 TRUE
3 Person Of Interest 32 FALSE
4 Robert Terwilliger 55 TRUE
5 Sideshow Bob 55 TRUE
Accessing parts of the data frames can be accomlished in a variety of ways.
# Getting a column
persons[1]
first_name
1 Random
2 General
3 Person
4 Robert
5 Sideshow
# Getting a column a different way. More about levels in a bit.
> persons[[1]]
[1] Random General Person Robert Sideshow
Levels: General Person Random Robert Sideshow
# Getting a column a third way
> persons$age
[1] 35 20 32 55 55
# Getting a row of data
> persons[3,]
first_name last_name age villain
3 Person Of Interest 32 FALSE
# Getting a value
> persons[3,3]
[1] 32
# Getting a value another way
> persons$age[3]
[1] 32
# Getting a value the most readable way
> persons$age[which(persons$last_name == 'Terwilliger')]
[1] 55
Factors
Finally we have factors. When creating a dataframe R tends to assume strings are factors, or categories and will create them for you. Likewise, R assumes strings in vectors, matrixes and arrays are strings. These default behaviors are useful until they are not. Fortunately you can control it.
# Creating a factor from a vector of strings.
> gender <- c("male", "female", "male", "male", "female")
> gender
[1] "male" "female" "male" "male" "female"
> gender <- factor(gender)
> gender
[1] male female male male female
Levels: female male
> summary(gender)
female male
2 3
# Names aren't useful factors so we can change that.
# Conversts first names to character, not factor
> persons$first_name <- as.character(persons$first_name)
> persons$first_name
[1] "Random" "General" "Person" "Robert" "Sideshow"
# Last names are still factors.
> persons$last_name
[1] Citizen Disarray Of Interest Terwilliger Bob
Levels: Bob Citizen Disarray Of Interest Terwilliger
Supporting Files
Page 4 of 9