Data Analysis 2

Loading And Saving Data

We’ll start off with learning how to load and save data. We’ll do it a bit backwards and save a built in data set first, then load it again. We will use some United States state data from 1977 state.x77.

    > state_data <- state.x77
    > View(state_data)
    > help("state.x77")

According to the help file, the data is in matrix form. I prefer data frames so we’ll convert it to to one.

    > state_data <- data.frame(state_data)
    > View(state_data)

Saving, or writing this data to your computer is easy. Here are a few options.

    ### Export it to a file in different ways

    # Write to a comma separated file, usually preferred
    write.csv(state_data, "state_data.csv")
    
    # Substitutes commas for decimals in number and semicolons for commas
    # as a delimiter, useful for Europeans
    write.csv2(state_data, "state_data_eur.csv")
    
    # You can specify your own delimiters too
    write.table(state_data, "state_data_tab.txt", sep="\t")

Importing data is usually as easy. However, in practice you will often find data that is poorly or inconsistently formatted, this can take some time to fix.

    # Notice R adds an X for the state name in some instances.
    
    > new_state_data <- read.csv("state_data.csv")
    > head(new_state_data)
                   X Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
        1    Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708
        2     Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
        3    Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417
        4   Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
        5 California      21198   5114        1.1    71.71   10.3    62.6    20 156361
        6   Colorado       2541   4884        0.7    72.06    6.8    63.9   166 103766
    > eur_state_data <- read.csv2("state_data_eur.csv")
    > head(eur_state_data)
                   X Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
        1    Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708
        2     Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
        3    Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417
        4   Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
        5 California      21198   5114        1.1    71.71   10.3    62.6    20 156361
        6   Colorado       2541   4884        0.7    72.06    6.8    63.9   166 103766
    > tab_state_data <- read.table("state_data_tab.txt", sep = "\t")
    > head(tab_state_data)
                   Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
        Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
        Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
        Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
        Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
        California      21198   5114        1.1    71.71   10.3    62.6    20 156361
        Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

More Analyses

Correlations and Plot Matrixes

There are a number of variables in this data. A common thing to do is to create a correlation matrix among them. A correlation measures a linear relationship between two variables. A positive correlation suggests that as one parameter goes up so does the other. A negative correlation suggests that as one parameter increases the other decreases. We measure the strength of the correlation with a correlation coefficient R, no relation to the statistical language. The correlation coefficient varies from -1 to 1. A coefficient of zero means there is no relationship between the two variables. The closer the coefficient is to -1 or 1, the stronger the relationshop. A value of -1 or 1 would have all the points in a line. We can compute them all at once using the cor() function.

    > cor(state_data)
                    Population     Income  Illiteracy    Life.Exp     Murder     HS.Grad      Frost        Area
        Population  1.00000000  0.2082276  0.10762237 -0.06805195  0.3436428 -0.09848975 -0.3321525  0.02254384
        Income      0.20822756  1.0000000 -0.43707519  0.34025534 -0.2300776  0.61993232  0.2262822  0.36331544
        Illiteracy  0.10762237 -0.4370752  1.00000000 -0.58847793  0.7029752 -0.65718861 -0.6719470  0.07726113
        Life.Exp   -0.06805195  0.3402553 -0.58847793  1.00000000 -0.7808458  0.58221620  0.2620680 -0.10733194
        Murder      0.34364275 -0.2300776  0.70297520 -0.78084575  1.0000000 -0.48797102 -0.5388834  0.22839021
        HS.Grad    -0.09848975  0.6199323 -0.65718861  0.58221620 -0.4879710  1.00000000  0.3667797  0.33354187
        Frost      -0.33215245  0.2262822 -0.67194697  0.26206801 -0.5388834  0.36677970  1.0000000  0.05922910
        Area        0.02254384  0.3633154  0.07726113 -0.10733194  0.2283902  0.33354187  0.0592291  1.00000000

We can also plot a matrix. There are lots of ways to do this. Here is one, found after a quick web search on “R correlation plot matrix”. It is a good practice to research what you want to do before you do it. Often most of the hard work will be done for you. This method arranges and color codes the boxes by the strength of the correlation. The result is shown in Figure 1.

   # install and load "gclus" first.
   
   # Use the help() function to learn details about these functions
    # Need to get absolute values of correlations for sorting 
    state_correlations <- abs(cor(state_data))
    
    # dmat.color creates nice color palettes
    state_colors <- dmat.color(state_correlations)
    ordered_correlations <- order.single(state_correlations)
    
    # From gclus creates a matrix of the pairs. Since the middle
    # row will all be ones and not very interesting, it uses the names
    cpairs(
      state_data,
      ordered_correlations,
      panel.colors = state_colors,
      gap = 0.5,
      main="Correlations Bewteen Population Variables"
    )
    dev.copy(png, "corr_matrix.png")
    dev.off()

Correlation plot matrix of state data — Figure 1. A plot matrix showing the correlations between the variable in the state.x77 data. The plots are arranged with the strongest correlations in the center and color coded so that the red are strongest and the yellow are the weakest correlations.

Looking at our correlations we see murder rate is associated with a few things, illiteracy, life expectancy, high school graduation rate and income. There are a lot of things we can do here. We’ll get a bit more advanced but not too fancy.

Regressions

Regressions appear to be like correlations but there is a subtle and very important difference. Correlations measure a relationship, but not a causal relationship. So from our data, some things make sense as a causal relationship, others don’t. There is a relationship between murder and frost for instance, but it is hard to imagine that murders cause frost to change, but the reverse could be true. The same is true of illiteracy and murder and income and murder. We will treat murder as our dependent variable, the variable that depends on the other independent variables. Starting simply with one variable we’ll work up.

    > reg_one <- lm(murder_rate ~ illiteracy, data = state_data)
    > summary(reg_one)
    
    Call:
    lm(formula = murder_rate ~ illiteracy, data = state_data)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -5.5315 -2.0602 -0.2503  1.6916  6.9745 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   2.3968     0.8184   2.928   0.0052 ** 
    illiteracy    4.2575     0.6217   6.848 1.26e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 2.653 on 48 degrees of freedom
    Multiple R-squared:  0.4942,	Adjusted R-squared:  0.4836 
    F-statistic: 46.89 on 1 and 48 DF,  p-value: 1.258e-08

This should look similar to our last analysis. The primary thing we are interested in is the illiteracy coefficient, specifically Pr(>|t|). This is our p value. We see that it is less than 0.05 so we can accept the idea that the murder rate is dependent on the rate of illiteracy among states. If it were greater than 0.05, we would conclude illiteracy has no effect. We do not typically worry too much about the intercept, although here we can surmize that if everyone could read, there will still be a murder rate of 2.3938 murders per 100,000 people and that this estimate is statistically greater than zero.

We easily plot this as shown in Figure 2.

    > plot(state_data$illiteracy, state_data$murder_rate, 
                        xlab="Illiteracy (1970, percent of population)", 
                        ylab="Murder Rate(1976, per 100,000)")
                        
      # Plots the line derived form our regression using the intercept and 
      # slope (y = mx + b).
    > abline(reg_one)

Our final level of complexity is to add another variable, mean number of days below freezing. This part isn’t actually that hard.

Graphically we’ll also use another package, ggplot2. This is a widely used package so you should know about. Don’t worry about understanding everything at this point. It takes some practice and exposure. There are several advantages to ggplot2. First, it takes a modular approach. You can add (and subtract) elements to your graphics using what ggplot calls aesthetics (aes()) using a plus sign. This makes for cleaner more readable code. Also, it enforces a common interface among all the functions in the package, something that is lacking in built in R graphical functions. It does take some getting used to though, and there is nothing inherently wrong with the built in packages.

First lets regress murder rate and frost.

    # First lets see what frost alone looks like.
    > reg_frost <- lm(murder_rate ~ frost, data = state_data)
    > summary(reg_frost)
    
    Call:
    lm(formula = murder_rate ~ frost, data = state_data)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -5.8510 -2.6336 -0.2825  2.2983  7.3191 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 11.375689   1.005494  11.314 3.83e-15 ***
    frost       -0.038270   0.008635  -4.432 5.40e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 3.142 on 48 degrees of freedom
    Multiple R-squared:  0.2904,	Adjusted R-squared:  0.2756 
    F-statistic: 19.64 on 1 and 48 DF,  p-value: 5.405e-05

We see a significant relationship! It would suggest that murder rate decreases in colder climates. However, lets do a multiple regression as shown below. The code and interpretation are fairly straight forward.

    > multi_reg <- lm(murder_rate ~ illiteracy + frost, data = state_data)
    > summary(multi_reg)
    
        Call:
        lm(formula = murder_rate ~ illiteracy + frost, data = state_data)
        
        Residuals:
            Min      1Q  Median      3Q     Max 
        -5.2732 -1.7953 -0.2699  1.7013  7.3633 
        
        Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
        (Intercept)  3.873964   1.880853   2.060    0.045 *  
        illiteracy   3.763897   0.841564   4.473 4.88e-05 ***
        frost       -0.008613   0.009868  -0.873    0.387    
        ---
        Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
        
        Residual standard error: 2.659 on 47 degrees of freedom
        Multiple R-squared:  0.5022,	Adjusted R-squared:  0.4811 
        F-statistic: 23.71 on 2 and 47 DF,  p-value: 7.585e-08

Now we see that while illiteracy is still a significant predictor of murder rate, days below freezing is not. The multiple regression is accounting for the relationship between illiteracy and frost, and once that is removed, only illiteracy is significant. It would appear that colder states are just more literate. We can leave it to you to do that regression.

Visualization with GGPlot2

Now we will use ggplot2 to create a visualization for this analysis. Of course there are multiple ways to display this data, so feel free and explore. We’ll take a step-by-step approach, that is not only a good workflow with ggplot2, but also demonstrates one of its advantages.

Plot with points added — Figure 3a. GGPlot2 showing illitracy vs murder rates with points added.

Plot with the regression line added. — Figure 3b. The regression line was added with geom_smooth().

Use of a theme demonstrated on ggplot2 — Figure 3c. A theme was applied to the plot.

Size and color demonstrated. — Figure 3d. Frost and income are given by changing the color and size of the points respectively. A title as added and sthe axis labels were also improved.

    # Make sure you install and load ggplot2
    
    > frost_by_illiteracy <- ggplot(state_data, aes(x=illiteracy, y=murder_rate))
    > frost_by_illiteracy
    

Notice you don’t see anything but an empty graph. Ggplot2 requires each element be add individually. This is actually a very good thing, it gives you tons of control, and also makes it easier to build up very complicated visualizations. Let’s add our points and line next, take note of the changes with each line (Figure 3a, Figure 3b).

    > frost_by_illiteracy <- frost_by_illiteracy + geom_point()
    > frost_by_illiteracy
    > frost_by_illiteracy <- frost_by_illiteracy + geom_smooth(method = lm, se = FALSE)
    # The se = FALSE turns off standard error lines for the plot. Turn it on and see what happenes.
    > frost_by_illiteracy

The geometries (geoms in R speak) are the standard way of adding elements to a plot. There are a lot of built in geometries you can find on the GGPlot2 geoms page. Each geom can just be added to your existing plot, you need to save it to a variable though to keep the change.

The default plotting may not be to your taste. You can tweak the options individually, or take the easy way out and use ggthemes(). As with the geoms there are a number of built-in themes, and more you can find in packages with a web search (Figure 3c).

    > frost_by_illiteracy <- frost_by_illiteracy + theme_classic()
    > frost_by_illiteracy

We can incorporate more information by adding color and point size. Notice the use of an aesthetic (aes()) to modify a geom. We’ll also fix the labels.

    > frost_by_illiteracy <- frost_by_illiteracy + 
                                geom_point(aes(color = frost, size = income)) +
                                labs(
                                   x = "Illiteracy\nPercent of Population", 
                                   y = "Murder Rate\nPer 100,000 People", 
                                   title = "Factors Affecting The Murder Rate in the USA"
                                   )
    > frost_by_illiteracy

And finally we can do this all in a neat, easy to read statement. The following does the same thing as the lines above, plus we add a save (Figure 3d).

    > frost_by_illiteracy <- ggplot(state_data, aes(x=illiteracy, y=murder_rate)) +
                             geom_point(aes(color = frost, size = income)) +
                             geom_smooth(method = lm, se = FALSE) + 
                             theme_classic() + 
                             labs(  x = "Illiteracy\nPercent of Population", 
                                    y = "Murder Rate\nPer 100,000 People", 
                                    title = "Factors Affecting The Murder Rate in the USA"
                                  )
    > frost_by_illiteracy
    > ggsave("fancy_plot.png", plot = frost_by_illiteracy)
         Saving 8.15 x 9.75 in image

Supporting Files

Example script for this unit.