Data frames

Published

October 7, 2023


Learning Objectives

  • load external data (CSV files) in memory
  • understand the concept of a data.frame
  • know how to access any element of a data.frame
  • understand factors and how to manipulate them

Reading external data

Now, we will start looking at data-sets. There are many ways of reading external datasets into R but we will review three different ways.

From R package

  • Reading data that is already in an R package. Sometimes you need to install a package, and sometimes it is already pre-installed in R. If you type data() into your console, you’ll see a list of datasets.
data()

We can choose any one of them to work on, e.g. the mtcars data.

# 1. Loading 
data("mtcars")
# 2. Print
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

From an URL

You can also read a CSV file directly from an URL where it is hosted. For example, here is a historical dataset collected by John Arbuthnot (1710). From the data page, we can read the details:

“John Arbuthnot (1710) used these time series data to carry out the first known significance test. During every one of the 82 years, there were more male christenings than female christenings. As Arbuthnot wondered, we might also wonder if this could be due to chance, or whether it meant the birth ratio was not actually 1:1.”

arbuthnot <- read.csv("https://www.openintro.org/data/csv/arbuthnot.csv")

head(arbuthnot)
  year boys girls
1 1629 5218  4683
2 1630 4858  4457
3 1631 4422  4102
4 1632 4994  4590
5 1633 5158  4839
6 1634 5035  4820

Challenge

Read this data-set from the american community survey: https://www.openintro.org/data/index.php?data=acs12. Scroll to the bottom of the page for the link to the CSV.

expand for full code
acs <- read.csv("https://www.openintro.org/data/csv/acs12.csv")
head(acs)
  income         employment hrs_work  race age gender citizen time_to_work
1  60000 not in labor force       40 white  68 female     yes           NA
2      0 not in labor force       NA white  88   male     yes           NA
3     NA               <NA>       NA white  12 female     yes           NA
4      0 not in labor force       NA white  17   male     yes           NA
5      0 not in labor force       NA white  77 female     yes           NA
6   1700           employed       40 other  35 female     yes           15
     lang married         edu disability   birth_qrtr
1 english      no     college         no jul thru sep
2 english      no hs or lower        yes jan thru mar
3 english      no hs or lower         no oct thru dec
4   other      no hs or lower         no oct thru dec
5   other      no hs or lower        yes jul thru sep
6   other     yes hs or lower        yes jul thru sep

Reading an external CSV file

The third option is reading a spreadsheet or an excel file from your computer’s hard disk. The file required for this lesson can be downlaoded by clicking on this link.

  • Move the downloaded file into the directory for this workshop.

  • To view your current working directory use the getwd() command.

Working directory is a file path that sets the location of any files you save from R, datasets you import into R, etc. This is your default file path. The function to set your working directory is: setwd(<PATH>).

To set you working directory, use the setwd() command. We want to set the working directory to the location of our project.

For example:

setwd("~/GitHub/BeginneR/R codes")

Looking at data

You are now ready to load the data. We are going to use the R function read.csv() to load the data file into memory (as a data.frame). In this case, our data is in a sub-directory called “data”.

cats <- read.csv(file = 'data/herding-cats-small.csv')

This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: cats.

cats
                street    coat    sex   age weight fixed wander_dist roamer
1       Los Robles Way   tabby female 3.003  3.993     0       0.040     no
2      242 Harding Ave maltese female 8.234 12.368     1       0.033     no
3    201 Hollywood Ave   brown female 4.601  3.947     1       0.076     no
4        140 Robin Way   black female 7.172  8.053     1       0.030     no
5       135 Charles St  calico   male 4.660  6.193     1       0.085    yes
6  130 Vista Del Campo   tabby female 3.796  3.860     1       0.085     no
7  115 Via Santa Maria   brown   male 6.917  5.626     1       0.097    yes
8      303 Harding Ave   brown   male 3.713  3.982     1       0.033     no
9     16982 Kennedy Rd   black female 2.851  3.291     0       0.065     no
10  16528 Marchmont Dr maltese female 4.594  6.994     0       0.059     no
   cat_id
1     321
2     250
3     219
4     182
5     107
6     234
7     196
8     311
9     130
10    349

However, if our dataset was larger, we probably wouldn’t want to print the whole thing to our console. Instead, we can use the head command to view the first six lines or the View command to open the dataset in a spreadsheet-like viewer.

head(cats)
View(cats)

We’ve just done two very useful things.
1. We’ve read our data in to R, so now we can work with it in R
2. We’ve created a data frame (with the read.csv command) the standard way R works with data.

What are data frames?

data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting.

A data.frame is actually a list of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.

A data.frame can be created by the functions read.csv() or read.table(), in other words, when importing spreadsheets from your hard drive (or the web).

By default, data.frame does not convert (= coerces) columns that contain characters (i.e., text) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to TRUE:

Oscars data

We will now read a data-frame on Oscar winners! This is available on this link, with details as follows:

“Best actor and actress Oscar winners from 1929 to 2018. Note: Although there have been only 84 Oscar ceremonies until 2012, there are 85 male winners and 85 female winners because ties happened on two occasions (1933 for the best actor and 1969 for the best actress).”

Let’s now check the structure of this data.frame in more details with the function str():

oscars <- read.csv("https://www.openintro.org/data/csv/oscars.csv", 
                   stringsAsFactors = TRUE)
str(oscars)
'data.frame':   184 obs. of  11 variables:
 $ oscar_no  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ oscar_yr  : int  1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 ...
 $ award     : Factor w/ 2 levels "Best actor","Best actress": 2 2 2 2 2 2 2 2 2 2 ...
 $ name      : Factor w/ 159 levels "Adrien Brody",..: 73 110 118 106 58 94 23 11 104 104 ...
 $ movie     : Factor w/ 179 levels "7th Heaven","A Double Life",..: 1 33 132 84 160 89 64 37 139 137 ...
 $ age       : int  22 37 28 63 32 26 31 27 26 27 ...
 $ birth_pl  : Factor w/ 51 levels "Alabama","Arizona",..: 36 7 7 7 50 8 11 24 13 13 ...
 $ birth_date: Factor w/ 158 levels "1868-04-10","1868-11-09",..: 28 9 21 2 18 29 23 34 38 38 ...
 $ birth_mo  : int  10 4 8 11 10 5 9 4 1 1 ...
 $ birth_d   : int  6 8 10 9 10 12 13 5 12 12 ...
 $ birth_y   : int  1906 1892 1902 1868 1900 1907 1903 1908 1910 1910 ...

Inspecting data.frame objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

  • Size:
    • dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow() - returns the number of rows
    • ncol() - returns the number of columns
  • Content:
    • head() - shows the first 6 rows
    • tail() - shows the last 6 rows
  • Names:
    • names() - returns the column names (synonym of colnames() for data.frame objects)
    • rownames() - returns the row names
  • Summary:
    • str() - structure of the object and information about the class, length and content of each column
    • summary() - summary statistics for each column

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Indexing data.frame objects

Our oscars data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers (i.e. [row, column]).

oscars[1, 2]   # first element in the 2nd column of the data frame
oscars[1, 6]   # first element in the 6th column
oscars[1:3, 7] # first three elements in the 7th column
oscars[10, ]    # the 3rd element for all columns
oscars[, 7]    # the entire 7th column

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are birth places in column 5 or 7? oh, right… they are in column 6). In some cases, in which column the variable will be can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.

You can do operations on a particular column, by selecting it using the $ sign. In this case, the entire column is a vector. You can use names(oscars) or colnames(oscars) to remind yourself of the column names. For instance, to extract all the oscar winners’ age at the time of win information from our dataset:

colnames(oscars)
 [1] "oscar_no"   "oscar_yr"   "award"      "name"       "movie"     
 [6] "age"        "birth_pl"   "birth_date" "birth_mo"   "birth_d"   
[11] "birth_y"   
oscars$age
  [1] 22 37 28 63 32 26 31 27 26 27 30 26 29 24 38 24 29 33 30 34 34 33 29 38 54
 [26] 24 25 47 41 27 41 38 28 27 31 37 29 25 34 60 26 61 35 34 34 26 37 42 41 35
 [51] 31 41 33 31 74 33 49 38 61 21 41 26 80 42 29 33 35 45 49 39 34 26 25 33 35
 [76] 35 28 30 29 61 32 33 45 29 62 44 40 62 53 35 47 34 33 52 41 37 38 34 32 40
[101] 43 48 41 40 49 56 41 38 41 52 51 35 30 38 41 43 48 36 47 31 46 36 56 41 44
[126] 42 43 62 43 41 48 48 56 38 60 30 40 42 37 76 39 52 45 35 61 43 51 32 42 54
[151] 52 37 38 31 45 60 46 40 36 47 29 43 37 38 45 50 48 60 50 39 22 44 54 26 28
[176] 60 44 55 44 32 41 41 59 37
oscars$birth_y
  [1] 1906 1892 1902 1868 1900 1907 1903 1908 1910 1910 1908 1913 1911 1917 1904
 [16] 1919 1915 1912 1916 1913 1914 1916 1921 1913 1898 1929 1929 1908 1915 1930
 [31] 1917 1921 1932 1934 1931 1926 1935 1940 1932 1907 1942 1907 1934 1936 1937
 [46] 1946 1936 1932 1934 1941 1946 1937 1946 1949 1907 1949 1934 1946 1924 1965
 [61] 1946 1962 1909 1948 1962 1959 1958 1949 1946 1957 1963 1972 1974 1967 1966
 [76] 1967 1975 1974 1976 1945 1975 1975 1964 1981 1949 1884 1889 1868 1878 1897
 [91] 1885 1899 1901 1883 1895 1900 1900 1905 1908 1901 1899 1895 1903 1905 1897
[106] 1891 1907 1911 1909 1899 1901 1918 1924 1917 1915 1914 1910 1923 1913 1930
[121] 1916 1927 1908 1924 1922 1925 1925 1907 1927 1930 1924 1925 1918 1937 1916
[136] 1947 1938 1937 1943 1905 1943 1931 1939 1950 1925 1944 1937 1957 1948 1937
[151] 1940 1956 1956 1964 1951 1937 1952 1959 1964 1954 1973 1960 1967 1967 1961
[166] 1957 1960 1949 1960 1972 1990 1969 1960 1989 1988 1957 1974 1957 1969 1982
[181] 1974 1975 1958 1981

In some cases, you may way to select more than one column. You can do this using the square brackets, passing in a vector of the columns to select. Suppose we wanted birth place and age information for the first 10 years.

oscars[1:10 , c("name", "age", "birth_pl", "age")]
                name age      birth_pl age.1
1       Janet Gaynor  22  Pennsylvania    22
2      Mary Pickford  37        Canada    37
3      Norma Shearer  28        Canada    28
4     Marie Dressler  63        Canada    63
5        Helen Hayes  32 Washington DC    32
6  Katharine Hepburn  26   Connecticut    26
7  Claudette Colbert  31        France    31
8        Bette Davis  27 Massachusetts    27
9       Luise Rainer  26       Germany    26
10      Luise Rainer  27       Germany    27

You can even access columns by column name and select specific rows of interest. For example, if we wanted the birth place and age of just rows 4 through 7, we could do:

oscars[4:7, c("age", "birth_pl")]
  age      birth_pl
4  63        Canada
5  32 Washington DC
6  26   Connecticut
7  31        France

We can can also use logical statements to select and filter items from a data.frame. For example, to select all rows with oscar winners from Virginia or winners at age 22 we could use the following statement

oscars[oscars$birth_pl == "Virginia", ]
    oscar_no oscar_yr        award             name               movie age
57        56     1984 Best actress Shirley MacLaine Terms of Endearment  49
83        82     2010 Best actress   Sandra Bullock      The Blind Side  45
129       43     1971   Best actor  George C. Scott              Patton  43
    birth_pl birth_date birth_mo birth_d birth_y
57  Virginia 1934-04-24        4      24    1934
83  Virginia 1964-07-26        7      26    1964
129 Virginia 1927-10-18       10      18    1927
oscars[oscars$age == 22, ]
    oscar_no oscar_yr        award              name                   movie
1          1     1929 Best actress      Janet Gaynor              7th Heaven
171       85     2013 Best actress Jennifer Lawrence Silver Linings Playbook
    age     birth_pl birth_date birth_mo birth_d birth_y
1    22 Pennsylvania 1906-10-06       10       6    1906
171  22     Kentucky 1990-08-15        8      15    1990

let’s break this down a bit. The logical statement in the brackets returns a vector of TRUE and FALSE values.

oscars$birth_pl == "Virginia"
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE

These booleans allow us to select which records we want from our data.frame

Another way to do this is with the function which(). which() finds the indexes of records meeting a logical statement

which(oscars$birth_pl == "Virginia")
[1]  57  83 129

So, we could also write

oscars[which(oscars$birth_pl == "Virginia"), ]
    oscar_no oscar_yr        award             name               movie age
57        56     1984 Best actress Shirley MacLaine Terms of Endearment  49
83        82     2010 Best actress   Sandra Bullock      The Blind Side  45
129       43     1971   Best actor  George C. Scott              Patton  43
    birth_pl birth_date birth_mo birth_d birth_y
57  Virginia 1934-04-24        4      24    1934
83  Virginia 1964-07-26        7      26    1964
129 Virginia 1927-10-18       10      18    1927

But that’s getting really long and ugly. R is already considered somewhat of an ugly duckling among programming languages, so no reason to play into the stereotype.

We can combine logical statements and index statements

oscars[oscars$birth_pl == "Virginia", c("name", "birth_pl", "age")]
                name birth_pl age
57  Shirley MacLaine Virginia  49
83    Sandra Bullock Virginia  45
129  George C. Scott Virginia  43

Finally, we can use &, the symbol for “and”, and |, the symbol for “or”, to make logical statements.

# oscars$birth_pl == "Virginia" & oscars$award == "Best actor"
oscars[oscars$birth_pl == "Virginia" & oscars$award == "Best actor", ]
    oscar_no oscar_yr      award            name  movie age birth_pl birth_date
129       43     1971 Best actor George C. Scott Patton  43 Virginia 1927-10-18
    birth_mo birth_d birth_y
129       10      18    1927

This statement selects all records with oscar winners who were born in Virginia and won in the “best actor” category.

Challenge

  1. Select all actors who won an oscar over the age of 50 and born in “New York”.
  2. Can you modify the code to select only wins in the “Best actor” category?
expand for full code
oscars[oscars$age>50 & oscars$birth_pl == "New York", ]
    oscar_no oscar_yr        award            name                   movie age
25        25     1953 Best actress   Shirley Booth Come Back, Little Sheba  54
110       24     1952   Best actor Humphrey Bogart       The African Queen  52
133       47     1975   Best actor      Art Carney         Harry and Tonto  56
151       65     1993   Best actor       Al Pacino        Scent of a Woman  52
156       70     1998   Best actor  Jack Nicholson      As Good as It Gets  60
    birth_pl birth_date birth_mo birth_d birth_y
25  New York 1898-08-30        8      30    1898
110 New York 1899-12-25       12      25    1899
133 New York 1918-11-04       11       4    1918
151 New York 1940-04-25        4      25    1940
156 New York 1937-04-22        4      22    1937
expand for full code
oscars[oscars$age>50 & oscars$birth_pl == "New York" & oscars$award == "Best actor", ]
    oscar_no oscar_yr      award            name              movie age
110       24     1952 Best actor Humphrey Bogart  The African Queen  52
133       47     1975 Best actor      Art Carney    Harry and Tonto  56
151       65     1993 Best actor       Al Pacino   Scent of a Woman  52
156       70     1998 Best actor  Jack Nicholson As Good as It Gets  60
    birth_pl birth_date birth_mo birth_d birth_y
110 New York 1899-12-25       12      25    1899
133 New York 1918-11-04       11       4    1918
151 New York 1940-04-25        4      25    1940
156 New York 1937-04-22        4      22    1937

Factors

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

In the data frame we just imported, let’s do

str(oscars)
'data.frame':   184 obs. of  11 variables:
 $ oscar_no  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ oscar_yr  : int  1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 ...
 $ award     : Factor w/ 2 levels "Best actor","Best actress": 2 2 2 2 2 2 2 2 2 2 ...
 $ name      : Factor w/ 159 levels "Adrien Brody",..: 73 110 118 106 58 94 23 11 104 104 ...
 $ movie     : Factor w/ 179 levels "7th Heaven","A Double Life",..: 1 33 132 84 160 89 64 37 139 137 ...
 $ age       : int  22 37 28 63 32 26 31 27 26 27 ...
 $ birth_pl  : Factor w/ 51 levels "Alabama","Arizona",..: 36 7 7 7 50 8 11 24 13 13 ...
 $ birth_date: Factor w/ 158 levels "1868-04-10","1868-11-09",..: 28 9 21 2 18 29 23 34 38 38 ...
 $ birth_mo  : int  10 4 8 11 10 5 9 4 1 1 ...
 $ birth_d   : int  6 8 10 9 10 12 13 5 12 12 ...
 $ birth_y   : int  1906 1892 1902 1868 1900 1907 1903 1908 1910 1910 ...

We can see the names of the multiple columns. And, we see that coat is a Factor w/ 5 levels

When we read in a file, any column that contains text is automatically assumed to be a factor. Once created, factors can only contain a pre-defined set values, known as levels. By default, R always sorts levels in alphabetical order.

You can check this by using the function levels(), and check the number of levels using nlevels():

levels(oscars$birth_pl)
 [1] "Alabama"        "Arizona"        "Australia"      "Austria"       
 [5] "Belgium"        "California"     "Canada"         "Connecticut"   
 [9] "England"        "Florida"        "France"         "Georgia"       
[13] "Germany"        "Hungary"        "Illinois"       "India"         
[17] "Iowa"           "Isle of Wight"  "Israel"         "Italy"         
[21] "Japan"          "Kentucky"       "Louisiana"      "Massachusetts" 
[25] "Michigan"       "Minnesota"      "Missouri"       "Montana"       
[29] "Nebraska"       "New Jersey"     "New York"       "New Zealand"   
[33] "North Carolina" "Ohio"           "Oklahoma"       "Pennsylvania"  
[37] "Philly"         "Poland"         "Puerto Rico"    "Russia"        
[41] "South Africa"   "Sweden"         "Switzerland"    "Tennessee"     
[45] "Texas"          "Utah"           "Virginia"       "Wales"         
[49] "Washington"     "Washington DC"  "Wisconsin"     
nlevels(oscars$birth_pl)
[1] 51

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or it is required by particular type of analysis. Additionally, specifying the order of the levels allows to compare levels:

satisfaction <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(satisfaction)
[1] "high"   "low"    "medium"
satisfaction <- factor(satisfaction, levels = c("low", "medium", "high"))
levels(satisfaction)
[1] "low"    "medium" "high"  
min(satisfaction) ## doesn't work
Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), levels = c("low", : 'min' not meaningful for factors
satisfaction <- factor(satisfaction, levels = c("low", "medium", "high"), ordered = TRUE)
levels(satisfaction)
[1] "low"    "medium" "high"  
min(satisfaction) ## works!
[1] low
Levels: low < medium < high

In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than using simple integer labels because factors are self describing: "low", "medium", and "high"” is more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the species in our example data set).

Converting factors

If you need to convert a factor to a character vector, simply use as.character(x).

Converting a factor to a numeric vector is however a little trickier, and you have to go via a character vector. Compare:

f <- factor(c(1, 5, 10, 2))
as.numeric(f)               ## wrong! and there is no warning...
[1] 1 3 4 2
as.numeric(as.character(f)) ## works...
[1]  1  5 10  2
as.numeric(levels(f))[f]    ## The recommended way.
[1]  1  5 10  2