data()
Data frames
Learning Objectives
- load external data (CSV files) in memory
- understand the concept of a
data.frame
- know how to access any element of a
data.frame
- understand
factors
and how to manipulate them
Reading external data
Now, we will start looking at data-sets. There are many ways of reading external datasets into R
but we will review three different ways.
From R
package
- Reading data that is already in an
R
package. Sometimes you need to install a package, and sometimes it is already pre-installed in R. If you typedata()
into your console, you’ll see a list of datasets.
We can choose any one of them to work on, e.g. the mtcars
data.
# 1. Loading
data("mtcars")
# 2. Print
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From an URL
You can also read a CSV file directly from an URL where it is hosted. For example, here is a historical dataset collected by John Arbuthnot (1710). From the data page, we can read the details:
“John Arbuthnot (1710) used these time series data to carry out the first known significance test. During every one of the 82 years, there were more male christenings than female christenings. As Arbuthnot wondered, we might also wonder if this could be due to chance, or whether it meant the birth ratio was not actually 1:1.”
<- read.csv("https://www.openintro.org/data/csv/arbuthnot.csv")
arbuthnot
head(arbuthnot)
year boys girls
1 1629 5218 4683
2 1630 4858 4457
3 1631 4422 4102
4 1632 4994 4590
5 1633 5158 4839
6 1634 5035 4820
Challenge
Read this data-set from the american community survey: https://www.openintro.org/data/index.php?data=acs12. Scroll to the bottom of the page for the link to the CSV.
expand for full code
<- read.csv("https://www.openintro.org/data/csv/acs12.csv")
acs head(acs)
income employment hrs_work race age gender citizen time_to_work
1 60000 not in labor force 40 white 68 female yes NA
2 0 not in labor force NA white 88 male yes NA
3 NA <NA> NA white 12 female yes NA
4 0 not in labor force NA white 17 male yes NA
5 0 not in labor force NA white 77 female yes NA
6 1700 employed 40 other 35 female yes 15
lang married edu disability birth_qrtr
1 english no college no jul thru sep
2 english no hs or lower yes jan thru mar
3 english no hs or lower no oct thru dec
4 other no hs or lower no oct thru dec
5 other no hs or lower yes jul thru sep
6 other yes hs or lower yes jul thru sep
Reading an external CSV file
The third option is reading a spreadsheet or an excel file from your computer’s hard disk. The file required for this lesson can be downlaoded by clicking on this link.
Move the downloaded file into the directory for this workshop.
To view your current working directory use the
getwd()
command.
Working directory is a file path that sets the location of any files you save from R, datasets you import into R, etc. This is your default file path. The function to set your working directory is: setwd(<PATH>)
.
To set you working directory, use the setwd()
command. We want to set the working directory to the location of our project.
For example:
setwd("~/GitHub/BeginneR/R codes")
Looking at data
You are now ready to load the data. We are going to use the R function read.csv()
to load the data file into memory (as a data.frame
). In this case, our data is in a sub-directory called “data”.
<- read.csv(file = 'data/herding-cats-small.csv') cats
This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: cats
.
cats
street coat sex age weight fixed wander_dist roamer
1 Los Robles Way tabby female 3.003 3.993 0 0.040 no
2 242 Harding Ave maltese female 8.234 12.368 1 0.033 no
3 201 Hollywood Ave brown female 4.601 3.947 1 0.076 no
4 140 Robin Way black female 7.172 8.053 1 0.030 no
5 135 Charles St calico male 4.660 6.193 1 0.085 yes
6 130 Vista Del Campo tabby female 3.796 3.860 1 0.085 no
7 115 Via Santa Maria brown male 6.917 5.626 1 0.097 yes
8 303 Harding Ave brown male 3.713 3.982 1 0.033 no
9 16982 Kennedy Rd black female 2.851 3.291 0 0.065 no
10 16528 Marchmont Dr maltese female 4.594 6.994 0 0.059 no
cat_id
1 321
2 250
3 219
4 182
5 107
6 234
7 196
8 311
9 130
10 349
However, if our dataset was larger, we probably wouldn’t want to print the whole thing to our console. Instead, we can use the head
command to view the first six lines or the View
command to open the dataset in a spreadsheet-like viewer.
head(cats)
View(cats)
We’ve just done two very useful things.
1. We’ve read our data in to R, so now we can work with it in R
2. We’ve created a data frame (with the read.csv
command) the standard way R works with data.
What are data frames?
data.frame
is the de facto data structure for most tabular data and what we use for statistics and plotting.
A data.frame
is actually a list
of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str()
function is useful to inspect the data types of the columns.
A data.frame
can be created by the functions read.csv()
or read.table()
, in other words, when importing spreadsheets from your hard drive (or the web).
By default, data.frame
does not convert (= coerces) columns that contain characters (i.e., text) into the factor
data type. Depending on what you want to do with the data, you may want to keep these columns as character
. To do so, read.csv()
and read.table()
have an argument called stringsAsFactors
which can be set to TRUE
:
Oscars data
We will now read a data-frame on Oscar winners! This is available on this link, with details as follows:
“Best actor and actress Oscar winners from 1929 to 2018. Note: Although there have been only 84 Oscar ceremonies until 2012, there are 85 male winners and 85 female winners because ties happened on two occasions (1933 for the best actor and 1969 for the best actress).”
Let’s now check the structure of this data.frame
in more details with the function str()
:
<- read.csv("https://www.openintro.org/data/csv/oscars.csv",
oscars stringsAsFactors = TRUE)
str(oscars)
'data.frame': 184 obs. of 11 variables:
$ oscar_no : int 1 2 3 4 5 6 7 8 9 10 ...
$ oscar_yr : int 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 ...
$ award : Factor w/ 2 levels "Best actor","Best actress": 2 2 2 2 2 2 2 2 2 2 ...
$ name : Factor w/ 159 levels "Adrien Brody",..: 73 110 118 106 58 94 23 11 104 104 ...
$ movie : Factor w/ 179 levels "7th Heaven","A Double Life",..: 1 33 132 84 160 89 64 37 139 137 ...
$ age : int 22 37 28 63 32 26 31 27 26 27 ...
$ birth_pl : Factor w/ 51 levels "Alabama","Arizona",..: 36 7 7 7 50 8 11 24 13 13 ...
$ birth_date: Factor w/ 158 levels "1868-04-10","1868-11-09",..: 28 9 21 2 18 29 23 34 38 38 ...
$ birth_mo : int 10 4 8 11 10 5 9 4 1 1 ...
$ birth_d : int 6 8 10 9 10 12 13 5 12 12 ...
$ birth_y : int 1906 1892 1902 1868 1900 1907 1903 1908 1910 1910 ...
Inspecting data.frame
objects
We already saw how the functions head()
and str()
can be useful to check the content and the structure of a data.frame
. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.
- Size:
dim()
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow()
- returns the number of rowsncol()
- returns the number of columns
- Content:
head()
- shows the first 6 rowstail()
- shows the last 6 rows
- Names:
names()
- returns the column names (synonym ofcolnames()
fordata.frame
objects)
rownames()
- returns the row names
- Summary:
str()
- structure of the object and information about the class, length and content of each column
summary()
- summary statistics for each column
Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame
.
Indexing data.frame
objects
Our oscars data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers (i.e. [row, column]).
1, 2] # first element in the 2nd column of the data frame
oscars[1, 6] # first element in the 6th column
oscars[1:3, 7] # first three elements in the 7th column
oscars[10, ] # the 3rd element for all columns
oscars[7] # the entire 7th column oscars[,
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are birth places in column 5 or 7? oh, right… they are in column 6). In some cases, in which column the variable will be can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.
You can do operations on a particular column, by selecting it using the $
sign. In this case, the entire column is a vector. You can use names(oscars)
or colnames(oscars)
to remind yourself of the column names. For instance, to extract all the oscar winners’ age at the time of win information from our dataset:
colnames(oscars)
[1] "oscar_no" "oscar_yr" "award" "name" "movie"
[6] "age" "birth_pl" "birth_date" "birth_mo" "birth_d"
[11] "birth_y"
$age oscars
[1] 22 37 28 63 32 26 31 27 26 27 30 26 29 24 38 24 29 33 30 34 34 33 29 38 54
[26] 24 25 47 41 27 41 38 28 27 31 37 29 25 34 60 26 61 35 34 34 26 37 42 41 35
[51] 31 41 33 31 74 33 49 38 61 21 41 26 80 42 29 33 35 45 49 39 34 26 25 33 35
[76] 35 28 30 29 61 32 33 45 29 62 44 40 62 53 35 47 34 33 52 41 37 38 34 32 40
[101] 43 48 41 40 49 56 41 38 41 52 51 35 30 38 41 43 48 36 47 31 46 36 56 41 44
[126] 42 43 62 43 41 48 48 56 38 60 30 40 42 37 76 39 52 45 35 61 43 51 32 42 54
[151] 52 37 38 31 45 60 46 40 36 47 29 43 37 38 45 50 48 60 50 39 22 44 54 26 28
[176] 60 44 55 44 32 41 41 59 37
$birth_y oscars
[1] 1906 1892 1902 1868 1900 1907 1903 1908 1910 1910 1908 1913 1911 1917 1904
[16] 1919 1915 1912 1916 1913 1914 1916 1921 1913 1898 1929 1929 1908 1915 1930
[31] 1917 1921 1932 1934 1931 1926 1935 1940 1932 1907 1942 1907 1934 1936 1937
[46] 1946 1936 1932 1934 1941 1946 1937 1946 1949 1907 1949 1934 1946 1924 1965
[61] 1946 1962 1909 1948 1962 1959 1958 1949 1946 1957 1963 1972 1974 1967 1966
[76] 1967 1975 1974 1976 1945 1975 1975 1964 1981 1949 1884 1889 1868 1878 1897
[91] 1885 1899 1901 1883 1895 1900 1900 1905 1908 1901 1899 1895 1903 1905 1897
[106] 1891 1907 1911 1909 1899 1901 1918 1924 1917 1915 1914 1910 1923 1913 1930
[121] 1916 1927 1908 1924 1922 1925 1925 1907 1927 1930 1924 1925 1918 1937 1916
[136] 1947 1938 1937 1943 1905 1943 1931 1939 1950 1925 1944 1937 1957 1948 1937
[151] 1940 1956 1956 1964 1951 1937 1952 1959 1964 1954 1973 1960 1967 1967 1961
[166] 1957 1960 1949 1960 1972 1990 1969 1960 1989 1988 1957 1974 1957 1969 1982
[181] 1974 1975 1958 1981
In some cases, you may way to select more than one column. You can do this using the square brackets, passing in a vector of the columns to select. Suppose we wanted birth place and age information for the first 10 years.
1:10 , c("name", "age", "birth_pl", "age")] oscars[
name age birth_pl age.1
1 Janet Gaynor 22 Pennsylvania 22
2 Mary Pickford 37 Canada 37
3 Norma Shearer 28 Canada 28
4 Marie Dressler 63 Canada 63
5 Helen Hayes 32 Washington DC 32
6 Katharine Hepburn 26 Connecticut 26
7 Claudette Colbert 31 France 31
8 Bette Davis 27 Massachusetts 27
9 Luise Rainer 26 Germany 26
10 Luise Rainer 27 Germany 27
You can even access columns by column name and select specific rows of interest. For example, if we wanted the birth place and age of just rows 4 through 7, we could do:
4:7, c("age", "birth_pl")] oscars[
age birth_pl
4 63 Canada
5 32 Washington DC
6 26 Connecticut
7 31 France
We can can also use logical statements to select and filter items from a data.frame
. For example, to select all rows with oscar winners from Virginia or winners at age 22 we could use the following statement
$birth_pl == "Virginia", ] oscars[oscars
oscar_no oscar_yr award name movie age
57 56 1984 Best actress Shirley MacLaine Terms of Endearment 49
83 82 2010 Best actress Sandra Bullock The Blind Side 45
129 43 1971 Best actor George C. Scott Patton 43
birth_pl birth_date birth_mo birth_d birth_y
57 Virginia 1934-04-24 4 24 1934
83 Virginia 1964-07-26 7 26 1964
129 Virginia 1927-10-18 10 18 1927
$age == 22, ] oscars[oscars
oscar_no oscar_yr award name movie
1 1 1929 Best actress Janet Gaynor 7th Heaven
171 85 2013 Best actress Jennifer Lawrence Silver Linings Playbook
age birth_pl birth_date birth_mo birth_d birth_y
1 22 Pennsylvania 1906-10-06 10 6 1906
171 22 Kentucky 1990-08-15 8 15 1990
let’s break this down a bit. The logical statement in the brackets returns a vector of TRUE
and FALSE
values.
$birth_pl == "Virginia" oscars
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE
These booleans
allow us to select which records we want from our data.frame
Another way to do this is with the function which()
. which()
finds the indexes of records meeting a logical statement
which(oscars$birth_pl == "Virginia")
[1] 57 83 129
So, we could also write
which(oscars$birth_pl == "Virginia"), ] oscars[
oscar_no oscar_yr award name movie age
57 56 1984 Best actress Shirley MacLaine Terms of Endearment 49
83 82 2010 Best actress Sandra Bullock The Blind Side 45
129 43 1971 Best actor George C. Scott Patton 43
birth_pl birth_date birth_mo birth_d birth_y
57 Virginia 1934-04-24 4 24 1934
83 Virginia 1964-07-26 7 26 1964
129 Virginia 1927-10-18 10 18 1927
But that’s getting really long and ugly. R is already considered somewhat of an ugly duckling among programming languages, so no reason to play into the stereotype.
We can combine logical statements and index statements
$birth_pl == "Virginia", c("name", "birth_pl", "age")] oscars[oscars
name birth_pl age
57 Shirley MacLaine Virginia 49
83 Sandra Bullock Virginia 45
129 George C. Scott Virginia 43
Finally, we can use &
, the symbol for “and”, and |
, the symbol for “or”, to make logical statements.
# oscars$birth_pl == "Virginia" & oscars$award == "Best actor"
$birth_pl == "Virginia" & oscars$award == "Best actor", ] oscars[oscars
oscar_no oscar_yr award name movie age birth_pl birth_date
129 43 1971 Best actor George C. Scott Patton 43 Virginia 1927-10-18
birth_mo birth_d birth_y
129 10 18 1927
This statement selects all records with oscar winners who were born in Virginia and won in the “best actor” category.
Challenge
- Select all actors who won an oscar over the age of 50 and born in “New York”.
- Can you modify the code to select only wins in the “Best actor” category?
expand for full code
$age>50 & oscars$birth_pl == "New York", ] oscars[oscars
oscar_no oscar_yr award name movie age
25 25 1953 Best actress Shirley Booth Come Back, Little Sheba 54
110 24 1952 Best actor Humphrey Bogart The African Queen 52
133 47 1975 Best actor Art Carney Harry and Tonto 56
151 65 1993 Best actor Al Pacino Scent of a Woman 52
156 70 1998 Best actor Jack Nicholson As Good as It Gets 60
birth_pl birth_date birth_mo birth_d birth_y
25 New York 1898-08-30 8 30 1898
110 New York 1899-12-25 12 25 1899
133 New York 1918-11-04 11 4 1918
151 New York 1940-04-25 4 25 1940
156 New York 1937-04-22 4 22 1937
expand for full code
$age>50 & oscars$birth_pl == "New York" & oscars$award == "Best actor", ] oscars[oscars
oscar_no oscar_yr award name movie age
110 24 1952 Best actor Humphrey Bogart The African Queen 52
133 47 1975 Best actor Art Carney Harry and Tonto 56
151 65 1993 Best actor Al Pacino Scent of a Woman 52
156 70 1998 Best actor Jack Nicholson As Good as It Gets 60
birth_pl birth_date birth_mo birth_d birth_y
110 New York 1899-12-25 12 25 1899
133 New York 1918-11-04 11 4 1918
151 New York 1940-04-25 4 25 1940
156 New York 1937-04-22 4 22 1937
Factors
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
In the data frame we just imported, let’s do
str(oscars)
'data.frame': 184 obs. of 11 variables:
$ oscar_no : int 1 2 3 4 5 6 7 8 9 10 ...
$ oscar_yr : int 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 ...
$ award : Factor w/ 2 levels "Best actor","Best actress": 2 2 2 2 2 2 2 2 2 2 ...
$ name : Factor w/ 159 levels "Adrien Brody",..: 73 110 118 106 58 94 23 11 104 104 ...
$ movie : Factor w/ 179 levels "7th Heaven","A Double Life",..: 1 33 132 84 160 89 64 37 139 137 ...
$ age : int 22 37 28 63 32 26 31 27 26 27 ...
$ birth_pl : Factor w/ 51 levels "Alabama","Arizona",..: 36 7 7 7 50 8 11 24 13 13 ...
$ birth_date: Factor w/ 158 levels "1868-04-10","1868-11-09",..: 28 9 21 2 18 29 23 34 38 38 ...
$ birth_mo : int 10 4 8 11 10 5 9 4 1 1 ...
$ birth_d : int 6 8 10 9 10 12 13 5 12 12 ...
$ birth_y : int 1906 1892 1902 1868 1900 1907 1903 1908 1910 1910 ...
We can see the names of the multiple columns. And, we see that coat is a Factor w/ 5 levels
When we read in a file, any column that contains text is automatically assumed to be a factor. Once created, factors can only contain a pre-defined set values, known as levels. By default, R always sorts levels in alphabetical order.
You can check this by using the function levels()
, and check the number of levels using nlevels()
:
levels(oscars$birth_pl)
[1] "Alabama" "Arizona" "Australia" "Austria"
[5] "Belgium" "California" "Canada" "Connecticut"
[9] "England" "Florida" "France" "Georgia"
[13] "Germany" "Hungary" "Illinois" "India"
[17] "Iowa" "Isle of Wight" "Israel" "Italy"
[21] "Japan" "Kentucky" "Louisiana" "Massachusetts"
[25] "Michigan" "Minnesota" "Missouri" "Montana"
[29] "Nebraska" "New Jersey" "New York" "New Zealand"
[33] "North Carolina" "Ohio" "Oklahoma" "Pennsylvania"
[37] "Philly" "Poland" "Puerto Rico" "Russia"
[41] "South Africa" "Sweden" "Switzerland" "Tennessee"
[45] "Texas" "Utah" "Virginia" "Wales"
[49] "Washington" "Washington DC" "Wisconsin"
nlevels(oscars$birth_pl)
[1] 51
Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or it is required by particular type of analysis. Additionally, specifying the order of the levels allows to compare levels:
<- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
satisfaction levels(satisfaction)
[1] "high" "low" "medium"
<- factor(satisfaction, levels = c("low", "medium", "high"))
satisfaction levels(satisfaction)
[1] "low" "medium" "high"
min(satisfaction) ## doesn't work
Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), levels = c("low", : 'min' not meaningful for factors
<- factor(satisfaction, levels = c("low", "medium", "high"), ordered = TRUE)
satisfaction levels(satisfaction)
[1] "low" "medium" "high"
min(satisfaction) ## works!
[1] low
Levels: low < medium < high
In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than using simple integer labels because factors are self describing: "low"
, "medium"
, and "high"
” is more descriptive than 1
, 2
, 3
. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the species in our example data set).
Converting factors
If you need to convert a factor to a character vector, simply use as.character(x)
.
Converting a factor to a numeric vector is however a little trickier, and you have to go via a character vector. Compare:
<- factor(c(1, 5, 10, 2))
f as.numeric(f) ## wrong! and there is no warning...
[1] 1 3 4 2
as.numeric(as.character(f)) ## works...
[1] 1 5 10 2
as.numeric(levels(f))[f] ## The recommended way.
[1] 1 5 10 2