TDM 10100: R Project 2 — 2024
Motivation: R is one of the most popular tools for data analysis. Indexing and grouping values in R are very powerful. (We can do a lot, with just one line of R!)
Context: We will load several data frames in R and will practice indexing the data in several ways.
Scope: R, Operators, Conditionals
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/olympics/athlete_events.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
Questions
For this project (and moving forward, when you are using R), please use the |
Question 1 (2 pts)
Import the Olympics data from the file /anvil/projects/tdm/data/olympics/athlete_events.csv into a data frame called myDF. Make a table from the values in the column myDF$Year and then plot this table. (Your work will be similar to Project 1, Questions 3, 4, 5.) [Take a look at the resulting plot: Does it make sense? For instance: Does it make sense that the number of athletes is increasing over time? Can you see the halt in the Olympics during the two World Wars? Do you see that the 2-year rotation between the summer and winter Olympics began in the 1990s?]
- A table showing the number of athletes participating in the Olympics during each year.
- A plot showing the number of athletes participating in the Olympics during each year.
- As always, be sure to document your work from Question 1 (and from all of the questions!), using some comments and insights about your work. We will stop adding this note to document your work, but please remember, we always assume that you will document every single question with your comments and your insights.
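As a minimal sketch of the table-then-plot pattern, the following uses a small made-up vector of years (toyYears is hypothetical and is not the Olympics data) in place of myDF$Year. It is only meant to show the shape of the workflow, not the answer to the question.

```r
# Hypothetical toy data: one entry per athlete row, giving the year of the games.
toyYears <- c(1896, 1900, 1900, 1904, 1904, 1904)

# table() counts how many times each distinct value occurs.
yearCounts <- table(toyYears)
yearCounts

# plot() on a one-dimensional table draws one vertical line per distinct value,
# with the height of each line equal to its count.
plot(yearCounts)
```

The same two steps should carry over to a column of a data frame, since myDF$Year is also just a vector of values.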
Question 2 (2 pts)
In the Olympics data:
Which value appears in the "NOC" column the most times?
Which value appears in the "Name" column the most times? Hint: If you try to view the entire table of values in the "Name" column, the table has length 134732, and it will not finish displaying. For this reason, you should only look at the head or the tail of your table, not the entire table itself.
- The value that appears in the "NOC" column the most times.
- The value that appears in the "Name" column the most times.
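One possible way to find the most frequent value without printing a huge table is to sort the table and look only at its end. The sketch below assumes a small made-up vector (toyNOC is hypothetical) standing in for a column of myDF.

```r
# Hypothetical toy data standing in for a column such as myDF$NOC.
toyNOC <- c("USA", "FRA", "USA", "GBR", "USA", "FRA")

# sort() orders the counts from smallest to largest by default,
# so the most frequent value ends up in the last position.
nocCounts <- sort(table(toyNOC))

# tail() shows only the end of the sorted table, which stays readable
# even when the full table has many thousands of entries.
tail(nocCounts, n = 1)

# The label of the last entry is the most frequent value itself.
mostCommon <- names(nocCounts)[length(nocCounts)]
mostCommon
```

Sorting before calling tail is the key design choice here: an unsorted table is ordered alphabetically by label, so its tail would say nothing about frequency.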
Question 3 (2 pts)
In the Olympics data:
When we examine the head of myDF, notice that the third row is from team "Denmark" while the fourth row is from team "Denmark/Sweden".
How many rows correspond exactly to team "Denmark"?
How many rows have "Denmark" in the team name ("Denmark" may or may not be the exact team name)? Hint: You can use the grep or grepl function.
Find the names of the teams that have "Denmark" in the team name but are not exactly "Denmark". Hint: There should be exactly 72 such rows.
- The number of rows corresponding exactly to team "Denmark".
- The number of rows with "Denmark" as part of the team name.
- The names of teams that have "Denmark" included but are not exactly "Denmark".
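The difference between an exact match and a partial match can be sketched on a small made-up vector of team names. Here toyTeams is hypothetical ("Team Denmark B" is invented for illustration), and the counts below describe only this toy vector, not the Olympics data.

```r
# Hypothetical toy data standing in for a team-name column.
toyTeams <- c("Denmark", "Denmark/Sweden", "Sweden", "Denmark", "Team Denmark B")

# == tests for an exact match; sum() counts the TRUE values.
exactCount <- sum(toyTeams == "Denmark")

# grepl() returns TRUE wherever the pattern appears anywhere in the string.
partialCount <- sum(grepl("Denmark", toyTeams))

# Combine the two conditions: contains "Denmark" but is not exactly "Denmark".
containsButNotExact <- grepl("Denmark", toyTeams) & toyTeams != "Denmark"

# unique() lists each such team name once.
otherTeams <- unique(toyTeams[containsButNotExact])
otherTeams
```

Note that grep returns the positions of the matches, while grepl returns a TRUE/FALSE vector of the same length as the input, which is what makes it convenient to combine with other conditions using &.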
Question 4 (2 pts)
Not all data comes in a comma-delimited format, i.e., with commas in between the pieces of data. In the data set of donations from the 1980 federal election campaigns, the symbol "|" is placed between pieces of data.
C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469
C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948
C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949
C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950
C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914
C00078287|A|Q1||79031231889|15||CRAMER, JOHN H|CHICAGO|IL|60685|||02281979|200||||||3061920110000383915
C00078287|A|Q1||79031231889|15||MCHUGH, KEVIN|CHICAGO|IL|60685|||03051979|200||||||3061920110000383916
C00078287|A|Q1||79031231889|15||NOHA, EDWARD J|CHICAGO|IL|60685|||03121979|300||||||3061920110000383917
C00078287|A|Q1||79031231889|15||RYCROFT, DONALD C|CHICAGO|IL|60685|||03191979|200||||||3061920110000383918
C00078287|A|Q1||79031231889|15||VANDERSLICE, WILLIAM D|CHICAGO|IL|60685|||02271979|200||||||3061920110000383919
Instead of using the read.csv function to read in the data, we can use the fread function to read in the data, and it will automatically detect what symbol is placed between the pieces of data. The fread function is not available by default, so we first load the data.table library.
This data set also does not have the names of the columns built in! So we need to specify the names of the columns.
You can use the following to read in the data and name the columns properly:
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
Now that you have the data read into the data frame myDF, here are two questions to get familiar with the data:
Which value appears in the "STATE" column the most times?
Which value appears in the "NAME" column the most times? Hint: As in Question 2, if you try to view the entire table of values in the "NAME" column, the table has length 217646, and it will not finish displaying. For this reason, you should only look at the head or the tail of your table, not the entire table itself.
- The value that appears in the "STATE" column the most times.
- The value that appears in the "NAME" column the most times.
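To see how fread handles a pipe-delimited file with no header row, here is a small sketch that reads three made-up lines from memory instead of from a file. The lines, the four column names, and the name toyDF are all hypothetical. The sketch assumes the data.table package is installed and uses the text argument of fread to supply the lines directly.

```r
library(data.table)

# Three made-up lines in the same pipe-delimited, header-less style.
toyLines <- c("C001|MIDLAND|MI|400",
              "C001|MIDLAND|MI|1500",
              "C002|CHICAGO|IL|200")

# No sep argument is given: fread inspects the text and detects the "|" itself.
# header = FALSE tells fread that the first line is data, not column names.
toyDF <- fread(text = toyLines, header = FALSE, quote = "")

# Without a header, the columns get placeholder names, so we assign our own.
names(toyDF) <- c("CMTE_ID", "CITY", "STATE", "TRANSACTION_AMT")
toyDF
```

The quote="" setting mirrors the call used for the real file: it asks fread to treat quotation marks as ordinary characters instead of as field delimiters.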
Question 5 (2 pts)
In the data set about the 1980 federal election campaigns:
Use the paste command to join the "CITY" and "STATE" columns, with the goal of determining the top 5 city-and-state locations where donations were made.
Hint: As in Questions 2 and 4, if you try to view the entire table of values of city-and-state pairs, the table has length 217646, and it will not finish displaying. For this reason, you should only look at the head or the tail of your table, not the entire table itself.
Another hint: Please notice the fact that there are 11582 rows in the data set in which the "CITY" and "STATE" are both empty!
- The top 5 city-and-state locations where donations were made in the 1980 federal election campaigns.
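The paste-then-table idea can be sketched on a tiny made-up data frame. Here toyDonations is hypothetical, and the separator ", " is just one reasonable choice. One row deliberately has an empty city and an empty state, to show how such rows appear in the resulting table.

```r
# Hypothetical toy data standing in for the CITY and STATE columns.
toyDonations <- data.frame(
  CITY  = c("MIDLAND", "CHICAGO", "CHICAGO", "", "MIDLAND", "CHICAGO"),
  STATE = c("MI", "IL", "IL", "", "MI", "IL")
)

# paste() joins the two columns element by element, inserting the separator.
cityState <- paste(toyDonations$CITY, toyDonations$STATE, sep = ", ")

# Count each city-and-state pair, with the largest counts first.
cityStateCounts <- sort(table(cityState), decreasing = TRUE)

# head() shows only the top entries of the sorted table.
head(cityStateCounts, n = 2)

# Rows with empty CITY and STATE are still counted, under the label ", ".
cityStateCounts[", "]
```

Because empty rows are pasted into a label of their own, a large number of them could show up among the top entries, so it is worth checking whether a top entry is a real location.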
Submitting your Work
Great job, you’ve completed Project 2! This project was your first real foray into the world of R, and it is okay to feel a bit overwhelmed. R is likely a new language to you, and just like any other language, it will get much easier with time and practice. As we keep building on these fundamental concepts in the next few weeks, don’t be afraid to come back and revisit your previous work. As always, please ask any questions you have during seminar, on Piazza, or in office hours. We hope you have a great rest of your week, and we’re excited to keep learning about R with you in the next project!
- firstname_lastname_project2.ipynb
You must double check your You will not receive full credit if your |