The Definitive Guide to Solving Week 2 Question 5

Question 5 in the week 2 quiz for Getting and Cleaning Data in the Johns Hopkins Data Science Specialization on Coursera is deceptively simple: report the sum of the fourth of the nine columns. To answer this question correctly, one MUST view the file before writing R code. As noted in Strategy for Reading Files & APIs / Quiz 2, one can download the Atom editor for free and use it to view the file.

Nine columns? At first glance the screen capture looks like 5 columns, not 9. However, there are truly 9 columns of data in this file, including:

Week
Nino1and2SST
Nino1and2SSTA
Nino3SST
Nino3SSTA
Nino34SST
Nino34SSTA
Nino4SST
Nino4SSTA

For detailed descriptions of each data element, please review the NOAA SST and SSTA Data Set Codebook included in the Appendix of this article.

The subtlety one must see to distinguish 9 columns from the list of 5 is that for some rows, the initial character for some of the columns is a minus sign.

The challenge is how to read the fourth column, Nino3SST, correctly. The columns are real numbers, not integers as stated in a Discussion Forum post on Coursera. That is, they contain decimal points and one value to the right of the decimal point.

A text editor such as UltraEdit that provides a column number display makes it a bit easier to know which columns to read.

The negative signs and column numbers must both be accounted for in the R function used to read the data.

Additional Considerations

To answer the quiz question correctly, students must read the correct data file. The quiz question provides two URLs, one to be used for the quiz and another that is a link to the original data source that is continually updated.

Many students get frustrated because they read the original data source directly from the NOAA website, which has more records than the file to be used for the quiz. The correct file has 1,254 observations, starting with data from January 3, 1990 and ending on January 8, 2014, as illustrated below.

As of September 14th 2020, the data from the NOAA has 1,601 observations, the most recent of which is from September 2nd, 2020.

[PRO TIP] Use the summary() function to compare the data set you load into R against the values listed in the codebook below to confirm that the data file has been read correctly.

Partial Solutions: choosing the right R function is key

Once we know that the data is in a fixed record format, and particularly a fortran format file, a solution is straightforward once one figures out how to specify the fortran formats needed as an argument to read.fortran(). Here is a stub of a solution with read.fortran() from the utils package.

originalSource <- "https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for"
download.file(originalSource,"./data/wksst8110.for")
fileURL <- "./data/wksst8110.for"
# set addresses for fixed length fortran-style input file 
theAddresses <- c(*** magic goes here ***)
mydata <- read.fortran(file=fileURL,theAddresses,skip = 4)
theColumns <- c("week","nino1and2sst","nino1and2ssta","nino3sst",
            "nino3ssta","nino34sst","nino34ssta",
            "nino4sst","nino4ssta")
colnames(mydata) <- theColumns

However, there is almost always more than one way to accomplish a task in R. A partial solution with the read.fwf() function from Base R looks like this, where one must also configure a column specification object that is used as an argument to the function.

fwfCols <- c(*** magic goes here ***)
mydata2 <- read.fwf(fileURL,widths=fwfCols,skip=4,
                    col.names=theColumns)

Another option is readr::read_fwf().

library(readr)
readrData <- read_fwf(fileURL,
                      fwf_cols(*** magic goes here ***),skip=4)

At this point one can write the remaining code required to answer the question.

Appendix

The data set is maintained by the United States National Oceanographic and Atmospheric Administration. SST is an abbreviation of sea surface temperature. SSTA is an abbreviation of sea surface temperature anomaly. Anomaly is calculated as the deviation from a 30 year average. for the Sea Surface Temperature Index data set, the mean includes data collected within the El Niño Southern Oscillation between 1981 and 2010.

The El Niño Southern Oscillation is a combined atmospheric and ocean system consisting of four regions for which the NOAA collects data, as illustrated below.

El Niño Southern Oscillation

Codebook

[PRO TIP] Bookmark this article because you can use the codebook illustrated here as a template for your final project in Getting and Cleaning Data. Ironically, most of the data sets on the NOAA website have no codebooks or they are very difficult to find and poorly written, as students will discover during the Reproducible Research course within the Data Science Specialization.

References

Climate Glossary: Climate Prediction Center of the National Oceanographic and Atmospheric Administration, United States of America, retrieved 14 October 2017.

El Niño Southern Oscillation: Climate Prediction Center of the National Oceanographic and Atmospheric Administration, United States of America, retrieved 14 October 2017.

Monthly Atmospheric and Sea Surface Temperature Indices: Climate Prediction Center of the National Oceanographic and Atmospheric Administration, United States of America, retrieved 14 October 2017.

Optimum Interpolation Sea Surface Temperatures - V2: Climate Prediction Center of the National Oceanographic and Atmospheric Administration, United States of America, retrieved 14 October 2017.

Last Updated: 14 September 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Files

cleaningData-week2Q5.md

cleaningData-week2Q5.md

The Definitive Guide to Solving Week 2 Question 5

Additional Considerations

Partial Solutions: choosing the right R function is key

Appendix

El Niño Southern Oscillation

Codebook

References

Collapse file tree

Files

cleaningData-week2Q5.md

Latest commit

History

cleaningData-week2Q5.md

File metadata and controls

The Definitive Guide to Solving Week 2 Question 5

Additional Considerations

Partial Solutions: choosing the right R function is key

Appendix

El Niño Southern Oscillation

Codebook

References