XML and JSON Files

Both XML and JSON files are used to transmit data over the internet, and both file types follow a tree or hierarchical structure.

XML stands for "extensible markup language". XML code resembles HTML code in that both use tags. However, HTML tags control how the content on a page is displayed, whereas XML tags label the data they contain. Tags are easy to recognize: look for key words inside '< >'. An element typically begins with an opening tag, '<[something]>', and ends with a matching closing tag that adds a forward slash, '</[something]>'. Notice in the following example that three sets of tags are nested within a single '<grocerylist>' tag.

<grocerylist>
<fruit>apples, oranges, bananas, kiwi, pears</fruit>
<vegetable>lettuce, tomatoes, peppers, asparagus</vegetable>
<fridgefood>beef, chicken, eggs, milk</fridgefood>
</grocerylist>

The same data is written as JSON below. Note that the tree structure is preserved, so it still shows how to reach the nested elements.

{
  "grocerylist": {
    "fruit": "apples, oranges, bananas, kiwi, pears",
    "vegetable": "lettuce, tomatoes, peppers, asparagus",
    "fridgefood": "beef, chicken, eggs, milk"
  }
}
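
To make the idea of walking the tree concrete, here is a minimal sketch in base R that rebuilds the grocery list by hand as a nested list and reaches into it one level at a time. The object names grocery and fruit_items are made up for illustration; no file is read and no package is needed.

```r
# Hand-built nested list mirroring the grocery list hierarchy (illustrative only)
grocery <- list(
  grocerylist = list(
    fruit      = "apples, oranges, bananas, kiwi, pears",
    vegetable  = "lettuce, tomatoes, peppers, asparagus",
    fridgefood = "beef, chicken, eggs, milk"
  )
)

# Names of the branches under the top-level node
names(grocery$grocerylist)

# Follow the tree down to a single leaf
grocery$grocerylist$fruit

# Each leaf is one comma-separated string, so split it to get individual items
fruit_items <- strsplit(grocery$grocerylist$fruit, ", ", fixed = TRUE)[[1]]
length(fruit_items)
```

Because each category is stored as a single string, the individual foods are not separate nodes in the tree; splitting the string is one way to recover them.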

APIs

Short for "application programming interface", an API is a set of routines and protocols that lets a program request data or services directly from another piece of software, such as a website. In other words, APIs can make extracting website data much easier if you plan to access web data programmatically.

There are many R packages on CRAN that make use of these APIs automatically and turn the extracted data (such as JSON or XML files) into an R object, which is necessary if you want to continue analyzing the data in R. To check whether there is already an R package for the API you wish to use, simply conduct an online search for "CRAN [name of the website you're interested in]". Two commonly used R packages for interacting with APIs are jsonlite and XML. The necessary functions are described below, followed by illustrative examples.

Converting Between JSON Files and R Objects

fromJSON(txt) - Used to convert a JSON file or URL to an R object

  • txt: the file or URL location for the JSON file; must be written within quotations

toJSON(x) - Used to convert an R object to JSON text

  • x: the object to be made into JSON

# Install and load the necessary package
install.packages("jsonlite")
library(jsonlite)
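
Before calling a live API, it can help to try the two functions offline. The following is a small sketch, assuming jsonlite is installed; the groceries data frame is invented for illustration. It converts an R object to JSON text with toJSON() and then back again with fromJSON().

```r
library(jsonlite)

# A small made-up data frame to convert (illustrative only)
groceries <- data.frame(
  item = c("apples", "milk"),
  quantity = c(4, 1),
  stringsAsFactors = FALSE
)

# R object -> JSON text
json_text <- toJSON(groceries)
json_text

# JSON text -> R object; an array of records comes back as a data frame
groceries_again <- fromJSON(json_text)
str(groceries_again)
```

Here fromJSON() is given the JSON text itself rather than a file path or URL, which it also accepts.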

Next, we will use the article search API for NYTimes.com to access JSON information on articles concerning Microsoft.

JSON_data <- fromJSON("https://api.nytimes.com/svc/search/v2/articlesearch.json?api-key=[paste your personally requested key code here]&q='Microsoft'")

Notice that the argument for fromJSON() is contained within quotation marks and is made up of four separate parts. First is the location of the API, https://api.nytimes.com/svc/search/v2/articlesearch.json, then ? to mark the start of the instructions for what you want the API to do. Next comes an API key, api-key=[paste your personally requested key code here], which we had to request and have emailed to us before we could use the API service. The key lets the developer contact us if we are overloading their servers with too many requests, and cut us off or even deny us future access if our activity looks like a cyber attack. At the end we have &q='Microsoft', which indicates that we want to search for any articles concerning Microsoft.
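
The four parts described above can also be assembled in code rather than typed as one long string. This is only a sketch in base R: the environment variable name NYT_API_KEY is a hypothetical choice for keeping the key out of your script, and the search term is passed without the surrounding quotation marks on the assumption that the API accepts a bare term. No request is sent.

```r
# Part 1: location of the API
base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"

# Part 3: the API key, read from a hypothetical environment variable;
# falls back to the placeholder text if the variable is not set
api_key <- Sys.getenv(
  "NYT_API_KEY",
  unset = "[paste your personally requested key code here]"
)

# Part 4: the search term, encoded so that spaces or symbols are URL-safe
query <- URLencode("Microsoft", reserved = TRUE)

# Part 2 is the "?" that separates the location from the instructions
request_url <- paste0(base_url, "?api-key=", api_key, "&q=", query)
request_url
```

The resulting request_url could then be passed to fromJSON() in place of the hand-typed string.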

Because the output from fromJSON() is a list, we can access internal elements of JSON_data via the $ symbol. For more on indexing into lists, see Basic Syntax in the Beginners R tutorials.

response <- JSON_data$response

# and we can further index into deeper nested content
JSON_data$response$docs$document_type
## [1] "topic"   "topic"   "topic"   "article" "article" "article" "article" "article"
## [9] "article" "article"
 
JSON_data$response$meta$hits
## [1] 26748
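
If you do not have an API key yet, the same indexing can be practiced on a stand-in object. The sketch below uses base R only; mock_data is a made-up list that merely imitates the shape of the output above, and its values are not real results.

```r
# Made-up stand-in with the same nesting as JSON_data (not real API output)
mock_data <- list(
  response = list(
    docs = data.frame(
      document_type = c("topic", "article", "article"),
      stringsAsFactors = FALSE
    ),
    meta = list(hits = 3L)
  )
)

# See which elements are available at a given level
names(mock_data$response)

# $ and [[ ]] reach the same element
mock_data$response$meta$hits
mock_data[["response"]][["meta"]][["hits"]]

# Summarize a nested column once you have reached it
type_counts <- table(mock_data$response$docs$document_type)
type_counts
```

Calling names() at each level is a quick way to find out what can follow the next $ when the structure of a response is unfamiliar.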

Working in the other direction, we can use toJSON() to convert JSON_data back into JSON text, and then save this information to a JSON file.

# Note: 'pretty = TRUE' option adds indentation whitespace
myjson <- toJSON(JSON_data, pretty = TRUE)

# To save as a JSON file, first define a connection to a JSON file
fileConn <- file("NYTimes.json")  

# Then write out the content and end by closing the connection
writeLines(myjson, fileConn)
close(fileConn)
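
To check that a saved file can be read back, a round trip through a temporary file is one option. This is a sketch assuming jsonlite is installed; example_data and json_path are illustrative names, and a temporary file is used so nothing in your working directory is overwritten.

```r
library(jsonlite)

# Small made-up object to save (illustrative only)
example_data <- list(source = "example", hits = 3)

# Write the JSON text to a temporary file
json_path <- tempfile(fileext = ".json")
writeLines(toJSON(example_data, pretty = TRUE, auto_unbox = TRUE), json_path)

# Read the file back in; fromJSON() accepts a file path as well as a URL
restored <- fromJSON(json_path)
str(restored)
```

Passing the file name directly to writeLines() opens and closes the connection for you, which is a shorter alternative to the explicit file() and close() calls shown above.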

Converting Between XML Files and R Objects

xmlParse(txt) - Used to convert an XML file or URL to an R object

  • txt: the file or URL to get the XML file from

# Install and load the necessary package
install.packages("XML")
library(XML)

# Convert the XML file into an R object
xmlfile <- xmlParse("C:\\Users\\Username\\Documents\\groceryList.xml")

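# Define the top (root) node and use it to access internal elements
xmltop <- xmlRoot(xmlfile)
xmltop[[1]]     # first child node of the root
xmltop[[2]]     # second child node of the root
xmltop[[2]][1]  # first child of the second node, returned inside a list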
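
If you do not have an XML file on disk, the same steps can be tried on XML held in a string. This is a sketch assuming the XML package is installed; it reuses the grocery list from the start of this page, and xml_text, doc, and top are illustrative names.

```r
library(XML)

# The grocery list XML as a single string (no file needed)
xml_text <- paste0(
  "<grocerylist>",
  "<fruit>apples, oranges, bananas, kiwi, pears</fruit>",
  "<vegetable>lettuce, tomatoes, peppers, asparagus</vegetable>",
  "<fridgefood>beef, chicken, eggs, milk</fridgefood>",
  "</grocerylist>"
)

# asText = TRUE tells xmlParse() the input is XML content, not a file name
doc <- xmlParse(xml_text, asText = TRUE)
top <- xmlRoot(doc)

# Name of the root tag
root_name <- xmlName(top)

# Text stored inside every child of the root
child_values <- xmlSApply(top, xmlValue)

# Child nodes can also be selected by tag name
fruit_text <- xmlValue(top[["fruit"]])
fruit_text
```

xmlValue() returns only the text inside a node, which is usually what you want to pass on to further analysis, whereas indexing alone returns the node with its tags.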
