CICF Week 7

The goals for this week are to

  1. Call a web service from the command line
  2. Be able to specify HTTP headers with curl requests
  3. Be able to manipulate JSON files

Tutorial

This week we will look at web services.

Start by looking at the human version of a web page. Open your VM, and then open the web browser and visit

https://orcid.org/0000-0002-1825-0097

This is (should be?) the only fictional person with an ORCID record. The page displays his name and some information about him.

Let's look at this under the hood. Make a new browser tab and go to this website:

https://base64.guru/tools/http-request-online

We will first look at HTTP requests with this and then from the command line. Enter the previous ORCID URL into the URL box. Choose HTTP request version 1.1.

We see the request sent. It has the HTTP method, GET, and a few headers. These headers are standard boilerplate.

Below that we see the response headers. The first line has the response code, in this case "200 OK". More headers describe the data: it is text/html. Of the remaining headers, some are important to the client and some are for debugging.

Below the response headers is the response body, which holds the HTML-encoded text of the displayed webpage. This shows the distinction between HTTP (the protocol used to transfer the page) and HTML (the text that forms the "web page").
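To make the three parts concrete, here is a minimal sketch that splits a response into status code, one header, and body. The response text is an abbreviated, made-up sample (not a capture from ORCID), and it uses plain line endings where a real HTTP response ends each header line with a carriage return and line feed.

```shell
# Illustrative response only: real servers send many more headers.
response='HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<html><body>Example page</body></html>'

# The status line is the first line; its second field is the response code.
status=$(printf '%s\n' "$response" | awk 'NR == 1 {print $2}')

# Header names are case-insensitive, so compare them in lowercase.
content_type=$(printf '%s\n' "$response" |
  awk -F': ' 'tolower($1) == "content-type" {print $2; exit}')

# Headers run until the first blank line; the body is everything after it.
body=$(printf '%s\n' "$response" | awk 'found {print} /^$/ {found = 1}')

echo "status:       $status"
echo "content type: $content_type"
echo "body:         $body"
```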

We can add other headers to our request. Of course, if the server doesn't understand a header it can ignore it or return an error, its choice.

I would then look at this using the JSON response ORCID can provide, but the website now requires a sign-in before providing this. So, let's look at OpenAlex.

OpenAlex

Surprisingly, there is no complete database of all academic scholarship. There are a few aggregators that try to index as much as they can: Google Scholar, DataCite Commons, and OpenAlex are three of them. There are also more specialized databases, such as PubMed for medical research.

OpenAlex is a catalog of open science papers, people, datasets, institutions, and so on. In the browser visit the page:

https://openalex.org/works/w2764299839

This, again, is a human-readable page provided by the catalog. Let's try asking for a JSON representation. Add the header Accept: application/json by typing that into the box labeled "HTTP Request Headers". This tells the server that we don't want an HTML page and would like a JSON-encoded response instead. In this case, we get a page that wants us to use JavaScript. This seems to be a newer technique to prevent bots from scraping data off a page. But the information is all available at the API endpoint:

https://api.openalex.org/works/W2764299839

Now we get an interesting response. The first line has a 302 response code. This is the server telling us that we need to retry at a different URL. The Location: header is telling us the new URL to use. Why? It seems to want us to use a capital "W". The second request returns a JSON response body. It is all on one line. Sometimes servers do this, since the line breaks are not needed to decode the JSON.
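The client's side of a redirect can be sketched in a few lines: check whether the response code is in the 300s, and if so read the new URL from the Location header. The headers below are an abbreviated, made-up sample rather than a real capture, and redirect_target is a hypothetical helper name.

```shell
# Illustrative headers for the first response (abbreviated, not a real capture).
first_response='HTTP/1.1 302 Found
Location: https://api.openalex.org/works/W2764299839
Content-Length: 0'

# Print the URL to retry if the response is a redirect (3xx), else print nothing.
redirect_target() {
  printf '%s\n' "$1" | awk '
    NR == 1 { redirect = ($2 >= 300 && $2 < 400) }
    redirect && tolower($1) == "location:" { print $2; exit }
  '
}

next_url=$(redirect_target "$first_response")
echo "retry at: $next_url"
```

A browser, or curl when asked to follow redirects, does the same thing and then sends a second request to that URL.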

Thinking of this architecture, why do you think the servers used a redirect rather than just returning the JSON in the first place?

Looking at the JSON

Copy the JSON response and paste it into this web page:

https://jqplay.org/

Paste it into the box labeled "JSON". In the box labeled "filter" enter ., a single period.

You will see a formatted version of the JSON appear in the box on the right. JSON is a simple way of structuring data to send between computers. Since it is text-based, it is easy for people to inspect. However, it has no support for comments, so it is not ideal for files that a human edits on an ongoing basis, such as configuration files.

There are 6 kinds of values in JSON:

  • numbers
  • strings
  • true/false
  • null
  • objects
  • arrays
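As a quick check of the list above, jq can report the kind of each value with its built-in type function. This is a small sketch using made-up sample values; note that jq calls true/false "boolean".

```shell
# One sample of each kind of JSON value, collected in an array.
samples='[42, "mosquito", true, null, {"key": "value"}, ["a", "b"]]'

# map(type) replaces each element with the name jq gives its kind.
kinds=$(printf '%s\n' "$samples" | jq -c 'map(type)')
echo "$kinds"
# prints ["number","string","boolean","null","object","array"]
```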

Most JSON responses are an object, which is indicated by a matching pair of curly braces, {}. Inside the curly braces of an object there is a list of key-value pairs separated by commas.

All of the information in the HTML record should also appear in the JSON record.

Try entering .title in the Filter box. You should see the following JSON:

"Citizen science provides a reliable and scalable tool to track disease-carrying mosquitoes"

Now try .mesh. You should see a big list. Now do .mesh[3]:

{
  "descriptor_ui": "D009032",
  "descriptor_name": "Mosquito Control",
  "qualifier_ui": "Q000379",
  "qualifier_name": "methods",
  "is_major_topic": true
}

The filter box takes a pattern and returns the pieces of the input that match.
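A few more filter patterns, sketched against a cut-down sample instead of the full record. The title and the first MeSH entry are copied from the record above; the second entry is a made-up placeholder so that there is something to filter out.

```shell
# Cut-down sample: the second mesh entry is a made-up placeholder.
sample='{
  "title": "Citizen science provides a reliable and scalable tool to track disease-carrying mosquitoes",
  "mesh": [
    {"descriptor_ui": "D009032", "descriptor_name": "Mosquito Control", "is_major_topic": true},
    {"descriptor_ui": "X000000", "descriptor_name": "Placeholder Heading", "is_major_topic": false}
  ]
}'

# .key descends into an object; [n] picks an array element, counting from 0.
first_name=$(printf '%s\n' "$sample" | jq -r '.mesh[0].descriptor_name')

# .mesh[] emits every element; select(...) keeps those where the test is true.
major=$(printf '%s\n' "$sample" | jq -r '.mesh[] | select(.is_major_topic) | .descriptor_ui')

# length counts the elements of an array.
count=$(printf '%s\n' "$sample" | jq '.mesh | length')

echo "$first_name / $major / $count"
# prints Mosquito Control / D009032 / 2
```

The -r option prints strings without the surrounding quotes, which is handy when the output feeds another command.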

MeSH (Medical Subject Headings) is a vocabulary of subject headings curated by the National Library of Medicine. Let's look up this term:

https://www.ncbi.nlm.nih.gov/mesh/

Search for D009032. This lets us share topic headings with others and we can all agree on what they mean. We can also agree on the codes used to represent each topic.

Vocabularies like MeSH are very useful, but each takes effort to develop and they all have a defined scope. Another useful place to define shared terms is Wikidata.

https://www.wikidata.org

And we can also find the Wikidata term for MeSH:

https://www.wikidata.org/wiki/Q2003646

Again on the command line

Now let's do all this on the command line.

curl -H 'Accept: application/json' 'https://api.openalex.org/works/w2764299839'

This just returns the redirect. We need to ask "curl" to follow the redirects:

curl -L -H 'Accept: application/json' 'https://api.openalex.org/works/w2764299839'
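When only the outcome matters, curl can print the final response code and URL instead of the body. The following is a sketch, not part of the original exercise: final_status is a hypothetical helper name, and the header it sends uses the standard name, Accept.

```shell
# Print the final status code and URL for a request, discarding the body.
# -s silences the progress meter, -L follows redirects, -o /dev/null drops
# the body, and -w prints the named variables after the transfer finishes.
final_status() {
  curl -s -L -o /dev/null \
    -H 'Accept: application/json' \
    -w '%{http_code} %{url_effective}\n' \
    "$1"
}

# Usage (needs network access):
#   final_status 'https://api.openalex.org/works/w2764299839'
```

Leaving out -L in the function would report the code of the first response only, which is a quick way to see whether a URL redirects.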

We can see more information being passed with the -v "verbose" option.

curl -v -H 'Accept: application/json' 'https://api.openalex.org/works/w2764299839' 2>&1 | less

Note that the request is on lines starting with a ">" and the response headers are on lines starting with "<".

Let's save the JSON response:

curl -L -H 'Accept: application/json' 'https://api.openalex.org/works/w2764299839' > mosq.json

The jq tool can work on the command line as well.

jq '.mesh[3]' mosq.json
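If a saved file turns out to hold an HTML page or an empty redirect body rather than JSON, jq will fail to parse it. A small sketch of checking for that before filtering, using temporary stand-in files so it does not touch mosq.json; is_json is a hypothetical helper name.

```shell
# Stand-ins for a saved response: one valid file, one HTML page.
tmpdir=$(mktemp -d)
printf '%s\n' '{"title": "example"}' > "$tmpdir/good.json"
printf '%s\n' '<html>not JSON</html>' > "$tmpdir/bad.json"

# The "empty" filter produces no output; jq still exits non-zero
# when it cannot parse the input, so the exit status is the answer.
is_json() {
  jq empty "$1" >/dev/null 2>&1
}

for f in "$tmpdir/good.json" "$tmpdir/bad.json"; do
  if is_json "$f"; then
    echo "$(basename "$f"): valid JSON"
  else
    echo "$(basename "$f"): not JSON"
  fi
done

rm -rf "$tmpdir"
```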

Resources

Software Architecture

Jeff Bezos on two types of decisions:

Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups.