When you make a resource query in the main Wayback web interface you're tapping into the Wayback CDX API. The `cdx_basic_query()` function in this package is a programmatic interface to that API.
An example use-case was presented in GitHub Issue #3, where the issue originator wanted to query for historical CSV files from MaxMind.
The `cdx_basic_query()` function has the following parameters and defaults (you must at least specify the `url` on your own):

- `match_type`: `"exact"` (exact URL search)
- `collapse`: `"urlkey"` (only show unique URLs)
- `filter`: `"statuscode:200"` (only show resources with an HTTP 200 response code)
- `limit`: `1e4L` (return 10,000 records; you can go higher)

For `match_type`, if `url` is `archive.org/about/` then:
- `"exact"` will return results matching exactly `archive.org/about/`
- `"prefix"` will return all results under the path `archive.org/about/`
- `"host"` will return results from the host `archive.org`
- `"domain"` will return results from the host `archive.org` and all subhosts `*.archive.org`
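To make those scopes concrete, here is a sketch of what each `match_type` looks like in a call (illustrative only; what actually comes back depends on what the Internet Archive has crawled):

```r
library(wayback)

# same URL, four different scopes
cdx_basic_query("archive.org/about/", match_type = "exact")   # that page only
cdx_basic_query("archive.org/about/", match_type = "prefix")  # everything under /about/
cdx_basic_query("archive.org/about/", match_type = "host")    # all of archive.org
cdx_basic_query("archive.org/about/", match_type = "domain")  # archive.org + subdomains
```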
For `collapse`, the results returned are "collapsed" based on a field, or a substring of a field. Collapsing is performed on adjacent CDX lines: all captures after the first one that duplicate the collapse value are filtered out. This is useful for filtering out captures that are "too dense" or when looking for unique captures.
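For instance, collapsing on a prefix of the `timestamp` field is a common way to thin out dense captures. The sketch below assumes the `collapse` value is passed straight through to the CDX API:

```r
# keep at most one capture per day
# (first 8 timestamp digits = YYYYMMDD)
cdx_basic_query("archive.org/about/", collapse = "timestamp:8")
```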
For now, `filter` is limited to a single expression. This will be enhanced at a later time.
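As an example, you could swap the default HTTP-200 filter for a MIME-type one (again assuming the expression is handed verbatim to the CDX API, which uses `field:regex` syntax):

```r
# keep only HTML captures instead of only HTTP-200 captures
cdx_basic_query("archive.org/about/", filter = "mimetype:text/html")
```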
To put the use-case into practice, we'll find CSV resources and download one of them:
```r
library(wayback)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
```
```r
# query for the maxmind prefix
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")

# filter the returned results for CSV files
(csv <- filter(cdx, grepl("\\.csv", original)))
#> # A tibble: 9 x 7
#>   urlkey   timestamp           original  mimetype statuscode digest length
#>   <chr>    <dttm>              <chr>     <chr>    <chr>      <chr>   <dbl>
#> 1 com,max… 2009-10-18 00:00:00 http://w… text/pl… 200        WGT2V… 9.57e2
#> 2 com,max… 2003-02-23 00:00:00 http://w… text/pl… 200        2QUN2… 5.60e2
#> 3 com,max… 2003-02-23 00:00:00 http://w… text/pl… 200        NTF24… 7.86e2
#> 4 com,max… 2006-01-11 00:00:00 http://w… text/pl… 200        OFDEL… 1.21e6
#> 5 com,max… 2006-06-20 00:00:00 http://w… text/pl… 200        3INKO… 1.16e6
#> 6 com,max… 2007-11-11 00:00:00 http://w… text/pl… 200        E2AT3… 2.95e6
#> 7 com,max… 2008-07-09 00:00:00 http://w… text/pl… 200        4YRNZ… 3.76e6
#> 8 com,max… 2008-08-13 00:00:00 http://w… text/pl… 200        HG7GQ… 3.85e6
#> 9 com,max… 2014-03-02 00:00:00 http://w… text/pl… 200        MW7F7… 3.26e4
```
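We filter for CSVs client-side because `filter` takes only one expression and the default is already spent on `statuscode:200`. If you didn't care about status codes, a server-side equivalent might look like the following (an untested sketch against the CDX API's `field:regex` filter syntax):

```r
# push the CSV match to the server instead of filtering locally
cdx_basic_query("http://maxmind.com/", "prefix", filter = "original:.*\\.csv.*")
```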
```r
# examine a couple of fields
csv$original[9]
#> [1] "http://www.maxmind.com:80/download/geoip/misc/region_codes.csv"
csv$timestamp[9]
#> [1] "2014-03-02 EST"
```
```r
# read the resource from that point in time using the "raw"
# interface so as not to mangle the data
dat <- read_memento(csv$original[9], as.POSIXct(csv$timestamp[9]), "raw")

# read it in
readr::read_csv(dat, col_names = c("iso2c", "regcod", "name"))
#> Parsed with column specification:
#> cols(
#>   iso2c = col_character(),
#>   regcod = col_character(),
#>   name = col_character()
#> )
#> # A tibble: 4,066 x 3
#>    iso2c regcod name
#>    <chr> <chr>  <chr>
#>  1 AD    02     Canillo
#>  2 AD    03     Encamp
#>  3 AD    04     La Massana
#>  4 AD    05     Ordino
#>  5 AD    06     Sant Julia de Loria
#>  6 AD    07     Andorra la Vella
#>  7 AD    08     Escaldes-Engordany
#>  8 AE    01     Abu Dhabi
#>  9 AE    02     Ajman
#> 10 AE    03     Dubai
#> # ... with 4,056 more rows
```
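If you want to keep the recovered resource, assign the parsed result and write it back out; a minimal sketch (the output filename is made up):

```r
region_codes <- readr::read_csv(dat, col_names = c("iso2c", "regcod", "name"))
readr::write_csv(region_codes, "maxmind-region-codes-20140302.csv")
```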