When you make a resource query in the main Wayback web interface you're tapping into the Wayback CDX API. The `cdx_basic_query()` function in this package is a programmatic interface to that API.
An example use-case was presented in GitHub Issue #3, where the issue originator wanted to query for historical CSV files from MaxMind.
The `cdx_basic_query()` function has the following parameters and defaults (you must at least specify the `url` on your own):

- `match_type`: `"exact"` (exact URL search)
- `collapse`: `"urlkey"` (only show unique URLs)
- `filter`: `"statuscode:200"` (only show resources with an HTTP 200 response code)
- `limit`: `1e4L` (return 10,000 records; you can go higher)

For `match_type`, if `url` is `archive.org/about/` then:
- `"exact"` will return results matching exactly `archive.org/about/`
- `"prefix"` will return all results under the path `archive.org/about/`
- `"host"` will return results from the host `archive.org`
- `"domain"` will return results from the host `archive.org` and all subhosts `*.archive.org`
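To make those scopes concrete, here is a sketch of what each `match_type` looks like in a call (illustrative only; what actually comes back depends on what the Internet Archive has crawled):

```r
library(wayback)

# same URL, four different scopes
cdx_basic_query("archive.org/about/", match_type = "exact")   # that page only
cdx_basic_query("archive.org/about/", match_type = "prefix")  # everything under /about/
cdx_basic_query("archive.org/about/", match_type = "host")    # all of archive.org
cdx_basic_query("archive.org/about/", match_type = "domain")  # archive.org + subdomains
```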
For `collapse`, the results returned are "collapsed" based on a field, or a substring of a field. Collapsing is performed on adjacent CDX lines: all captures after the first one that duplicate the collapse value are filtered out. This is useful for filtering out captures that are "too dense" or when looking for unique captures.
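For instance, collapsing on a prefix of the `timestamp` field is a common way to thin out dense captures. The sketch below assumes the `collapse` value is passed straight through to the CDX API:

```r
# keep at most one capture per day
# (first 8 timestamp digits = YYYYMMDD)
cdx_basic_query("archive.org/about/", collapse = "timestamp:8")
```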
For now, `filter` is limited to a single expression. This will be enhanced at a later time.
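As an example, you could swap the default HTTP-200 filter for a MIME-type one (again assuming the expression is handed verbatim to the CDX API, which uses `field:regex` syntax):

```r
# keep only HTML captures instead of only HTTP-200 captures
cdx_basic_query("archive.org/about/", filter = "mimetype:text/html")
```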
To put the use-case into practice, we'll find CSV resources and download one of them:
```r
library(wayback)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
```
```r
# query for the maxmind prefix
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")

# filter the returned results for CSV files
(csv <- filter(cdx, grepl("\\.csv", original)))
#> # A tibble: 9 x 7
#>   urlkey   timestamp           original  mimetype statuscode digest length
#>   <chr>    <dttm>              <chr>     <chr>    <chr>      <chr>   <dbl>
#> 1 com,max… 2009-10-18 00:00:00 http://w… text/pl… 200        WGT2V… 9.57e2
#> 2 com,max… 2003-02-23 00:00:00 http://w… text/pl… 200        2QUN2… 5.60e2
#> 3 com,max… 2003-02-23 00:00:00 http://w… text/pl… 200        NTF24… 7.86e2
#> 4 com,max… 2006-01-11 00:00:00 http://w… text/pl… 200        OFDEL… 1.21e6
#> 5 com,max… 2006-06-20 00:00:00 http://w… text/pl… 200        3INKO… 1.16e6
#> 6 com,max… 2007-11-11 00:00:00 http://w… text/pl… 200        E2AT3… 2.95e6
#> 7 com,max… 2008-07-09 00:00:00 http://w… text/pl… 200        4YRNZ… 3.76e6
#> 8 com,max… 2008-08-13 00:00:00 http://w… text/pl… 200        HG7GQ… 3.85e6
#> 9 com,max… 2014-03-02 00:00:00 http://w… text/pl… 200        MW7F7… 3.26e4
```
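We filter for CSVs client-side because `filter` takes only one expression and the default is already spent on `statuscode:200`. If you didn't care about status codes, a server-side equivalent might look like the following (an untested sketch against the CDX API's `field:regex` filter syntax):

```r
# push the CSV match to the server instead of filtering locally
cdx_basic_query("http://maxmind.com/", "prefix", filter = "original:.*\\.csv.*")
```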
```r
# examine a couple of fields
csv$original[9]
#> [1] "http://www.maxmind.com:80/download/geoip/misc/region_codes.csv"
csv$timestamp[9]
#> [1] "2014-03-02 EST"
```
```r
# read the resource from that point in time using the "raw"
# interface so as not to mangle the data
dat <- read_memento(csv$original[9], as.POSIXct(csv$timestamp[9]), "raw")

# read it in
readr::read_csv(dat, col_names = c("iso2c", "regcod", "name"))
#> Parsed with column specification:
#> cols(
#>   iso2c = col_character(),
#>   regcod = col_character(),
#>   name = col_character()
#> )
#> # A tibble: 4,066 x 3
#>    iso2c regcod name
#>    <chr> <chr>  <chr>
#>  1 AD    02     Canillo
#>  2 AD    03     Encamp
#>  3 AD    04     La Massana
#>  4 AD    05     Ordino
#>  5 AD    06     Sant Julia de Loria
#>  6 AD    07     Andorra la Vella
#>  7 AD    08     Escaldes-Engordany
#>  8 AE    01     Abu Dhabi
#>  9 AE    02     Ajman
#> 10 AE    03     Dubai
#> # ... with 4,056 more rows
```
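If you want to keep the recovered resource, assign the parsed result and write it back out; a minimal sketch (the output filename is made up):

```r
region_codes <- readr::read_csv(dat, col_names = c("iso2c", "regcod", "name"))
readr::write_csv(region_codes, "maxmind-region-codes-20140302.csv")
```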