Perform a basic/limited Internet Archive CDX resource query for a URL

CDX files are "Content Index" files. The Wayback CDX server is a standalone HTTP servlet that serves the index the Wayback Machine uses to look up captures.

cdx_basic_query(url, match_type = c("exact", "prefix", "host", "domain"),
  collapse = "urlkey", filter = "statuscode:200", limit = 10000L)

Arguments

url	URL/resource to query for
match_type	Type of match to perform. Besides exact matches, the CDX server can return results matching a certain prefix, a certain host, or all subdomains. Can be one of `"exact"`, `"prefix"`, `"host"`, or `"domain"` (defaults to `"exact"`).
collapse	Collapse results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines: within a run of duplicates, every capture after the first is filtered out. This is useful for thinning out captures that are 'too dense' or when looking for unique captures. Specify the field as `field`, or as `field:N` where `N` is the number of leading characters of the field to compare. Use `NULL` for no collapsing. The default is to collapse by `urlkey` (like the web UX). Reference: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server.
filter	A valid filter string (without the `filter=` prefix) or `NULL`. The default filter string is `statuscode:200`, which retrieves only resources with an HTTP `200` (`OK`) status code. Set to `NULL` for no filtering.
limit	Maximum number of results to return (first n results). Use a negative number to retrieve the last n results. Default is `10,000`.
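To make the arguments above more concrete, the following is a minimal sketch of how they might map onto a CDX server query string. It is not the package's actual implementation: `build_cdx_url()` is a hypothetical helper introduced only for illustration, and the endpoint and the `matchType` parameter name are taken from the CDX server documentation referenced above.

```r
# Hypothetical helper (not part of the package): builds a CDX server
# query URL from the same arguments that cdx_basic_query() accepts.
build_cdx_url <- function(url, match_type = "exact", collapse = "urlkey",
                          filter = "statuscode:200", limit = 10000L) {
  params <- list(
    url       = url,
    matchType = match_type,
    collapse  = collapse,
    filter    = filter,
    # Avoid scientific notation (e.g. "1e+05") for large limits
    limit     = format(limit, scientific = FALSE, trim = TRUE)
  )

  # Drop NULL entries so collapse = NULL or filter = NULL disables that option
  params <- Filter(Negate(is.null), params)

  # Percent-encode each value and join as key=value pairs
  encoded <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  query <- paste0(names(encoded), "=", encoded, collapse = "&")

  # Assumed endpoint, as given in the CDX server documentation
  paste0("http://web.archive.org/cdx/search/cdx?", query)
}

# All captures under a path prefix, no collapsing, last 5 results
build_cdx_url("r-project.org/", match_type = "prefix",
              collapse = NULL, limit = -5)
```

Passing `NULL` simply omits the corresponding parameter from the request, which is one plausible way to realize the "no collapsing" and "no filtering" behavior described above.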

Value

data frame

Details

The index format is known as 'cdx' and contains various fields representing the capture, usually sorted by URL and date. See http://archive.org/web/researcher/cdx_file_format.php for the file format.
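As a rough illustration of those fields, the sketch below splits a single space-delimited CDX line into the seven columns shown in the example output. The sample line is illustrative only: most values are borrowed from the example on this page, but the time-of-day part of the timestamp is made up, and the assumption that fields are space-separated in this order reflects the CDX server's default plain-text output rather than anything this function guarantees.

```r
# Illustrative CDX line; the time-of-day part of the timestamp is invented
cdx_line <- paste(
  "org,r-project)/", "20000620000000", "http://www.r-project.org:80/",
  "text/html", "200", "XDIHHFDLIWSZFHYHT453ZL5FYPCKFF6Z", "4894"
)

# Assumed field order, matching the columns of the returned data frame
fields <- c("urlkey", "timestamp", "original", "mimetype",
            "statuscode", "digest", "length")

# Split on single spaces and name each piece
parts <- strsplit(cdx_line, " ", fixed = TRUE)[[1]]
rec <- as.list(stats::setNames(parts, fields))

# Convert the 14-digit timestamp and the byte length to richer types
rec$timestamp <- as.POSIXct(rec$timestamp, format = "%Y%m%d%H%M%S", tz = "UTC")
rec$length <- as.numeric(rec$length)

str(rec)
```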

Examples

# NOT RUN {
rproj_basic <- cdx_basic_query("https://www.r-project.org/")

dplyr::glimpse(rproj_basic)
## Observations: 10,000
## Variables: 7
## $ urlkey     <chr> "org,r-project)/", "org,r-project)/", "org,r-project)/"...
## $ timestamp  <dttm> 2000-06-20, 2000-08-16, 2000-10-12, 2000-11-10, 2000-1...
## $ original   <chr> "http://www.r-project.org:80/", "http://www.r-project.o...
## $ mimetype   <chr> "text/html", "text/html", "text/html", "text/html", "te...
## $ statuscode <chr> "200", "200", "200", "200", "200", "200", "200", "200",...
## $ digest     <chr> "XDIHHFDLIWSZFHYHT453ZL5FYPCKFF6Z", "SRO3WSKQS6HST4PQY7...
## $ length     <dbl> 4894, 5027, 589, 581, 582, 596, 590, 592, 592, 592, 563...
# }
