R/cdx_basic.r
cdx_basic_query.Rd
CDX files are "Content Index" files. The Wayback CDX server is a standalone HTTP servlet that serves the index the Wayback Machine uses to look up captures.
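Because the CDX server is queried over HTTP, a request ultimately comes down to a URL with query parameters. The following is a minimal sketch of how such a URL could be assembled in base R. The helper name `build_cdx_url`, the endpoint, and the parameter names (`url`, `matchType`, `collapse`, `filter`, `limit`) are assumptions based on the public CDX server API; they are not taken from this package's source, and the package may build its requests differently.

```r
# Hypothetical helper (not part of this package): assemble a CDX server
# query URL. Endpoint and parameter names are assumptions based on the
# public CDX server API.
build_cdx_url <- function(url, match_type = "exact", collapse = "urlkey",
                          filter = "statuscode:200", limit = 10000L,
                          endpoint = "http://web.archive.org/cdx/search/cdx") {
  params <- c(
    url       = url,
    matchType = match_type,
    collapse  = collapse,
    filter    = filter,
    limit     = sprintf("%d", as.integer(limit))
  )
  # Percent-encode every value, including reserved characters such as ":" and "/".
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(endpoint, "?", paste(names(encoded), encoded, sep = "=", collapse = "&"))
}

# Build (but do not send) a query for the first 5 captures.
query_url <- build_cdx_url("https://www.r-project.org/", limit = 5L)
print(query_url)
```

The sketch only constructs the string; no network request is made, so it can be run offline to inspect what a query would look like.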
cdx_basic_query(url, match_type = c("exact", "prefix", "host", "domain"), collapse = "urlkey", filter = "statuscode:200", limit = 10000L)
Argument | Description |
---|---|
url | URL/resource to query for. |
match_type | Besides exact matches, the CDX server can also return results matching a certain prefix, a certain host, or all subdomains. Can be one of "exact", "prefix", "host", or "domain". |
collapse | Collapse results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines: all captures after the first one that are duplicates are filtered out. This is useful for filtering out captures that are "too dense" or when looking for unique captures. To use collapsing, add one or more |
filter | A valid filter string (without the |
limit | Maximum number of results to return (first n results). Use a negative number to retrieve the last n results. Default is 10000L. |
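The collapsing rule described above (drop a capture when it duplicates the line directly before it) can be illustrated locally. This is a rough base R sketch of the idea, not the server's implementation. The helper `collapse_adjacent`, its `n` argument, and the sample rows are hypothetical and made up for illustration.

```r
# Hypothetical helper: keep the first row of each run of adjacent rows whose
# `field` value (optionally only its first `n` characters) is identical.
collapse_adjacent <- function(df, field, n = NULL) {
  key <- as.character(df[[field]])
  if (!is.null(n)) key <- substr(key, 1, n)
  # A row is kept when it is the first row or its key differs from the row above.
  keep <- c(TRUE, key[-1] != key[-length(key)])[seq_along(key)]
  df[keep, , drop = FALSE]
}

# Made-up captures, already sorted by timestamp (values are illustrative only).
captures <- data.frame(
  timestamp = c("20000620000000", "20000620120000", "20000816000000",
                "20001012000000", "20001012060000"),
  digest = c("AAA", "AAA", "BBB", "BBB", "AAA"),
  stringsAsFactors = FALSE
)

# Collapse on the whole digest field: adjacent repeats are dropped.
by_digest <- collapse_adjacent(captures, "digest")

# Collapse on the first 8 characters of the timestamp (one capture per day).
by_day <- collapse_adjacent(captures, "timestamp", n = 8)

print(by_digest)
print(by_day)
```

Because only adjacent lines are compared, the final "AAA" row survives in `by_digest` even though the same digest appeared earlier. Collapsing is therefore not the same as removing all duplicates.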
A data frame of matching captures.
The index format is known as 'cdx' and contains various fields representing the capture, usually sorted by URL and date. See http://archive.org/web/researcher/cdx_file_format.php for the file format.
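To make the field layout concrete, here is a small sketch that splits a single index line into the seven fields shown in the example output for this function. It assumes space-delimited fields in that order and a 14-digit `yyyyMMddhhmmss` timestamp. The sample line is stitched together from values in the example output, with a made-up time of day, and is not a real server response.

```r
# Field names as they appear in the example output for cdx_basic_query().
fields <- c("urlkey", "timestamp", "original", "mimetype",
            "statuscode", "digest", "length")

# Illustrative line only: assumes space-delimited fields; the time of day
# in the timestamp is made up.
cdx_line <- paste(
  "org,r-project)/", "20000620000000", "http://www.r-project.org:80/",
  "text/html", "200", "XDIHHFDLIWSZFHYHT453ZL5FYPCKFF6Z", "4894"
)

parts <- strsplit(cdx_line, " ", fixed = TRUE)[[1]]
record <- as.list(stats::setNames(parts, fields))

# Convert the assumed 14-digit timestamp and the length to typed values.
record$timestamp <- as.POSIXct(record$timestamp,
                               format = "%Y%m%d%H%M%S", tz = "UTC")
record$length <- as.numeric(record$length)

str(record)
```

Status codes are left as character strings here, matching the `<chr>` column type in the example output.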
# NOT RUN {
rproj_basic <- cdx_basic_query("https://www.r-project.org/")

dplyr::glimpse(rproj_basic)
## Observations: 10,000
## Variables: 7
## $ urlkey <chr> "org,r-project)/", "org,r-project)/", "org,r-project)/"...
## $ timestamp <dttm> 2000-06-20, 2000-08-16, 2000-10-12, 2000-11-10, 2000-1...
## $ original <chr> "http://www.r-project.org:80/", "http://www.r-project.o...
## $ mimetype <chr> "text/html", "text/html", "text/html", "text/html", "te...
## $ statuscode <chr> "200", "200", "200", "200", "200", "200", "200", "200",...
## $ digest <chr> "XDIHHFDLIWSZFHYHT453ZL5FYPCKFF6Z", "SRO3WSKQS6HST4PQY7...
## $ length <dbl> 4894, 5027, 589, 581, 582, 596, 590, 592, 592, 592, 563...
# }