vignettes/intro-to-mementos.Rmd
We can use wayback to look at many kinds of historical web content. Here’s how to dive into saved site RSS feeds. We’ll search the Internet Archive for these saved RSS feed documents (“mementos”). Web archiving terminology can be a bit arcane at times (IMO), so http://www.mementoweb.org/guide/quick-intro/ and https://mementoweb.org/guide/rfc/ are good starter resources if you want to read up on it.
First, we get the recorded mementos (basically a short-list of relevant content):
library(wayback)
(rss <- get_mementos("http://www.dailyecho.co.uk/news/district/winchester/rss/"))
#> # A tibble: 7 x 3
#> link rel ts
#> <chr> <chr> <dttm>
#> 1 http://www.dailyecho.co.uk/news/district/w… original NA
#> 2 http://web.archive.org/web/timemap/link/ht… timemap NA
#> 3 http://web.archive.org/web/http://www.dail… timegate NA
#> 4 http://web.archive.org/web/20090517035444/… first m… 2009-05-17 03:54:44
#> 5 http://web.archive.org/web/20180712045741/… prev me… 2018-07-12 04:57:41
#> 6 http://web.archive.org/web/20180812213013/… memento 2018-08-12 21:30:13
#> 7 http://web.archive.org/web/20180812213013/… last me… 2018-08-12 21:30:13
The calendar-style viewer on the Internet Archive (IA) site is really a view of the “timemap”. I like to work with the timemap because it lists the memento for every crawl, each with its point-in-time timestamp. It’s the second link above, so we’ll read it in:
(tm <- get_timemap(rss$link[2]))
#> # A tibble: 46 x 5
#> rel link type from datetime
#> <chr> <chr> <chr> <chr> <chr>
#> 1 original http://www.dailyecho.co… <NA> <NA> <NA>
#> 2 self http://web.archive.org/… applicat… Sun, 17 … <NA>
#> 3 timegate http://web.archive.org <NA> <NA> <NA>
#> 4 first memento http://web.archive.org/… <NA> <NA> Sun, 17 May…
#> 5 memento http://web.archive.org/… <NA> <NA> Thu, 13 Aug…
#> 6 memento http://web.archive.org/… <NA> <NA> Thu, 12 Nov…
#> 7 memento http://web.archive.org/… <NA> <NA> Tue, 12 Jan…
#> 8 memento http://web.archive.org/… <NA> <NA> Mon, 12 Jul…
#> 9 memento http://web.archive.org/… <NA> <NA> Sat, 27 Nov…
#> 10 memento http://web.archive.org/… <NA> <NA> Wed, 29 Jun…
#> # ... with 36 more rows
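Indexing by position (as with rss$link[2]) works, but it depends on row order, so selecting on the `rel` column can be a little more robust. The sketch below runs on a small hand-built data frame that mimics the timemap columns shown above rather than on the real `tm`. The links are placeholders, the second timestamp is made up, and the full `datetime` format is an assumption (the printed output truncates it), modelled on the usual HTTP date style. Parsing month names with `%b` also assumes an English locale.

```r
# Hand-built stand-in for a few rows of `tm`. The links are placeholders and
# the datetime format is an assumption, since the printed output is truncated.
tm_demo <- data.frame(
  rel = c("original", "first memento", "memento"),
  link = c(
    "http://example.org/rss/",
    "http://example.org/archive/1/",
    "http://example.org/archive/2/"
  ),
  datetime = c(NA, "Sun, 17 May 2009 03:54:44 GMT", "Fri, 01 Jan 2010 00:00:00 GMT"),
  stringsAsFactors = FALSE
)

# Select a row by its relation type instead of by position.
first_link <- tm_demo$link[tm_demo$rel == "first memento"]

# Keep only rows that are actual mementos ("memento", "first memento", ...).
mementos <- tm_demo[grepl("memento", tm_demo$rel, fixed = TRUE), ]

# Drop the weekday and the zone suffix, then parse the rest as UTC.
stamp <- sub("^[A-Za-z]+, ", "", mementos$datetime)
stamp <- sub(" GMT$", "", stamp)
mementos$ts <- as.POSIXct(stamp, format = "%d %b %Y %H:%M:%S", tz = "UTC")

# Example of filtering by capture time: mementos taken before 2010.
early <- mementos[mementos$ts < as.POSIXct("2010-01-01", tz = "UTC"), ]
```

With a real timemap, the same filter makes it easy to fetch only the crawls from the period you care about instead of all of them.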
The content lives in the mementos themselves, and the timemap should list as many mementos as you see in the calendar view. We’ll read in the first one (row 4 of tm, the “first memento”) with read_memento() and keep the result as mem.
Ideally, use writeLines() now to save this to disk with a good filename. Alternatively, stick it in a data frame with metadata and saveRDS() it. But that’s not a format others (outside R) can use, so perhaps do the data frame thing and stream it out as ndjson with jsonlite::stream_out() and compress it during save or afterwards.
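Both storage options can be sketched as follows. This is a minimal illustration, not the package's own workflow: `mem` is a placeholder string standing in for the fetched memento text, the timestamp is taken from the first memento's URL above, and the filenames and the `mem_df` columns are made-up conventions.

```r
library(jsonlite)

# Placeholder standing in for the text of a memento fetched earlier.
mem <- "<rss><channel><title>Example feed</title></channel></rss>"
# Capture timestamp of the memento, used to build a descriptive filename.
ts <- "20090517035444"

out_dir <- tempfile("mementos-")
dir.create(out_dir)

# Option 1: a plain text file named after the capture timestamp.
xml_path <- file.path(out_dir, paste0("winchester-rss-", ts, ".xml"))
writeLines(mem, xml_path)

# Option 2: a one-row-per-memento data frame, streamed out as ndjson and
# gzip-compressed on the fly by writing through a gzfile() connection.
mem_df <- data.frame(ts = ts, content = mem, stringsAsFactors = FALSE)
ndjson_path <- file.path(out_dir, "mementos.json.gz")
stream_out(mem_df, gzfile(ndjson_path), verbose = FALSE)

# Reading it back later, without touching the network.
restored <- stream_in(gzfile(ndjson_path), verbose = FALSE)
```

The ndjson route keeps the metadata next to the content and stays readable from other languages, which is the main reason to prefer it over saveRDS().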
Then convert it to something we can use programmatically with xml2::read_xml() or xml2::read_html() (RSS is sometimes better parsed as XML):
library(xml2)
# Parse the memento text as XML and pull out every <title> node
xml_find_all(
  read_xml(mem),
  ".//title"
)
#> {xml_nodeset (52)}
#> [1] <title>Daily Echo | Winchester</title>
#> [2] <title>Daily Echo | Winchester</title>
#> [3] <title>Flasher exposes himself to woman jogger near Winchester</title>
#> [4] <title>Man arrested for string of burglaries across Hampshire</title>
#> [5] <title>Winchester man jailed for restaurant burglary</title>
#> [6] <title>Anti-social behaviour on the rise in Alresford</title>
#> [7] <title>Moped rider banned for drug driving</title>
#> [8] <title>Shoplifter ordered to pay compensation to Boots</title>
#> [9] <title>Councillor applies for change of use of GP surgery</title>
#> [10] <title>Teenager convicted of string of offences</title>
#> [11] <title>Merrydale respite centre set to be axed despite overwhelming ...
#> [12] <title>Drink driver receives suspended jail sentence for failing to ...
#> [13] <title>Hospital staff anger at order to use park+ride</title>
#> [14] <title>Winchester man caught flashing at nature reserve</title>
#> [15] <title>Two Hampshire Parkrun events cancelled after storms and torr ...
#> [16] <title>County council buys its first electric vehicles - and hopes ...
#> [17] <title>Hospital operations start to be cancelled across Hampshire</ ...
#> [18] <title>Owners of bar left 'devastated' after front window is smashe ...
#> [19] <title>Spring into Theatre Royal Winchester season</title>
#> [20] <title>Cheese and Chilli Festival returns to Winchester</title>
#> ...
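Note that ".//title" also matches feed-level titles (the first two results above), not just the stories. One way to get a tidy table of stories is to select the `<item>` nodes first and then read fields relative to each item. The sketch below uses a tiny hand-written feed so it runs offline. Two titles are borrowed from the output above, the link is a placeholder, and the assumption is that the archived feed follows the usual RSS 2.0 item layout.

```r
library(xml2)

# Tiny hand-written feed; the link is a placeholder, not a real article URL.
feed <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Daily Echo | Winchester</title>
    <item>
      <title>Winchester man jailed for restaurant burglary</title>
      <link>http://example.org/story/1</link>
    </item>
    <item>
      <title>Moped rider banned for drug driving</title>
    </item>
  </channel>
</rss>'

doc <- read_xml(feed)
items <- xml_find_all(doc, ".//item")

# xml_find_first() returns one result per item, so a missing <link> becomes
# NA instead of shifting the columns out of alignment.
stories <- data.frame(
  title = xml_text(xml_find_first(items, "./title")),
  link = xml_text(xml_find_first(items, "./link")),
  stringsAsFactors = FALSE
)
```

Here `stories` has one row per item, and the channel-level title is excluded because it is not inside an `<item>`.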
read_memento() has an as parameter to automagically parse the result, but I like to store the mementos locally (as noted previously) so as not to abuse the IA servers (i.e., if I ever need to get the data again, I don’t have to hit their infrastructure).
A big caveat: if you try to get too many resources from the IA in a short period of time, you’ll get temporarily banned. They operate at scale, but it’s a free service and they (rightfully) try to prevent abuse.
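One way to combine local caching with polite pacing is a small wrapper that checks a cache directory first and sleeps before any real request. This is only a sketch: `fetch_cached()`, its arguments, and the filename scheme are hypothetical helpers and not part of wayback, and the default delay is an arbitrary choice rather than a limit published by the IA. The retrieval function is passed in as `fetch_fun`, so the demonstration below uses a fake fetcher and never touches the network.

```r
# Fetch `url` via `fetch_fun` unless a cached copy already exists in `cache_dir`.
# `fetch_fun` is whatever function actually retrieves the text; `delay` is an
# arbitrary pause in seconds, not an officially documented rate limit.
fetch_cached <- function(url, cache_dir, fetch_fun, delay = 5) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  # Turn the URL into a filesystem-safe, deterministic filename.
  path <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", url), ".txt"))
  if (file.exists(path)) {
    # Cache hit: no request and no delay.
    return(paste(readLines(path, warn = FALSE), collapse = "\n"))
  }
  Sys.sleep(delay) # pause before every real request
  content <- fetch_fun(url)
  writeLines(content, path)
  content
}

# Demonstration with a fake fetcher that counts how often it is called.
calls <- 0
fake_fetch <- function(url) {
  calls <<- calls + 1
  paste("content for", url)
}

cache_dir <- tempfile("memento-cache-")
urls <- c("http://example.org/a", "http://example.org/b", "http://example.org/a")
results <- vapply(urls, fetch_cached, character(1),
                  cache_dir = cache_dir, fetch_fun = fake_fetch, delay = 0.1)
```

The repeated URL is served from disk, so the fake fetcher runs only twice for three lookups. In real use you would pass a function that reads a memento as `fetch_fun` and choose a more conservative delay.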