December 8, 2021

Archiving read-only versions of websites using curl

Sometimes a project is finished. The chapter is over. A campaign is won or lost, an app is retired, or you don't want to keep something up-to-date but you want a historical archive.

wget -mkx -e robots=off http://example.com

wget --mirror \
 --convert-links \
 --adjust-extension \
 --page-requisites \
 --no-parent \
 -w 5 \
 --no-if-modified-since \ 
 --no-http-keep-alive 
 --span-hosts 
 --domain-list=example.com,static1.squarespace.com,images.squarespace-cdn.com example.com

A quick summary of the options:

-w 5 = wait 5 seconds between requests

--no-if-modified-since allows the forcing if you have local copies

--no-http-keep-alive starts a new session each time, useful if you are getting 429 too many requests on the same channel (use with a longer timeout)

--span-hosts => enables spanning across hosts when doing recursive retrieving 

--domain-list=<comma separated domain list> => used to set the domains to be followed for file retrieving

If you have a website hosted somewhere, you probably have assets that come from the host's platform, so spanning multiple domains for the given allows you to get everything.

Once done, you can post to a static site host, adjust DNS, and close that chapter.