Archiving Everything

By Clive Python | 14jammar

Nothing lasts forever, that's a motherfucking fact, Sunny Jim, but there are some good ways to archive the Web so that it can last just a little longer. Here are some helpful resources for archiving the Web, as well as some useful websites that you may want to use.

What is robots.txt?

Before we go over some archive websites, I think it is a good idea to talk about what a robots.txt file is and how it works.

For the people who don't know how robots.txt files work, here is an example from my own website that might help you better understand... or not...

# robots.txt file for owlman.neocities.org
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: *
Disallow: /Who_in_Ascii_Art.html
Disallow: /ascii.html
Disallow: /odds/robotstxtexample.txt

When broken down, we can see that robots cannot view the pages /Who_in_Ascii_Art.html, /ascii.html and /odds/robotstxtexample.txt. Because of this, The Internet Archive will not display these particular pages.

Robots.txt files live at the top level of a website, at a URL like this: https://owlman.neocities.org/robots.txt. The standard was developed in 1994 to guide search engine crawlers in a variety of ways, including telling them which areas to avoid crawling.

The robots exclusion standard (also called the robots exclusion protocol or robots.txt protocol) is a way of telling Web crawlers and other Web robots which parts of a Web site they can see; in this case, we are talking about what archive bots can see and archive.
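
If you want to check what a robots.txt file allows from a script, Python ships with a parser for exactly this. Below is a minimal sketch using the standard library's urllib.robotparser against my own robots.txt; the two page URLs are just examples.

# Minimal sketch: checking a robots.txt file with Python's standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://owlman.neocities.org/robots.txt")
rp.read()

# Disallowed in the robots.txt shown above, so this prints False.
print(rp.can_fetch("*", "https://owlman.neocities.org/ascii.html"))

# Not listed in the robots.txt, so this prints True.
print(rp.can_fetch("*", "https://owlman.neocities.org/library/archive.html"))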

The Internet Archive

https://archive.org/
https://web.archive.org/

The Internet Archive was founded in 1996 and is one of the most used web archive websites on the Net. I would highly rate the site, as it is very easy and quick to use. The Internet Archive does not just archive websites; it also has a big array of eBooks, movies, an audio archive, a TV news archive, a software collection, music and much more, all at no cost to the user.

One of the downsides to using the website is that it obeys robots.txt. This means that if someone doesn't want a page archived - or even their whole website - they can block it from being saved.
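
The Wayback Machine also has a public "availability" API that reports the closest archived snapshot of a URL as JSON, which is handy if you want to check for an archive from a script. A rough sketch (the endpoint and field names are from the Wayback Machine's API documentation, so treat this as an illustration rather than a stable contract):

# Rough sketch: asking the Wayback Machine for the closest snapshot of a URL.
import json
from urllib.request import urlopen

api = "https://archive.org/wayback/available?url=owlman.neocities.org"
with urlopen(api) as response:
    data = json.load(response)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print("Archived at:", closest["url"])  # a web.archive.org/web/... link
else:
    print("No snapshot found.")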

The Internet Archive has add-ons for both Google Chrome and Firefox; they do the same thing, just for different web browsers.

archive.is

https://archive.is/
https://archive.fo/
https://archive.today/
http://archive.li/

Unlike The Internet Archive, archive.is does not obey robots.txt. This is because "it is not a free-walking crawler, [archive.is] saves only one page acting as a direct agent of the human user".

In short, this means that you can save web pages that you otherwise could not with The Internet Archive. One very big downside is that you can't archive Flash content, video, audio, PDF files, RSS or other XML pages.

HTTrack

http://www.httrack.com/

Are you so into archiving websites that you have got to the point where you want to download an entire one? Well, if you have, HTTrack is a good place to start.

HTTrack is very easy to use. However - and this is a very big 'however' - to properly download websites, you need to be connected to the Web for what could be a long time, depending on the size of the website. For example, downloading this website (https://owlman.neocities.org/) may only take an hour or so, but websites such as Textfiles.com can take three days or more.
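
As a sketch, here is how you could kick off a mirror from a script. I'm assuming the httrack command-line tool is installed and on your PATH; the flags are HTTrack's standard ones (-O for the output directory, a '+site/*' filter to keep the crawl on the site, -v for verbose progress).

# Sketch: mirroring a site by calling the httrack command-line tool.
import subprocess

subprocess.run([
    "httrack", "https://owlman.neocities.org/",  # site to mirror
    "-O", "./owlman-mirror",                     # where the copy goes
    "+owlman.neocities.org/*",                   # stay on this site
    "-v",                                        # show progress
], check=True)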

One of the biggest downsides is that, for some reason, some websites won't download, like, at all, so you're fucked. The only way around this would be to manually download the site, but depending on what site it is, this can be a massive pain in the arse.

GifCities

https://gifcities.org/

Owned by The Internet Archive, GifCities is basically a search engine for .gifs from GeoCities websites. There's really not much to say about this website; it's just a cool search engine that was made in celebration of The Internet Archive turning 20 years old. So if you want to search for some kewl stuff, look no further.

textfiles.com

http://www.textfiles.com/

Set up in 1998 by Jason Scott, textfiles.com is dedicated to preserving the digital documents that contain the history of the BBS world and various subcultures. The site categorises and stores thousands of ASCII files. It focuses on text files from the 1980s, but also contains some older files and some that were created well into the 1990s... or at least according to Wikipedia...

ReoCities

http://www.reocities.com/

Born from the ashes of GeoCities, ReoCities' aim is to archive Yahoo's now-dead web hosting service after they pulled the plug on it in 2009. One annoying thing about the website is that whenever you view an archived page, say http://reocities.com/SouthBeach/Palms/2115/, it has a big FUCKING banner at the top of the page that says "If you like the reocities.com project you can donate bitcoins to: 1E8rQq9cmv95CrdrLmqaoD6TErUFKok3bF", yeah, thanks, lads.

Google Groups

https://groups.google.com

A helpful archive of Usenet groups from the nice people at Google. The website has a large back catalogue of newsgroups, started in 1995 by The Deja News Research Service and acquired by Google in 2001. The Google Groups archive of Usenet newsgroup postings dates back to 1981.

OnionLink

http://onion.link/

A website that allows you to view Deep Web sites in a normal browser such as Firefox. If you see a Deep Web URL, simply put .link at the end. For example:

http://3g2upl4pq6kufc4m.onion becomes http://3g2upl4pq6kufc4m.onion.link

(DDG's Tor URL, BTW)
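
The rewrite is simple enough to do by hand, but here is a tiny sketch of it as a function (the helper name is just something I made up for illustration):

# Tiny sketch: rewrite a .onion URL into the .onion.link form OnionLink expects.
def onion_to_onionlink(url):
    scheme, rest = url.split("://", 1)
    host, _, path = rest.partition("/")
    if not host.endswith(".onion"):
        raise ValueError("not a .onion address")
    new_url = scheme + "://" + host + ".link"
    return new_url + "/" + path if path else new_url

print(onion_to_onionlink("http://3g2upl4pq6kufc4m.onion"))
# prints http://3g2upl4pq6kufc4m.onion.link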

But for the love of God, make sure the link is safe to view.


Written by Clive "James" Python, 12/09/17.

https://owlman.neocities.org/library/archive.html
https://web.archive.org/web/*/https://owlman.neocities.org/library/archive.html*