Archiving Everything
By Clive Python | 14jammar
Nothing lasts forever, that's a motherfucking fact, Sunny Jim, but there are
some good ways to archive the Web so that it can last just a little longer.
Here are some helpful resources for archiving the Web, as well as some useful
websites that you may want to use.
What are robots.txt?
Before we go over some archive websites, I think it is a good idea to talk about what a robots.txt file is and how it works.
For the people who don't know how robots.txt works, here is an example from my own website that might help you better understand... or not...
# robots.txt file for owlman.neocities.org
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.
User-agent: *
Disallow: /Who_in_Ascii_Art.html
Disallow: /ascii.html
Disallow: /odds/robotstxtexample.txt
When broken down, we can see that robots cannot view the pages /Who_in_Ascii_Art.html,
/ascii.html and /odds/robotstxtexample.txt; because of this, The Internet Archive
will not display those particular pages.
Robots.txt files live at the top level of a website, at a URL
like this: https://owlman.neocities.org/robots.txt.
The standard was developed in 1994 to guide search engine crawlers in a variety of ways, including telling them which areas to avoid crawling.
The robots exclusion standard (also called the robots exclusion protocol or robots.txt protocol) is
a way of telling Web crawlers and other Web robots which parts of a Web site they can see; in this
case, we are talking about what archive bots can see and archive.
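If you want to check for yourself whether a robots.txt blocks a given page, Python's standard library ships a parser for the robots exclusion standard. Here is a minimal sketch using my own robots.txt as the example:

import urllib.robotparser

# Point the parser at the site's robots.txt (remember, it always lives at the top level).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://owlman.neocities.org/robots.txt")
rp.read()  # fetch and parse the file

# "*" matches any robot, just like the User-agent line in the example above.
print(rp.can_fetch("*", "https://owlman.neocities.org/ascii.html"))  # False - disallowed
print(rp.can_fetch("*", "https://owlman.neocities.org/index.html"))  # True - not listed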
The Internet Archive
https://archive.org/
https://web.archive.org/
The Internet Archive was founded in 1996 and is one of the most used web archive websites on the
Net; I would highly rate the site, as it is very easy and
quick to use. The Internet Archive does not just archive websites: it also has a big array of
eBooks,
movies,
an audio archive,
a TV news archive,
a software collection,
music,
plus much more, at no cost at all to the user.
One of the downsides to using the website is that it obeys robots.txt. This means that if
someone doesn't want a page archived - or even their whole website - they can opt out by blocking the crawler in robots.txt.
The Internet Archive has two add-ons, one for Google Chrome
and one for
Firefox;
both do the same thing, just for different web browsers.
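If you want to check from a script whether The Internet Archive already has a copy of a page, it offers a simple JSON endpoint (the Wayback Machine availability API). Here's a rough sketch using nothing but Python's standard library; you can also ask it to grab a fresh copy by going to https://web.archive.org/save/ followed by the page's URL.

import json
import urllib.parse
import urllib.request

def newest_wayback_snapshot(url):
    # Ask the Wayback Machine availability API for the closest snapshot of `url`.
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api) as response:
        data = json.loads(response.read().decode("utf-8"))
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

print(newest_wayback_snapshot("https://owlman.neocities.org/"))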
archive.is
https://archive.is/
https://archive.fo/
https://archive.today/
http://archive.li/
Unlike The Internet Archive, archive.is does not obey robots.txt; this is
because "it is not a free-walking crawler, [archive.is] saves only
one page acting as a direct agent of the human user".
In short, this means that you can save web pages that you otherwise
could not with The Internet Archive. One very big downside is that you
can't archive content loaded by Flash, or video, audio, PDF
files, RSS and other XML pages.
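If you just want to pull back whatever archive.is already holds for a page, the site answers "newest" lookups by redirecting you to the latest snapshot - at least that is how its shortcut URLs have worked for me, so treat the exact path as an assumption rather than gospel. A rough sketch:

import urllib.request

def newest_archive_is_snapshot(url):
    # The /newest/ shortcut is assumed here; archive.is redirects it to the latest snapshot.
    request = urllib.request.Request(
        "https://archive.is/newest/" + url,
        headers={"User-Agent": "Mozilla/5.0"},  # the site can be picky about bare clients
    )
    with urllib.request.urlopen(request) as response:
        return response.geturl()  # the final URL after following redirects

print(newest_archive_is_snapshot("https://owlman.neocities.org/"))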
HTTrack
Are you so into archiving websites that you've got to the point where you want to
download an entire website? Well, if you have, HTTrack is a good place to start.
HTTrack is very easy to use; however - and this is a very big 'however' - to
properly download websites, you need to be connected to the Web for what
could be a long time, depending on the size of the website. For example,
when downloading this website
(https://owlman.neocities.org/)
it may only
take an hour or so, but for websites such as Textfiles.com it can take
three or more days.
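If you'd rather drive HTTrack from a script than click through its wizard, the command-line client just takes the site's URL plus an output folder. A minimal sketch, wrapped in Python so it can sit inside a bigger archiving job - double-check the flags against httrack --help on your own install:

import subprocess

# Mirror the whole site into ./owlman-mirror.
# "-O" tells HTTrack where to put the mirror plus its cache and log files.
subprocess.run(
    ["httrack", "https://owlman.neocities.org/", "-O", "owlman-mirror"],
    check=True,  # raise an error if HTTrack exits with a non-zero status
)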
One of the biggest downsides to it is that, for some reason, some websites
won't download, like, at all, so you're fucked. The only way around this
would be to manually download the site, but depending on what site it
is, this can be a massive pain in the arse.
GifCities
Owned by The Internet Archive, GifCities is basically a search engine for
.gifs from GeoCities websites. There's really not much to say about this
website; it's just a cool search engine that was made in celebration of
The Internet Archive turning 20 years old. So if you want to search for
some kewl stuff, look no further.
textfiles.com
Set up in 1998 by Jason Scott, textfiles.com is dedicated to
preserving the digital documents that contain the history of
the BBS world and various subcultures. The site categorises
and stores thousands of ASCII files. It focuses on text files
from the 1980s, but also contains some older files and some
that were created well into the 1990s...
or at least according to Wikipedia...
ReoCities
Born from the ashes of GeoCities, ReoCities' aim is to archive Yahoo!'s
now-dead web hosting service after they pulled the plug on it in
2009. One annoying thing about the website is that whenever you
view an archive, say http://reocities.com/SouthBeach/Palms/2115/
it has a big FUXKING banner at the top of the page that says "If
you like the reocities.com project you can donate bitcoins to:
1E8rQq9cmv95CrdrLmqaoD6TErUFKok3bF", yeah, thanks, lads.
Google Groups
A helpful archive of Usenet groups from the nice people at Google.
The website has a large back catalogue of newsgroups, started in
1995 when it was owned by the Deja News Research Service. This
archive was acquired by Google in 2001. The Google Groups archive
of Usenet newsgroup postings dates back to 1981.
OnionLink
A website that allows you to view Deep Web sites in a normal browser such as Firefox. If you see a Deep Web URL, simply put .link at the end of the domain, for example:
http://3g2upl4pq6kufc4m.onion is now going to be http://3g2upl4pq6kufc4m.onion.link
But for the love of God, make sure the link is safe to view.
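If you've got a whole page of .onion links to convert, the rewrite is just "stick .link on the end of the host name". A tiny sketch:

from urllib.parse import urlsplit, urlunsplit

def onion_to_onionlink(url):
    # Rewrite a .onion URL so it can be opened through OnionLink.
    parts = urlsplit(url)
    if not parts.hostname or not parts.hostname.endswith(".onion"):
        return url  # leave non-onion links alone
    netloc = parts.netloc.replace(".onion", ".onion.link", 1)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(onion_to_onionlink("http://3g2upl4pq6kufc4m.onion"))
# prints http://3g2upl4pq6kufc4m.onion.link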
Written by Clive "James" Python, 12/09/17.
https://owlman.neocities.org/library/archive.html
https://web.archive.org/web/*/https://owlman.neocities.org/library/archive.html*