
Links for Web Archiving

[Photo: marigolds, front garden, home, Falmouth, Virginia, US]

Possible Applications to Try:

Information about website compression:

Tools:

Downloadable Website Analyser: a web tool that examines the size gains from website compression.
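To see what such an analyser measures, you can compute the size gain of gzip compression yourself. This is only an illustrative sketch: the sample page and the /tmp paths are made up for the demonstration, and real analysers fetch the live page and its assets.

```shell
# Build a small, deliberately repetitive HTML file to stand in for a web page.
{
  echo '<html><head><title>Sample</title></head><body>'
  for i in $(seq 1 100); do echo '<p>Some highly repetitive filler text.</p>'; done
  echo '</body></html>'
} > /tmp/sample.html

# Compare the raw size against the gzip-compressed size.
raw=$(wc -c < /tmp/sample.html)
gz=$(gzip -c /tmp/sample.html | wc -c)
echo "raw: $raw bytes, gzipped: $gz bytes"
echo "saving: $(( (raw - gz) * 100 / raw ))%"
```

Highly repetitive markup like this compresses very well, which is why servers commonly enable gzip for HTML, CSS and JavaScript responses.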

http://webarchivist.org/resources.htm

Below: annotated text taken from Web Archiving Resources, Office for Information Systems, Harvard University Libraries

Harvesting Services

Archive-It

A subscription harvesting service provided by the Internet Archive. Through a web-based interface, users can capture, catalogue and archive their institution’s own web site or build additional collections, then search and browse the collections once the captures are complete.
http://www.archive-it.org/

Harvesting Software

Open source harvesting software

Combine Harvesting Robot
http://www.lub.lu.se/combine/
Harvesting and indexing software written in Perl and C++, under the GPL license. Formerly used (and possibly still) by Swedish, Danish and Austrian archives; it is unclear whether it is still actively developed.

GNU Wget
http://www.gnu.org/software/wget/wget.html
A non-interactive command-line tool under the GPL license that can be used from scripts and other programs.
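Because Wget is non-interactive, a simple site capture can be a one-line shell command suitable for cron jobs or scripts. The invocation below is illustrative (example.com is a placeholder target); the flags are standard GNU Wget options.

```shell
# Typical scripted site capture with GNU Wget:
#   --mirror            recursive download with timestamping
#   --convert-links     rewrite links so the local copy is browsable offline
#   --adjust-extension  save text/html responses with an .html suffix
#   --page-requisites   also fetch the images, CSS and scripts pages need
#   --wait=1            pause one second between requests, out of politeness
wget --mirror --convert-links --adjust-extension \
     --page-requisites --wait=1 http://www.example.com/
```

Note that link rewriting makes the copy convenient to browse but alters the original HTML, which matters if the goal is a faithful archival record rather than an offline mirror.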

Heritrix, Internet Archive and Nordic National Libraries
http://crawler.archive.org/
A robust web archiving harvester under the LGPL license. It offers very flexible means of configuring and controlling the harvest, is designed to be extensible through new Java modules, and is configurable through a web interface. This work is sponsored by the IIPC (International Internet Preservation Consortium).

HTTrack
http://www.httrack.com/
An offline browser under the GPL license that can be used from a graphical interface or the command line.
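A sketch of HTTrack's command-line form, for readers who want the scriptable rather than graphical interface. This is illustrative only: example.com and the output path are placeholders, and options should be checked against the HTTrack manual for your version.

```shell
# Mirror a site with HTTrack from the command line:
#   -O        sets the local output directory for the mirror
#   "+..."    is a scan filter that keeps the crawl within the site
#   -v        enables verbose progress output
httrack "http://www.example.com/" -O /tmp/example-mirror \
        "+*.example.com/*" -v
```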

Nalanda iVia Focused Crawler (NIFC)

http://ivia.ucr.edu/projects/Nalanda/

Designed to find Web resources with the same topic as a seed set of known resources. NIFC was created by Dr. Soumen Chakrabarti at the Indian Institute of Technology (Bombay), and further developed in collaboration with the iVia team.

Nedlib Harvester, Center for Scientific Computing – the Finnish IT Center for Science
http://www.csc.fi/sovellus/nedlib/
Developed as a part of the Nedlib project funded by the European Union. Written in C and dependent on the MySQL database. No longer supported or developed.

Commercial harvesting software

Internet Researcher, Zylox Software
http://www.zylox.com/
A Windows-only offline browsing tool with a graphical interface.

Offline Explorer and Mass Downloader, MetaProducts Software Corporation
http://www.metaproducts.com/mp/mpProducts_List.asp
Various offline browsing tools for Windows. The MetaProducts Offline Explorer Pro 2.1 is used by DACHS (Digital Archive for Chinese Studies) - http://www.sino.uni-heidelberg.de/dachs/

RafaBot, Spadix Software
http://www.spadixbd.com/rafabot/
A Windows-only offline browsing tool with a graphical interface. In addition to being supplied with a list of URLs, it can be given search terms, and RafaBot will use search engines to find and download all matching web sites.

SuperBot, Sparkleware
http://www.sparkleware.com/dl.html
A simple Windows-only offline browsing tool with a graphical interface.

SurfSaver, askSam Systems
http://www.surfsaver.com/
An add-on to Microsoft Internet Explorer.

Teleport Webspiders
http://www.tenmax.com/teleport/home.htm
Various sophisticated Windows-only versions with different interfaces (graphical, console, scriptable) and feature sets.

WebCopier, MaximumSoft Corp.
http://www.maximumsoft.com/index.html
An offline browsing tool with a graphical interface, available in multiple versions for different operating systems and performance/feature levels.

Discovery, Display and Access Software

ARC Access Tools
http://archive-access.sourceforge.net/
Internet Archive’s list of tools for processing and accessing content in ARC files.

Kea

http://www.nzdl.org/Kea/

A GPLed tool for automatic keyword extraction from text documents. Originally written in a combination of Perl, C and Java; now available in an all-Java version. From the New Zealand Digital Library at the University of Waikato, New Zealand.
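To make the task concrete: this is not Kea's algorithm (Kea trains a machine-learning model over candidate phrases, using features such as TF×IDF), but the classic Unix pipeline below is a naive frequency baseline that illustrates what keyword extraction starts from. The sample text and /tmp paths are made up.

```shell
# A toy document to extract keywords from.
cat > /tmp/doc.txt <<'EOF'
Web archiving preserves web content. Web archiving tools harvest sites,
and archiving services index the harvested content for later access.
EOF

# Split into words, lowercase, count occurrences, show the five most frequent.
tr -cs '[:alpha:]' '\n' < /tmp/doc.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn | head -5 \
  | tee /tmp/keywords.txt
```

Raw frequency ranks stopwords and topic words alike, which is exactly the weakness tools like Kea address with learned phrase-level features.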

libiViaMetadata

http://ivia.ucr.edu/manuals/libiViaMetadata/current/

A GPLed C++ library for assigning descriptive metadata to web files. Developed under the iVia Project. Includes the PhraseRate program which is described at http://ivia.ucr.edu/projects/PhraseRate/

NutchWAX (Nutch + Web Archive eXtensions), Internet Archive and Nordic National Libraries
http://archive-access.sourceforge.net/projects/nutch/gettingstarted.html
A tool for indexing and searching web archives. It currently works only with the ARC format (http://www.archive.org/web/researcher/ArcFileFormat.php) and is implemented as a Java servlet. Parsers can be added to handle additional formats, e.g. xpdf for PDF files. This work is sponsored by the IIPC (International Internet Preservation Consortium).
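For a feel of what the ARC format referenced above looks like, here is a deliberately simplified sketch. A real ARC file starts with a filedesc:// version record; after that, each record is a one-line header (URL, IP address, archive date, content type, content length in bytes) followed by the captured content. The URL, IP and date below are made-up sample values.

```shell
# Write one fake ARC-style record: header line, then 19 bytes of content.
printf 'http://example.com/ 192.0.2.1 20060101120000 text/html 19\n<html>hello</html>\n' \
  > /tmp/record.arc

# Pull the URL and the declared content length out of the header with awk.
header=$(head -1 /tmp/record.arc)
url=$(echo "$header" | awk '{print $1}')
len=$(echo "$header" | awk '{print $5}')
echo "url=$url length=$len"
```

The length field is what lets a reader skip straight to the next record, which is why tools like NutchWAX can index large ARC files sequentially.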

Wayback, Internet Archive

http://archive-access.sourceforge.net/projects/wayback/

The open source version of the Internet Archive’s proprietary search and display interface, the “Wayback Machine” (listed next).

Wayback Machine, Internet Archive
http://www.archive.org/web/web.php
A proprietary interface to the Internet Archive’s huge collection of web pages archived from 1996 to the present.

WERA (Web Archive Access), Internet Archive and National Library of Norway
http://nwa.nb.no/
An archive viewer application that gives Wayback Machine-style access to web archive collections, along with full-text search and easy navigation between different versions of a web page. WERA is based on, and replaces, the NwaToolset; it uses the NutchWAX search engine and is written in PHP and Java. This work is sponsored by the IIPC (International Internet Preservation Consortium).

General Web Archiving Suites

Software that forms a system of web archiving tools rather than individual applications

DataFountains

http://ivia.ucr.edu/manuals/DataFountains/1.0.0/

A tool for discovering, harvesting and describing web resources. Developed under the iVia Project.

PANDAS (PANDORA Digital Archiving System), National Library of Australia
http://pandora.nla.gov.au/pandas.html
Tools for controlling the harvest, conducting quality assurance checks, initiating archiving processes, managing metadata (including access restrictions), and producing management reports. Uses the HTTrack harvester. PANDAS was created to enable very selective harvesting and is not intended for large-scale automated harvests. The developers of this software are re-engineering PANDAS to use IIPC tools like Heritrix and WERA, and to integrate it better with their digital repository.

WebArchivist Software Suite, SUNY Institute of Technology and University of Washington
http://www.webarchivist.org/resources.htm
Tools for entering metadata, and for searching, analyzing and displaying archived sites. The software is not yet licensed, but according to the product’s website the plan is to make it available to other organizations. Used for the Library of Congress’ Election 2002 (http://lcweb4.loc.gov/elect2002/) and September 11 (http://september11.archive.org/) web archives, as well as the Asian Tsunami Web Archive (http://tsunami.archive.org/).
