
Links for Web Archiving

[Photo: marigolds, front garden, home, Falmouth, Virginia, US]

Possible Applications to Try:

Information about website compression:

Tools:

Downloadable Website Analyser: a web tool that examines the size gains from website compression.
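To see what such an analyser measures, you can compute the size gain of gzip compression yourself. This is only an illustrative sketch: the sample page and the /tmp paths are made up for the demonstration, and real analysers fetch the live page and its assets.

```shell
# Build a small, deliberately repetitive HTML file to stand in for a web page.
{
  echo '<html><head><title>Sample</title></head><body>'
  for i in $(seq 1 100); do echo '<p>Some highly repetitive filler text.</p>'; done
  echo '</body></html>'
} > /tmp/sample.html

# Compare the raw size against the gzip-compressed size.
raw=$(wc -c < /tmp/sample.html)
gz=$(gzip -c /tmp/sample.html | wc -c)
echo "raw: $raw bytes, gzipped: $gz bytes"
echo "saving: $(( (raw - gz) * 100 / raw ))%"
```

Highly repetitive markup like this compresses very well, which is why servers commonly enable gzip for HTML, CSS and JavaScript responses.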

http://webarchivist.org/resources.htm

Below: annotated text taken from Web Archiving Resources, Office for Information Systems, Harvard University Libraries

Harvesting Services

Archive-It

A subscription harvesting service provided by the Internet Archive. Through a web-based interface, users can capture, catalogue and archive their institution’s own web site or build additional collections, then search and browse the collections once the captures are complete.
http://www.archive-it.org/

Harvesting Software

Open source harvesting software

Combine Harvesting Robot
http://www.lub.lu.se/combine/
Harvesting and indexing software written in Perl and C++, under the GPL license. Formerly used (and possibly still) by Swedish, Danish and Austrian archives; it is unclear whether it is still actively developed.

GNU Wget
http://www.gnu.org/software/wget/wget.html
A non-interactive command-line tool under the GPL license that can be used from scripts and other programs.
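Because Wget is non-interactive, a simple site capture can be a one-line shell command suitable for cron jobs or scripts. The invocation below is illustrative (example.com is a placeholder target); the flags are standard GNU Wget options.

```shell
# Typical scripted site capture with GNU Wget:
#   --mirror            recursive download with timestamping
#   --convert-links     rewrite links so the local copy is browsable offline
#   --adjust-extension  save text/html responses with an .html suffix
#   --page-requisites   also fetch the images, CSS and scripts pages need
#   --wait=1            pause one second between requests, out of politeness
wget --mirror --convert-links --adjust-extension \
     --page-requisites --wait=1 http://www.example.com/
```

Note that link rewriting makes the copy convenient to browse but alters the original HTML, which matters if the goal is a faithful archival record rather than an offline mirror.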

Heritrix, Internet Archive and Nordic National Libraries
http://crawler.archive.org/
A robust web archiving harvester under the LGPL license. It offers very flexible means of configuring and controlling the harvest, is designed to be extensible through new Java modules, and is configurable through a web interface. This work is sponsored by the IIPC (International Internet Preservation Consortium).

HTTrack
http://www.httrack.com/
An offline browser under the GPL license that can be used from a graphical interface or the command line.
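A sketch of HTTrack's command-line form, for readers who want the scriptable rather than graphical interface. This is illustrative only: example.com and the output path are placeholders, and options should be checked against the HTTrack manual for your version.

```shell
# Mirror a site with HTTrack from the command line:
#   -O        sets the local output directory for the mirror
#   "+..."    is a scan filter that keeps the crawl within the site
#   -v        enables verbose progress output
httrack "http://www.example.com/" -O /tmp/example-mirror \
        "+*.example.com/*" -v
```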

Nalanda iVia Focused Crawler (NIFC)

http://ivia.ucr.edu/projects/Nalanda/

Designed to find Web resources with the same topic as a seed set of known resources. NIFC was created by Dr. Soumen Chakrabarti at the Indian Institute of Technology (Bombay), and further developed in collaboration with the iVia team.

Nedlib Harvester, Center for Scientific Computing – the Finnish IT Center for Science
http://www.csc.fi/sovellus/nedlib/
Developed as a part of the Nedlib project funded by the European Union. Written in C and dependent on the MySQL database. No longer supported or developed.

Commercial harvesting software

Internet Researcher, Zylox Software
http://www.zylox.com/
A Windows-only offline browsing tool with a graphical interface.

Offline Explorer and Mass Downloader, MetaProducts Software Corporation
http://www.metaproducts.com/mp/mpProducts_List.asp
Various offline browsing tools for Windows. The MetaProducts Offline Explorer Pro 2.1 is used by DACHS (Digital Archive for Chinese Studies) - http://www.sino.uni-heidelberg.de/dachs/

RafaBot, Spadix Software
http://www.spadixbd.com/rafabot/
A Windows-only offline browsing tool with a graphical interface. In addition to being supplied with a list of URLs, it can be given search terms, and RafaBot will use search engines to find and download all matching web sites.

SuperBot, Sparkleware
http://www.sparkleware.com/dl.html
A simple Windows-only offline browsing tool with a graphical interface.

SurfSaver, askSam Systems
http://www.surfsaver.com/
An add-on to Microsoft Internet Explorer.

Teleport Webspiders
http://www.tenmax.com/teleport/home.htm
Various sophisticated Windows-only versions with different interfaces (graphical, console, scriptable) and feature sets.

WebCopier, MaximumSoft Corp.
http://www.maximumsoft.com/index.html
An offline browsing tool with a graphical interface, available in multiple versions for different operating systems and performance/feature levels.

Discovery, Display and Access Software

ARC Access Tools
http://archive-access.sourceforge.net/
Internet Archive’s list of tools for processing and accessing content in ARC files.

Kea

http://www.nzdl.org/Kea/

A GPLed tool for automatic keyword extraction from text documents. Originally written in a combination of Perl, C and Java; now available in an all-Java version. From the New Zealand Digital Library at the University of Waikato, New Zealand.
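To make the task concrete: this is not Kea's algorithm (Kea trains a machine-learning model over candidate phrases, using features such as TF×IDF), but the classic Unix pipeline below is a naive frequency baseline that illustrates what keyword extraction starts from. The sample text and /tmp paths are made up.

```shell
# A toy document to extract keywords from.
cat > /tmp/doc.txt <<'EOF'
Web archiving preserves web content. Web archiving tools harvest sites,
and archiving services index the harvested content for later access.
EOF

# Split into words, lowercase, count occurrences, show the five most frequent.
tr -cs '[:alpha:]' '\n' < /tmp/doc.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn | head -5 \
  | tee /tmp/keywords.txt
```

Raw frequency ranks stopwords and topic words alike, which is exactly the weakness tools like Kea address with learned phrase-level features.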

libiViaMetadata

http://ivia.ucr.edu/manuals/libiViaMetadata/current/

A GPLed C++ library for assigning descriptive metadata to web files. Developed under the iVia Project. Includes the PhraseRate program which is described at http://ivia.ucr.edu/projects/PhraseRate/

NutchWAX (Nutch + Web Archive eXtensions), Internet Archive and Nordic National Libraries
http://archive-access.sourceforge.net/projects/nutch/gettingstarted.html
A tool for indexing and searching web archives. It currently works only with the ARC format (http://www.archive.org/web/researcher/ArcFileFormat.php) and is implemented as a Java servlet. Parsers can be added to handle additional formats, e.g. xpdf for PDF files. This work is sponsored by the IIPC (International Internet Preservation Consortium).
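For a feel of what the ARC format referenced above looks like, here is a deliberately simplified sketch. A real ARC file starts with a filedesc:// version record; after that, each record is a one-line header (URL, IP address, archive date, content type, content length in bytes) followed by the captured content. The URL, IP and date below are made-up sample values.

```shell
# Write one fake ARC-style record: header line, then 19 bytes of content.
printf 'http://example.com/ 192.0.2.1 20060101120000 text/html 19\n<html>hello</html>\n' \
  > /tmp/record.arc

# Pull the URL and the declared content length out of the header with awk.
header=$(head -1 /tmp/record.arc)
url=$(echo "$header" | awk '{print $1}')
len=$(echo "$header" | awk '{print $5}')
echo "url=$url length=$len"
```

The length field is what lets a reader skip straight to the next record, which is why tools like NutchWAX can index large ARC files sequentially.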

Wayback, Internet Archive

http://archive-access.sourceforge.net/projects/wayback/

The open source version of the Internet Archive’s proprietary search and display interface, the “Wayback Machine” (listed next).

Wayback Machine, Internet Archive
http://www.archive.org/web/web.php
A proprietary interface to the Internet Archive’s huge collection of web pages archived from 1996 to the present.

WERA (Web Archive Access), Internet Archive and National Library of Norway
http://nwa.nb.no/
An archive viewer application that gives Wayback Machine-style access to web archive collections, along with full-text search and easy navigation between different versions of a web page. WERA is based on, and replaces, the NwaToolset; it uses the NutchWAX search engine and is written in PHP and Java. This work is sponsored by the IIPC (International Internet Preservation Consortium).

General Web Archiving Suites

Software that forms a system of web archiving tools rather than individual applications

DataFountains

http://ivia.ucr.edu/manuals/DataFountains/1.0.0/

A tool for discovering, harvesting and describing web resources. Developed under the iVia Project.

PANDAS (PANDORA Digital Archiving System), National Library of Australia
http://pandora.nla.gov.au/pandas.html
Tools for controlling the harvest, conducting quality assurance checks, initiating archiving processes, managing metadata (including access restrictions), and producing management reports. Uses the HTTrack harvester. PANDAS was created to enable very selective harvesting and is not intended for large-scale automated harvests. The developers of this software are re-engineering PANDAS to use IIPC tools like Heritrix and WERA, and to integrate it better with their digital repository.

WebArchivist Software Suite, SUNY Institute of Technology and University of Washington
http://www.webarchivist.org/resources.htm
Tools for entering metadata, and for searching, analyzing and displaying archived sites. The software is not yet licensed, but according to the product’s website the plan is to make it available to other organizations. Used for the Library of Congress’ Election 2002 (http://lcweb4.loc.gov/elect2002/) and September 11 (http://september11.archive.org/) web archives, as well as the Asian Tsunami Web Archive (http://tsunami.archive.org/).
