Subscribers to this service can create distinct Web archives called "collections", containing only the content they are interested in harvesting, at whatever frequency suits their needs. All collections are full-text searchable. The collections created with Archive-It can be catalogued and managed directly by the subscriber. We keep a minimum of two copies of each collection online. None of these features are available in the general archive, accessed via the Wayback Machine.
Data collected in Archive-It crawls will be periodically indexed into the general Wayback Machine. Additionally, if a subscriber ends their service, their collections will be moved over to the general archive for public access via the Wayback Machine.
Archive-It is very flexible: you can harvest material from the Web as frequently as every 24 hours, once per week, once per month, once per quarter, annually, or just one time. Subscribers can select a different crawl frequency for each chosen URL. Your institution can also choose to start a crawl "on demand" in the case of an unforeseen, spontaneous, or historic event.
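As a rough illustration of per-URL crawl frequencies, the sketch below maps each frequency named above to an interval and computes when a URL would next be due. The names `INTERVALS` and `next_crawl`, and the day counts used for "monthly", "quarterly", and "annual", are assumptions made for this example; they are not part of the Archive-It service or its interface.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical mapping from a crawl frequency to an interval.
# Month, quarter, and year lengths are approximations chosen for this sketch.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=91),
    "annual": timedelta(days=365),
    "one_time": None,  # crawled once, never rescheduled
}


def next_crawl(frequency: str, last_crawl: date) -> Optional[date]:
    """Return the date of the next crawl, or None for a one-time crawl."""
    interval = INTERVALS[frequency]
    if interval is None:
        return None
    return last_crawl + interval


if __name__ == "__main__":
    # Each URL in a collection can carry its own frequency.
    seeds = {
        "http://www.example.org/news/": "daily",
        "http://www.example.org/reports/": "quarterly",
        "http://www.example.org/event/": "one_time",
    }
    for url, frequency in seeds.items():
        print(url, frequency, next_crawl(frequency, date(2024, 1, 1)))
```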
By default, all collections are available for public access via text search from the main page at www.Archive-It.org. However, a subscriber can choose to have their collection(s) made private by special arrangement.
Archive-It provides full-text search capability for all public collections. Alternatively, if you know the site you are looking for, enter the URL into the search box, and Archive-It will search for archived instances of that URL.
Archive-It is designed to fit the needs of many types of organizations and individuals. Its more than 50 partners include state archives, university libraries, federal institutions, state libraries, non-governmental nonprofits, museums, historians, and independent researchers.
Subscribers develop their own collections and have complete control over which Web sites to archive within those collections.
The terms of use are the same terms in place for the general Web archive at Internet Archive (http://www.archive.org/about/terms.php).
A site owner can prevent their site from being archived by putting a robots.txt exclusion in place. This will stop the site from being harvested and will eliminate access to any versions that may have been previously harvested using Archive-It. Site owners can also request manual exclusion of their websites according to Internet Archive policy (http://www.archive.org/about/faqs.php#2).
Archive-It respects robots.txt. The crawler user agent used for Archive-It crawling is archive.org_bot. You can find robots.txt exclusion directions here (be sure to use archive.org_bot, rather than ia_archiver, to prevent any Archive-It related crawling). If you cannot place the robots.txt file, opt not to, or have further questions, email us at archive-it at archive dot org.
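To illustrate the exclusion described above, here is a minimal sketch that uses Python's standard `urllib.robotparser` module to check what a robots.txt file would permit. The robots.txt content, the helper name `is_allowed`, the `SomeOtherBot` agent, and the example.org URL are all made up for this example. It shows how a standard parser interprets the rules; it is not how Archive-It itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content that excludes only the Archive-It crawler
# from the whole site. Other crawlers are not affected by this rule.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.org/index.html"
    print(is_allowed(ROBOTS_TXT, "archive.org_bot", page))  # False
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))     # True
```

Running a check like this before publishing a robots.txt file can confirm that the rule names the intended user agent.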
The Archive-It collections are curated Web collections developed by partner institutions. These collections are much more in-depth and focused on a particular area than the general Web archive. In addition, Archive-It collections can be searched in full text, while the general Web archive cannot. The general Web archive is a much broader snapshot of the Internet at a given point in time, and typically does not have the same depth or frequency of capture as sites archived through the Archive-It service.
All data created using the Archive-It service is hosted and stored by the Internet Archive. We store two copies online and are working with partners to keep redundant copies at the Bibliotheca Alexandrina in Egypt and at other locations in the U.S. Subscribers can also request a copy of their data for local use and preservation, either on a hard drive or over the Internet.
If a partner chooses not to renew their subscription, they can request a copy of their data, and their collection will be migrated to Internet Archive's general Web archive.
All data is in ARC file format. This is a non-proprietary storage format and an ISO work item. For more information on this format, please see the ARC/WARC format specification.
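As a small illustration of what reading this format involves, the sketch below parses a single ARC record header line. It assumes the version 1 URL-record layout, in which the header is one space-separated line holding the URL, IP address, 14-digit archive date, content type, and record length. The names `ArcRecordHeader` and `parse_arc_header` and the sample line are invented for this example, so check the format specification before relying on this layout.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size in bytes of the record body that follows the header


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one header line, assuming the five-field version 1 layout."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    url, ip_address, raw_date, content_type, raw_length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(raw_date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(raw_length),
    )


if __name__ == "__main__":
    sample = "http://www.example.org/ 192.0.2.1 20080101120000 text/html 1234"
    print(parse_arc_header(sample))
```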
A set of open source tools for reading ARC files and indexing them for text search is available at http://access-tools.sourceforge.net.
An open source crawler, Heritrix (http://crawler.archive.org), is used to crawl Web sites for Archive-It. This crawler was developed in collaboration with the International Internet Preservation Consortium (IIPC) and is used by memory institutions around the world.
To prevent your site from being crawled by Heritrix, please use a robots.txt file. Specify archive.org_bot as your user agent.
The Internet Archive is a 501(c)(3) non-profit that was founded in 1996 to build an 'Internet library,' with the purpose of offering permanent access for researchers, historians, and scholars to collections that exist in digital format.
Alexa Internet has been crawling the Web since 1996, which has resulted in a massive archive. If you have a Web site that you would like to ensure is saved for posterity in the Internet Archive, and you have searched the Wayback Machine and found no results, you can visit Alexa's "Webmasters" page at http://pages.alexa.com/help/webmasters/index.html#crawl_site.
Method 2: if you have the Alexa tool bar installed, just visit a site.
Method 3: while visiting a site, use the 'show related links' in Internet Explorer, which uses the Alexa service.
Sites are usually crawled within 24 hours, and in no more than 48 hours. Right now there is a 6-12 month lag between the date a site is crawled and the date it appears in the Wayback Machine.
The Internet Archive Wayback Machine contains approximately 1 petabyte of data and is currently growing at a rate of 20 terabytes per month. This eclipses the amount of text contained in the world's largest libraries, including the Library of Congress. If you tried to place the entire contents of the archive onto floppy disks (we don't recommend this!) and laid them end to end, it would stretch from New York, past Los Angeles, and halfway to Hawaii.
All questions about the Wayback Machine, or other Internet Archive projects, should be addressed to info at archive dot org. You can contact the Archive-It team by emailing archive-it at archive dot org.
The Internet Archive Wayback Machine is a service that allows people to visit archived versions of Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web. Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older version of your favorite Web site. The Internet Archive Wayback Machine can make all of this possible.