Web Archives

GatherHub's web archive feature creates comprehensive, offline-browsable archives of websites with automatic link rewriting and navigation indexes.

Note: Web archives are automatically created when you add a URL with the "web-archive" media type. The system will crawl the site, download all pages, and create a browsable offline archive.

Overview

Web archives in GatherHub go beyond simple page saving. They create complete, self-contained archives of websites that can be browsed offline exactly as they appeared online. The system automatically:

  • Crawls the website to discover all linked pages within the same domain
  • Downloads all pages with their resources (CSS, JavaScript, images)
  • Rewrites internal links to work in the offline archive
  • Creates a navigation index for easy browsing
  • Generates metadata about the archive contents

How It Works

1. URL Discovery

The system starts with your provided URL and crawls the website to discover all internal links. It respects domain boundaries and configurable depth limits.
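
Conceptually, the discovery step is a breadth-first traversal bounded by crawl depth and domain. The following Python sketch shows the idea; the function names and details are illustrative, not GatherHub's actual implementation:

import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url, user_agent="GatherHub-Archiver/1.0"):
    """Download a page as text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def discover_urls(start_url, max_depth=2, rate_limit_ms=2000):
    """Breadth-first crawl that stays on the starting domain."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url
        if depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(fetch_html(url))
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            # Stay on the same domain and skip pages already queued.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
        time.sleep(rate_limit_ms / 1000)  # rate limiting between requests


for page_url in discover_urls("https://example.com", max_depth=2):
    print(page_url)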

2. Content Download

Each discovered page is downloaded using monolith or single-file, preserving all resources and ensuring pages display correctly offline.
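
As a rough illustration, a single page could be saved with the monolith command-line tool like this (the paths are illustrative; GatherHub performs this step internally):

import subprocess
from pathlib import Path


def download_page(url, dest):
    """Save a self-contained copy of `url` using the monolith CLI."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # monolith inlines CSS, JavaScript, and images into a single HTML file.
    subprocess.run(["monolith", url, "-o", str(dest)], check=True)


download_page("https://example.com/about", "web-archive/example.com/about/index.html")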

3. Link Rewriting

Internal links are automatically rewritten to point to the local archived files, maintaining the site's navigation structure.

4. Index Generation

A comprehensive index page is created, listing all archived pages with their titles and providing easy navigation.
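
In essence, this step collects each page's title and archived path and renders them into a single HTML listing. A minimal sketch of the idea (the markup is simplified; the real index carries more metadata than this):

from html import escape
from pathlib import Path


def write_index(pages, out_file):
    """Write a minimal navigation index; `pages` holds (title, relative_path) pairs."""
    items = "\n".join(
        f'    <li><a href="{escape(path)}">{escape(title)}</a></li>'
        for title, path in pages
    )
    html = (
        "<!DOCTYPE html>\n<html>\n<head><title>Web Archive Index</title></head>\n<body>\n"
        f"  <h1>Archived Pages</h1>\n  <ul>\n{items}\n  </ul>\n</body>\n</html>\n"
    )
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(html, encoding="utf-8")


write_index(
    [("Home", "example.com/index.html"), ("About", "example.com/about/index.html")],
    "web-archive/index.html",
)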

Archive Structure

Web archives are organized in a clear, hierarchical structure:

web-archive/
├── archive_summary.txt     # Archive metadata and statistics
├── index.html             # Main navigation index
└── domain.com/            # Domain-specific directory
    ├── index.html         # Site homepage
    ├── about/
    │   └── index.html     # About page
    ├── products/
    │   ├── index.html     # Products listing
    │   └── item-1/
    │       └── index.html # Individual product page
    └── contact/
        └── index.html     # Contact page

Configuration

Web archive behavior can be customized through the media type configuration:

Basic Configuration


[[media_types]]
name = 'web-archive'
extensions = []
domains = []
tool = 'internal'
tool_path = ''
arguments = '--max-depth=2 --rate-limit=2000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'

Configuration Options

Option          Default                       Description
--max-depth     2                             Maximum crawl depth from the starting URL
--rate-limit    2000                          Delay between requests in milliseconds
--user-agent    "GatherHub-WebArchive/1.0"    User agent string for requests
--verbose       false                         Enable detailed logging during crawl
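
For example, a deeper but slower crawl can be configured by adjusting the arguments string (the values below are illustrative):

[[media_types]]
name = 'web-archive'
extensions = []
domains = []
tool = 'internal'
tool_path = ''
arguments = '--max-depth=3 --rate-limit=3000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'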

Archive Features

Link Rewriting

One of the most powerful features of GatherHub's web archives is automatic link rewriting. When pages are archived, all internal links are automatically updated to point to the local archived versions (a simplified sketch of the mapping follows the list below):

  • Absolute URLs like https://example.com/about become example.com/about/index.html
  • Relative URLs are preserved and work correctly within the archive structure
  • Fragment links (anchors) continue to work within pages
  • External links are left unchanged and will open in the browser when clicked
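
The mapping for absolute internal URLs can be illustrated with a short Python sketch. The function name and layout assumptions are ours, not GatherHub's actual implementation; in particular, adjusting the result relative to the linking page's own directory is omitted for brevity:

from urllib.parse import urlparse


def rewrite_link(href, archive_domain):
    """Map an internal absolute URL onto its archived location."""
    if href.startswith("#"):
        return href  # in-page anchor, nothing to rewrite
    parsed = urlparse(href)
    if parsed.netloc and parsed.netloc != archive_domain:
        return href  # external link, left unchanged
    if not parsed.netloc:
        return href  # relative link, already valid inside the archive
    path = parsed.path.strip("/")
    local = f"{archive_domain}/{path}/index.html" if path else f"{archive_domain}/index.html"
    return f"{local}#{parsed.fragment}" if parsed.fragment else local


print(rewrite_link("https://example.com/about", "example.com"))
# -> example.com/about/index.html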

Navigation Index

Every web archive includes a comprehensive navigation index (index.html) that provides:

  • Archive overview with creation date and statistics
  • Complete page listing with titles and descriptions
  • Hierarchical organization reflecting the site structure
  • Search functionality for finding specific pages
  • Archive metadata including crawl settings and results

Metadata Preservation

The archive process preserves important metadata about each page (see the illustrative record after this list):

  • Page titles extracted from HTML title tags
  • URLs both original and archived locations
  • File sizes for storage management
  • Crawl timestamps for version tracking
  • Link relationships between pages
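
Conceptually, each archived page's metadata can be thought of as a record like the following (the field names are illustrative, not GatherHub's actual schema):

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ArchivedPage:
    """Illustrative metadata record for one archived page."""
    title: str                # extracted from the HTML <title> tag
    original_url: str         # where the page lived online
    archived_path: str        # location inside the archive directory
    size_bytes: int           # file size, for storage management
    crawled_at: datetime      # crawl timestamp, for version tracking
    links_to: list = field(default_factory=list)  # link relationships between pages


page = ArchivedPage(
    title="About Us",
    original_url="https://example.com/about",
    archived_path="example.com/about/index.html",
    size_bytes=48_213,
    crawled_at=datetime.now(),
    links_to=["https://example.com/contact"],
)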

Best Practices

Choosing Crawl Depth

Important: Be careful with crawl depth settings. Large sites can generate thousands of pages, consuming significant storage space and bandwidth.

  • Depth 1: Only the starting page and its direct links
  • Depth 2: Includes pages linked from the direct links (recommended for most sites)
  • Depth 3+: Use cautiously, can result in very large archives

Rate Limiting

Always use appropriate rate limiting to be respectful to the target website:

  • 2000ms (2 seconds): Conservative, good for most sites
  • 1000ms (1 second): Moderate, suitable for robust sites
  • 500ms or less: Use only for your own sites or with permission

Storage Considerations

  • Monitor archive sizes: Web archives can grow large quickly
  • Use selective archiving: Consider archiving specific sections rather than entire sites
  • Regular cleanup: Remove outdated archives to manage storage

Viewing Archives

Web archives can be viewed in several ways:

Through GatherHub Web Interface

  1. Navigate to the completed job in the Jobs page
  2. Click on the job to view its details
  3. In the Files section, click on index.html to open the archive
  4. Browse the archive using the navigation index

Direct File Access

  1. Navigate to the archive directory in your file system
  2. Open index.html in any web browser
  3. Use the navigation index to explore the archived site
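
If your browser restricts resources loaded from file:// URLs, a simple workaround is to serve the archive directory with any local web server, for example (replace the path with your actual archive directory):

cd /path/to/web-archive
python -m http.server 8000

Then open http://localhost:8000/index.html in your browser.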

Archive URLs

When viewing through GatherHub's web interface, archive files are accessible via URLs like:

  • http://localhost:8060/file/{job_id}/index.html - Main archive index
  • http://localhost:8060/file/{job_id}/domain.com/page/index.html - Specific archived page

Troubleshooting

Common Issues

Missing pages in the archive

  • Check the crawl depth setting - increase if needed
  • Verify that pages are properly linked (some pages may not be discoverable)
  • Check for JavaScript-heavy sites that may require different archiving tools
  • Review the archive summary for any errors during crawling

Links do not work in the archive

  • Ensure you're viewing the archive through a web server (not file:// URLs)
  • Check that the target pages were actually archived
  • Verify that the link rewriting process completed successfully
  • Some dynamic links may not be rewritable and will remain as external links

Crawl failures or oversized archives

  • Reduce the crawl depth to limit the number of pages
  • Increase the rate limit to be more respectful to the target site
  • Check if the site blocks automated crawling (robots.txt, rate limiting)
  • Verify that monolith or single-file tools are properly installed
  • Check network connectivity and DNS resolution

Advanced Usage

Custom Archiving Tools

While GatherHub includes built-in web archiving capabilities, you can also configure it to use external tools:

[[media_types]]
name = 'custom-web-archive'
patterns = ['^https?://special-site\.com/.*']
tool = 'wget'
tool_path = '/usr/bin/wget'
arguments = '--recursive --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {url}'

Selective Archiving

For large sites, consider creating multiple targeted archives instead of one comprehensive archive:

  • Section-specific archives: Archive /docs/, /blog/, etc. separately (see the example after this list)
  • Time-based archives: Archive recent content with higher frequency
  • Priority-based archives: Archive important pages with deeper crawling
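
For instance, a pattern-scoped media type for a documentation section might look like the following, building on the patterns key from the custom example above (the name and pattern are illustrative):

[[media_types]]
name = 'docs-archive'
patterns = ['^https?://example\.com/docs/.*']
tool = 'internal'
tool_path = ''
arguments = '--max-depth=3 --rate-limit=2000 --tool=auto'
extractor = 'readability'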

Integration with Other Tools

Web archives work well with other GatherHub features:

  • Content Extraction: Archived pages are automatically indexed for search
  • Tagging: Tag archives by topic, importance, or date
  • Event Hooks: Trigger notifications or processing when archives complete
  • Scheduling: Automatically update archives on a schedule

Tip: Web archives are perfect for preserving documentation, research materials, or any website content that might change or disappear over time. They provide a complete, self-contained snapshot that can be browsed offline indefinitely.