Web Archives

GatherHub's web archive feature creates comprehensive, offline-browsable archives of websites with automatic link rewriting and navigation indexes.

Note: Web archives are automatically created when you add a URL with the "web-archive" media type. The system will crawl the site, download all pages, and create a browsable offline archive.

Overview

Web archives in GatherHub go beyond simple page saving. They create complete, self-contained archives of websites that can be browsed offline exactly as they appeared online. The system automatically:

  • Crawls the website to discover all linked pages within the same domain
  • Downloads all pages with their resources (CSS, JavaScript, images)
  • Rewrites internal links to work in the offline archive
  • Creates a navigation index for easy browsing
  • Generates metadata about the archive contents

How It Works

1. URL Discovery

The system starts with your provided URL and crawls the website to discover all internal links. It respects domain boundaries and configurable depth limits.
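
Conceptually, the discovery step is a breadth-first traversal bounded by crawl depth and domain. The following Python sketch shows the idea; the function names and details are illustrative, not GatherHub's actual implementation:

import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url, user_agent="GatherHub-Archiver/1.0"):
    """Download a page as text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def discover_urls(start_url, max_depth=2, rate_limit_ms=2000):
    """Breadth-first crawl that stays on the starting domain."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url
        if depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(fetch_html(url))
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            # Stay on the same domain and skip pages already queued.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
        time.sleep(rate_limit_ms / 1000)  # rate limiting between requests


for page_url in discover_urls("https://example.com", max_depth=2):
    print(page_url)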

2. Content Download

Each discovered page is downloaded using monolith or single-file, preserving all resources and ensuring pages display correctly offline.
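
As a rough illustration, a single page could be saved with the monolith command-line tool like this (the paths are illustrative; GatherHub performs this step internally):

import subprocess
from pathlib import Path


def download_page(url, dest):
    """Save a self-contained copy of `url` using the monolith CLI."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # monolith inlines CSS, JavaScript, and images into a single HTML file.
    subprocess.run(["monolith", url, "-o", str(dest)], check=True)


download_page("https://example.com/about", "web-archive/example.com/about/index.html")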

3. Link Rewriting

Internal links are automatically rewritten to point to the local archived files, maintaining the site's navigation structure.

4. Index Generation

A comprehensive index page is created, listing all archived pages with their titles and providing easy navigation.
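
In essence, this step collects each page's title and archived path and renders them into a single HTML listing. A minimal sketch of the idea (the markup is simplified; the real index carries more metadata than this):

from html import escape
from pathlib import Path


def write_index(pages, out_file):
    """Write a minimal navigation index; `pages` holds (title, relative_path) pairs."""
    items = "\n".join(
        f'    <li><a href="{escape(path)}">{escape(title)}</a></li>'
        for title, path in pages
    )
    html = (
        "<!DOCTYPE html>\n<html>\n<head><title>Web Archive Index</title></head>\n<body>\n"
        f"  <h1>Archived Pages</h1>\n  <ul>\n{items}\n  </ul>\n</body>\n</html>\n"
    )
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(html, encoding="utf-8")


write_index(
    [("Home", "example.com/index.html"), ("About", "example.com/about/index.html")],
    "web-archive/index.html",
)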

Archive Structure

Web archives are organized in a clear, hierarchical structure:

web-archive/
├── archive_summary.txt     # Archive metadata and statistics
├── index.html             # Main navigation index
└── domain.com/            # Domain-specific directory
    ├── index.html         # Site homepage
    ├── about/
    │   └── index.html     # About page
    ├── products/
    │   ├── index.html     # Products listing
    │   └── item-1/
    │       └── index.html # Individual product page
    └── contact/
        └── index.html     # Contact page

Configuration

Web archive behavior can be customized through the media type configuration:

Basic Configuration


[[media_types]]
name = 'web-archive'
extensions = []
domains = []
tool = 'internal'
tool_path = ''
arguments = '--max-depth=2 --rate-limit=2000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'

Configuration Options

Option          Default                       Description
--max-depth     2                             Maximum crawl depth from the starting URL
--rate-limit    2000                          Delay between requests in milliseconds
--user-agent    "GatherHub-WebArchive/1.0"    User agent string for requests
--verbose       false                         Enable detailed logging during crawl
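
For example, a deeper but slower crawl can be configured by adjusting the arguments string (the values below are illustrative):

[[media_types]]
name = 'web-archive'
extensions = []
domains = []
tool = 'internal'
tool_path = ''
arguments = '--max-depth=3 --rate-limit=3000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'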

Archive Features

Link Rewriting

One of the most powerful features of GatherHub's web archives is automatic link rewriting. When pages are archived, all internal links are automatically updated to point to the local archived versions (a simplified sketch of the mapping follows the list below):

  • Absolute URLs like https://example.com/about become example.com/about/index.html
  • Relative URLs are preserved and work correctly within the archive structure
  • Fragment links (anchors) continue to work within pages
  • External links are left unchanged and will open in the browser when clicked
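
The mapping for absolute internal URLs can be illustrated with a short Python sketch. The function name and layout assumptions are ours, not GatherHub's actual implementation; in particular, adjusting the result relative to the linking page's own directory is omitted for brevity:

from urllib.parse import urlparse


def rewrite_link(href, archive_domain):
    """Map an internal absolute URL onto its archived location."""
    if href.startswith("#"):
        return href  # in-page anchor, nothing to rewrite
    parsed = urlparse(href)
    if parsed.netloc and parsed.netloc != archive_domain:
        return href  # external link, left unchanged
    if not parsed.netloc:
        return href  # relative link, already valid inside the archive
    path = parsed.path.strip("/")
    local = f"{archive_domain}/{path}/index.html" if path else f"{archive_domain}/index.html"
    return f"{local}#{parsed.fragment}" if parsed.fragment else local


print(rewrite_link("https://example.com/about", "example.com"))
# -> example.com/about/index.html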

Navigation Index

Every web archive includes a comprehensive navigation index (index.html) that provides:

  • Archive overview with creation date and statistics
  • Complete page listing with titles and descriptions
  • Hierarchical organization reflecting the site structure
  • Search functionality for finding specific pages
  • Archive metadata including crawl settings and results

Metadata Preservation

The archive process preserves important metadata about each page (see the illustrative record after this list):

  • Page titles extracted from HTML title tags
  • URLs both original and archived locations
  • File sizes for storage management
  • Crawl timestamps for version tracking
  • Link relationships between pages
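
Conceptually, each archived page's metadata can be thought of as a record like the following (the field names are illustrative, not GatherHub's actual schema):

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ArchivedPage:
    """Illustrative metadata record for one archived page."""
    title: str                # extracted from the HTML <title> tag
    original_url: str         # where the page lived online
    archived_path: str        # location inside the archive directory
    size_bytes: int           # file size, for storage management
    crawled_at: datetime      # crawl timestamp, for version tracking
    links_to: list = field(default_factory=list)  # link relationships between pages


page = ArchivedPage(
    title="About Us",
    original_url="https://example.com/about",
    archived_path="example.com/about/index.html",
    size_bytes=48_213,
    crawled_at=datetime.now(),
    links_to=["https://example.com/contact"],
)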

Best Practices

Choosing Crawl Depth

Important: Be careful with crawl depth settings. Large sites can generate thousands of pages, consuming significant storage space and bandwidth.

  • Depth 1: Only the starting page and its direct links
  • Depth 2: Includes pages linked from the direct links (recommended for most sites)
  • Depth 3+: Use cautiously, can result in very large archives

Rate Limiting

Always use appropriate rate limiting to be respectful to the target website:

  • 2000ms (2 seconds): Conservative, good for most sites
  • 1000ms (1 second): Moderate, suitable for robust sites
  • 500ms or less: Use only for your own sites or with permission

Storage Considerations

  • Monitor archive sizes: Web archives can grow large quickly
  • Use selective archiving: Consider archiving specific sections rather than entire sites
  • Regular cleanup: Remove outdated archives to manage storage

Viewing Archives

Web archives can be viewed in several ways:

Through GatherHub Web Interface

  1. Navigate to the completed job in the Jobs page
  2. Click on the job to view its details
  3. In the Files section, click on index.html to open the archive
  4. Browse the archive using the navigation index

Direct File Access

  1. Navigate to the archive directory in your file system
  2. Open index.html in any web browser
  3. Use the navigation index to explore the archived site
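
If your browser restricts resources loaded from file:// URLs, a simple workaround is to serve the archive directory with any local web server, for example (replace the path with your actual archive directory):

cd /path/to/web-archive
python -m http.server 8000

Then open http://localhost:8000/index.html in your browser.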

Archive URLs

When viewing through GatherHub's web interface, archive files are accessible via URLs like:

  • http://localhost:8060/file/{job_id}/index.html - Main archive index
  • http://localhost:8060/file/{job_id}/domain.com/page/index.html - Specific archived page

Troubleshooting

Common Issues

Missing pages in the archive

  • Check the crawl depth setting - increase if needed
  • Verify that pages are properly linked (some pages may not be discoverable)
  • Check for JavaScript-heavy sites that may require different archiving tools
  • Review the archive summary for any errors during crawling

Links do not work in the archive

  • Ensure you're viewing the archive through a web server (not file:// URLs)
  • Check that the target pages were actually archived
  • Verify that the link rewriting process completed successfully
  • Some dynamic links may not be rewritable and will remain as external links

Crawl failures or oversized archives

  • Reduce the crawl depth to limit the number of pages
  • Increase the rate limit to be more respectful to the target site
  • Check if the site blocks automated crawling (robots.txt, rate limiting)
  • Verify that monolith or single-file tools are properly installed
  • Check network connectivity and DNS resolution

Advanced Usage

Custom Archiving Tools

While GatherHub includes built-in web archiving capabilities, you can also configure it to use external tools:

[[media_types]]
name = 'custom-web-archive'
patterns = ['^https?://special-site\.com/.*']
tool = 'wget'
tool_path = '/usr/bin/wget'
arguments = '--recursive --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {url}'

Selective Archiving

For large sites, consider creating multiple targeted archives instead of one comprehensive archive:

  • Section-specific archives: Archive /docs/, /blog/, etc. separately (see the example after this list)
  • Time-based archives: Archive recent content with higher frequency
  • Priority-based archives: Archive important pages with deeper crawling
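
For instance, a pattern-scoped media type for a documentation section might look like the following, building on the patterns key from the custom example above (the name and pattern are illustrative):

[[media_types]]
name = 'docs-archive'
patterns = ['^https?://example\.com/docs/.*']
tool = 'internal'
tool_path = ''
arguments = '--max-depth=3 --rate-limit=2000 --tool=auto'
extractor = 'readability'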

Integration with Other Tools

Web archives work well with other GatherHub features:

  • Content Extraction: Archived pages are automatically indexed for search
  • Tagging: Tag archives by topic, importance, or date
  • Event Hooks: Trigger notifications or processing when archives complete
  • Scheduling: Automatically update archives on a schedule

Tip: Web archives are perfect for preserving documentation, research materials, or any website content that might change or disappear over time. They provide a complete, self-contained snapshot that can be browsed offline indefinitely.