GatherHub's web archive feature creates comprehensive, offline-browsable archives of websites with automatic link rewriting and navigation indexes.
Web archives in GatherHub go beyond simple page saving. They create complete, self-contained archives of websites that can be browsed offline exactly as they appeared online. The system automatically:

1. Crawls the website, starting from your provided URL, to discover all internal links, respecting domain boundaries and configurable depth limits.
2. Downloads each discovered page using monolith or single-file, preserving all resources and ensuring pages display correctly offline.
3. Rewrites internal links to point to the local archived files, maintaining the site's navigation structure.
4. Generates a comprehensive index page listing all archived pages with their titles for easy navigation.
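To illustrate the depth and domain rules, a crawl of a hypothetical example.com with the default --max-depth=2 would proceed roughly like this:

depth 0: https://example.com/                  (starting URL)
depth 1: https://example.com/about             (linked from the homepage)
depth 1: https://example.com/products
depth 2: https://example.com/products/item-1   (linked from the products page)
skipped: https://partner-site.com/             (outside the domain boundary)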
Web archives are organized in a clear, hierarchical structure:
web-archive/
├── archive_summary.txt          # Archive metadata and statistics
├── index.html                   # Main navigation index
└── domain.com/                  # Domain-specific directory
    ├── index.html               # Site homepage
    ├── about/
    │   └── index.html           # About page
    ├── products/
    │   ├── index.html           # Products listing
    │   └── item-1/
    │       └── index.html       # Individual product page
    └── contact/
        └── index.html           # Contact page
Web archive behavior can be customized through the media type configuration:
[[media_types]]
name = 'web-archive'
extensions = []
domains = []
tool = 'internal'
tool_path = ''
arguments = '--max-depth=2 --rate-limit=2000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'
Option | Default | Description |
---|---|---|
--max-depth | 2 | Maximum crawl depth from the starting URL |
--rate-limit | 2000 | Delay between requests in milliseconds |
--user-agent | "GatherHub-WebArchive/1.0" | User agent string for requests |
--verbose | false | Enable detailed logging during crawl |
One of the most powerful features of GatherHub's web archives is automatic link rewriting. When pages are archived, all internal links are automatically updated to point to the local archived versions:
For example, a link to https://example.com/about becomes example.com/about/index.html.
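Other links map the same way, following the directory layout shown above:

https://example.com/ → example.com/index.html
https://example.com/products/item-1 → example.com/products/item-1/index.html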
Every web archive includes a comprehensive navigation index (index.html) that lists every archived page with its title and links to each local copy, giving you a single starting point for browsing the archive.
The archive process preserves important metadata about each page, and overall statistics for the run are recorded in archive_summary.txt.
Always use appropriate rate limiting to be respectful of the target website; the default --rate-limit of 2000 milliseconds leaves two seconds between requests.
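If a site is sensitive to load, you can raise the delay in the media type's arguments. A minimal sketch, reusing the configuration shown earlier with an illustrative 5000 ms delay:

[[media_types]]
name = 'web-archive'
extensions = []
domains = []
tool = 'internal'
tool_path = ''
arguments = '--max-depth=2 --rate-limit=5000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'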
Web archives can be viewed in several ways:

- Through GatherHub's web interface, select index.html to open the archive
- Directly from disk, by opening index.html in any web browser

When viewing through GatherHub's web interface, archive files are accessible via URLs like:

- http://localhost:8060/file/{job_id}/index.html - Main archive index
- http://localhost:8060/file/{job_id}/domain.com/page/index.html - Specific archived page

While GatherHub includes built-in web archiving capabilities, you can also configure it to use external tools:
[[media_types]]
name = 'custom-web-archive'
patterns = ['^https?://special-site\.com/.*']
tool = 'wget'
tool_path = '/usr/bin/wget'
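# {domain} and {url} are placeholders, substituted with the job's domain and starting URL when the tool runs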
arguments = '--recursive --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {url}'
For large sites, consider creating multiple targeted archives instead of one comprehensive archive, scoping each archive to one section of the site, as sketched below.
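One way to do this, assuming a hypothetical site with distinct docs and blog sections (the names, patterns, and depths here are illustrative):

[[media_types]]
name = 'docs-archive'
patterns = ['^https?://example\.com/docs/.*']
tool = 'internal'
tool_path = ''
arguments = '--max-depth=1 --rate-limit=2000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'

[[media_types]]
name = 'blog-archive'
patterns = ['^https?://example\.com/blog/.*']
tool = 'internal'
tool_path = ''
arguments = '--max-depth=2 --rate-limit=2000 --tool=auto --user-agent="GatherHub-Archiver/1.0"'
extractor = 'readability'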
Web archives also work well alongside other GatherHub features.