GatherHub is a comprehensive web archiving and data collection system designed to help you save and organize online content from various sources. Whether for historical preservation, an internet outage, or the end of the world, you'll have access to the data you deem important.
Create a personal archive of important web pages, articles, manuals, books, videos, repositories, and other resources you discover. Save content before it disappears from the internet.
Build research collections by automatically downloading and organizing content from multiple sources. Create specialized archives for specific research topics.
Keep backups of important GitHub, GitLab, and other repositories you depend on. Ensure you always have access to critical dependencies.
Collect and organize media content such as videos, images, audio, and documents from various online sources into a personal media library.
Automatically scan and import bookmarks, browser history, and databases from Firefox, Chrome, and other browsers. Create a personal archive of your browsing activity.
Out-of-the-box support for various media types, including HTML pages, PDFs, images, videos, audio, and git repositories, each with specialized storage and handling.
Track and manage download jobs through their entire lifecycle. Pause, resume, retry, and monitor progress through an intuitive web interface.
Full-text search across all indexed content including HTML, PDF, documents, and video metadata. Search by content, title, URL, tags, or any metadata field.
Organize your archive with a flexible tagging system. Categorize content and find it quickly with tag-based search and filtering.
RESTful API for integration with other tools. Supports automation with event hooks for post-processing and notification actions.
Schedule automatic scanning and downloading. Run as a daemon or periodic service to keep your archive updated without manual intervention.
GatherHub is built with a modular architecture consisting of several core components that work together:
Component | Description |
---|---|
Scanner | Discovers new URLs from configured sources such as browser bookmarks, history databases, and other external sources. |
Downloader | Handles the actual download process, selecting appropriate tools based on media type. |
Scheduler | Manages when scans and downloads occur, with configurable intervals and conditions. |
Job Queue | Maintains the state of all download jobs through their entire lifecycle. |
Storage Manager | Organizes downloaded content by media type in the appropriate location. |
Web Interface | Browser-based user interface for monitoring and managing the system. |
API Server | REST API providing programmatic access to GatherHub functionality. |
Event Hooks | Plugin system for custom post-processing and integration with external systems. |
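
To make the Job Queue's role more concrete, the sketch below models a download job moving through its lifecycle (queue, run, pause, resume, retry). It is an illustration only and not GatherHub's actual implementation: the state names, the `ALLOWED` transition table, and the `Job` class are all hypothetical and chosen just to mirror the pause, resume, and retry actions described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    # Hypothetical state names; GatherHub's real states may differ.
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    FAILED = "failed"
    DONE = "done"


# Hypothetical transition table: the states a job may move to from each state.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.PAUSED, JobState.FAILED, JobState.DONE},
    JobState.PAUSED: {JobState.RUNNING},   # resume
    JobState.FAILED: {JobState.PENDING},   # retry puts the job back in the queue
    JobState.DONE: set(),                  # terminal state
}


@dataclass
class Job:
    url: str
    media_type: str
    state: JobState = JobState.PENDING
    history: list = field(default_factory=list)

    def move_to(self, new_state: JobState) -> None:
        """Apply a transition, rejecting any move the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.history.append(self.state)  # record the state being left
        self.state = new_state


if __name__ == "__main__":
    job = Job("https://example.com/manual.pdf", "pdf")
    for step in (JobState.RUNNING, JobState.PAUSED, JobState.RUNNING, JobState.DONE):
        job.move_to(step)
    print(job.state.value, [s.value for s in job.history])
```

Keeping the permitted transitions in a single table means an invalid request, such as resuming a job that has already finished, fails loudly instead of silently corrupting the queue state.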
Ready to start using GatherHub? Here are the next steps: