GatherHub Overview

GatherHub is a comprehensive web archiving and data collection system designed to help you save and organize online content from various sources. Whether you need it for historical reference, during an internet outage, or at the end of the world, you'll have access to the data you deem important.

Use Cases

Personal Web Archiving

Create a personal archive of important web pages, articles, manuals, books, videos, repositories, and other resources you discover. Save content before it disappears from the internet.

Archive your browser bookmarks automatically, either directly or from other tools
Save complete web pages with all resources
Organize content with tags and find it with search

Research Collection

Build research collections by automatically downloading and organizing content from multiple sources. Create specialized archives for specific research topics.

Download academic papers and documents
Archive data sources and references
Organize by project or research area

Code Repository Backup

Keep backups of important GitHub, GitLab, and other repositories you depend on. Ensure you always have access to critical dependencies.

Clone repositories with full history
Schedule periodic updates
Organize and tag repositories by project

Media Collection

Collect and organize media content such as videos, images, audio, and documents from various online sources into a personal media library.

Download YouTube videos and playlists
Save image collections and galleries
Organize media by type, tags, and source

Core Features

Key Concept: GatherHub treats every download as a "job" that is tracked throughout its lifecycle, from discovery to archiving.

Browser Integration

Automatically scan and import bookmarks, browser history, and databases from Firefox, Chrome, and other browsers. Create a personal archive of your browsing activity.

Key Concept: GatherHub is not a bookmark-syncing tool. It ingests links from your sources so that it can download the content they point to.

Multi-Format Downloads

Out-of-the-box support for various media types including HTML pages, PDFs, images, videos, audio, Git repositories, and more, all with specialized storage and handling.

Key Concept: You can change which tool is used for each media type, and which media types are supported, in the config.toml file or through the Settings interface.
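
The idea of choosing a download tool per media type can be sketched as a simple lookup. This is a minimal illustration, not GatherHub's implementation: the tool names, the extension table, and the functions classify and pick_tool are all hypothetical, and real classification would likely look at more than the file extension.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical media-type -> tool mapping. These names are placeholders for
# illustration only; they are not GatherHub's actual defaults or config keys.
DEFAULT_TOOLS = {
    "html": "page-archiver",
    "pdf": "file-fetcher",
    "video": "video-downloader",
    "git": "git-cloner",
}

# Assumed extension-based classification; anything unknown is treated as HTML.
EXTENSIONS = {".pdf": "pdf", ".mp4": "video", ".webm": "video", ".git": "git"}


def classify(url: str) -> str:
    """Guess a media type from the URL path's file extension."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return EXTENSIONS.get(ext, "html")


def pick_tool(url: str, overrides=None) -> str:
    """Return the tool for a URL, letting user overrides replace defaults."""
    tools = {**DEFAULT_TOOLS, **(overrides or {})}
    return tools[classify(url)]
```

Passing an overrides dictionary plays the role that a configuration file or settings page would play in a real system: the defaults stay in place and only the listed media types change.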

Job Management

Track and manage download jobs through their entire lifecycle. Pause, resume, retry, and monitor progress through an intuitive web interface.

Key Concept: Everything is a job, with tracked history on the Job Detail pages. Retry failed jobs all at once from the main dashboard, or one at a time.
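
A job lifecycle with pause, resume, and retry can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the state names, the allowed transitions, and the Job and retry_failed names are hypothetical and may not match how GatherHub tracks jobs internally.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    # Hypothetical state names, chosen for illustration only.
    PENDING = "pending"
    DOWNLOADING = "downloading"
    PAUSED = "paused"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: which states may follow which.
ALLOWED = {
    State.PENDING: {State.DOWNLOADING},
    State.DOWNLOADING: {State.PAUSED, State.COMPLETED, State.FAILED},
    State.PAUSED: {State.DOWNLOADING},   # resume
    State.FAILED: {State.PENDING},       # retry
    State.COMPLETED: set(),
}


@dataclass
class Job:
    url: str
    state: State = State.PENDING
    history: list = field(default_factory=list)

    def move_to(self, new_state: State) -> None:
        """Apply a transition and record it, rejecting invalid moves."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.history.append((self.state, new_state))
        self.state = new_state


def retry_failed(jobs) -> int:
    """Requeue every failed job and return how many were retried."""
    retried = 0
    for job in jobs:
        if job.state is State.FAILED:
            job.move_to(State.PENDING)
            retried += 1
    return retried
```

Recording each transition in history is what makes a per-job detail view possible: the current state says where a job is, and the history says how it got there.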

Integrated Search

Full-text search across all indexed content including HTML, PDF, documents, and video metadata. Search by content, title, URL, tags, or any metadata field.

Key Concept: Content is automatically indexed when downloaded, or can be manually reindexed. The search system supports over 20 file types and includes specialized features for video content.
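
Full-text search of this kind is commonly built on an inverted index, which maps each word to the items that contain it. The sketch below shows that general technique under simplifying assumptions (lowercase alphanumeric tokens, all query words must match). The SearchIndex class is hypothetical and says nothing about which search engine GatherHub actually uses.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class SearchIndex:
    """Toy inverted index: token -> set of job ids containing that token."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, job_id, *fields):
        """Index any number of text fields (title, URL, body, tags) for a job."""
        for text in fields:
            for token in TOKEN.findall(text.lower()):
                self._postings[token].add(job_id)

    def search(self, query):
        """Return ids of jobs that contain every token in the query."""
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return set()
        matches = [self._postings.get(token, set()) for token in tokens]
        return set.intersection(*matches)
```

Reindexing an item in this model would mean removing its id from every posting set and calling add again with the fresh content.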

Tagging System

Organize your archive with a flexible tagging system. Categorize content and find it quickly with tag-based search and filtering.

Key Concept: Depending on the source, its tags may be ingested along with its links. Ad hoc tags can be assigned from the Job Detail pages.

API and Automation

RESTful API for integration with other tools. Supports automation with event hooks for post-processing and notification actions.

Key Concept: The system is integration-friendly. The API server can run independently of the web service. Event hooks add considerable extra functionality and can be written in any language.
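
As a rough idea of what an event hook might do, the sketch below turns an event payload into a notification line. Everything about the interface is assumed for illustration: the JSON payload, the "event" and "url" fields, the "job_failed" event name, and the handle_event function are hypothetical. Consult the hook documentation for the real payload format and how hooks are invoked.

```python
import json


def handle_event(raw: str) -> str:
    """Turn a JSON event payload (assumed format) into a notification line."""
    event = json.loads(raw)
    name = event.get("event", "unknown")   # hypothetical field name
    url = event.get("url", "")             # hypothetical field name
    if name == "job_failed":                # hypothetical event name
        return f"ALERT: download failed for {url}"
    return f"{name}: {url}"
```

A real hook script would read its payload from wherever the hook system delivers it, pass it to a function like this, and then send the result to a chat webhook, a log file, or an email gateway.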

Scheduling

Schedule automatic scanning and downloading. Run as a daemon or periodic service to keep your archive updated without manual intervention.

Key Concept: For testing, you can process downloads manually. For ongoing use, it's advisable to run the daemon process.
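
Interval-based scheduling boils down to comparing the time since a task last ran against its configured interval. The sketch below illustrates that check only. The task names, the due_tasks function, and the shape of its arguments are hypothetical and are not taken from GatherHub's scheduler.

```python
from datetime import datetime, timedelta


def due_tasks(last_runs, intervals, now):
    """Return the names of tasks whose interval has elapsed, sorted by name.

    last_runs: task name -> datetime of the last run (missing means never run)
    intervals: task name -> timedelta between runs
    """
    due = []
    for name, interval in intervals.items():
        last = last_runs.get(name)
        if last is None or now - last >= interval:
            due.append(name)
    return sorted(due)
```

A daemon would call a check like this in a loop, run whatever is due, and record the new run times. Manual processing skips the check and runs the task immediately.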

System Architecture

GatherHub is built with a modular architecture consisting of several core components that work together:

Component	Description
Scanner	Discovers new URLs from configured sources such as browser bookmarks, history databases, or external sources.
Downloader	Handles the actual download process, selecting appropriate tools based on media type.
Scheduler	Manages when scans and downloads occur, with configurable intervals and conditions.
Job Queue	Maintains the state of all download jobs through their entire lifecycle.
Storage Manager	Organizes downloaded content by media type in the appropriate location.
Web Interface	Browser-based user interface for monitoring and managing the system.
API Server	REST API providing programmatic access to GatherHub functionality.
Event Hooks	Plugin system for custom post-processing and integration with external systems.

Getting Started

Ready to start using GatherHub? Here are the next steps:

Installation

Set up GatherHub on your system

Installation Guide

Quick Start

Get up and running quickly

Quick Start Guide

Configuration

Configure GatherHub for your needs

Configuration Guide