GatherHub organizes downloaded content in a structured file system. This page explains how storage is configured, how files are organized, and how to customize the storage settings. Files can be stored on the same system as GatherHub or on network storage that the system is connected to.
The storage configuration is defined in the config.toml file under the [storage] section. The default configuration is:
[storage]
base_path = "./downloads"
[storage.by_type]
archives = 'archives/'
audio = 'audio/'
books = 'books/'
documents = 'documents/'
git = 'git/'
html = 'html/'
images = 'images/'
maps = 'maps/'
others = 'others/'
streaming-videos = 'streaming-video/'
torrents = 'torrents/'
videos = 'video/'
web-archives = 'web-archive/'
zims = 'zim/'
GatherHub organizes downloaded content by media type. The default structure is:
downloads/
├── archives/ # ZIP, TAR, and other archives
├── audio/ # MP3, FLAC, OGG, etc
├── books/ # EPUB, MOBI, AZW, etc
├── documents/ # Docs, XLSX, PPT and other documents
├── git/ # Git repositories
├── html/ # Single HTML pages
├── images/ # Images
├── maps/ # Map data
├── others/ # Unrecognized content
├── streaming-video/ # YouTube, Vimeo, etc
├── torrents/ # Magnet & Torrent files
├── video/ # MP4, MOV, etc. Non-streaming
├── web-archive/ # Crawled site HTML pages
└── zim/ # ZIM archives (Wikipedia, etc.)
Within each media type directory, files are stored with names derived from their sources.
You can customize where different media types are stored by modifying the base_path setting and the storage.by_type configuration.
The base_path setting determines the root directory for all downloads. By default, it is ./downloads, which is relative to GatherHub's working directory. You can change this to an absolute path:
[storage]
base_path = "/mnt/external-drive/gatherhub-data"
Each media type can be stored in a custom location. The values in storage.by_type are relative to the base_path:
[storage.by_type]
youtube = "videos/youtube" # Will store in /mnt/external-drive/gatherhub-data/videos/youtube
git = "/opt/repositories" # Will store in /opt/repositories (absolute path)
An absolute path, such as the git value above, is used as-is and is not joined to the base_path setting.
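The resolution rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not GatherHub's own code; the function name `resolve_storage_dir` is hypothetical, and the dictionary stands in for the parsed [storage.by_type] table.

```python
from pathlib import PurePosixPath


def resolve_storage_dir(base_path: str, by_type: dict[str, str], media_type: str) -> PurePosixPath:
    """Return the directory for a media type (hypothetical helper)."""
    # Joining an absolute right-hand path discards the left-hand side,
    # so absolute entries in by_type are returned unchanged.
    return PurePosixPath(base_path) / by_type[media_type]


# Values copied from the example configuration above.
base = "/mnt/external-drive/gatherhub-data"
by_type = {"youtube": "videos/youtube", "git": "/opt/repositories"}

print(resolve_storage_dir(base, by_type, "youtube"))
print(resolve_storage_dir(base, by_type, "git"))
```

The first call yields a path under the base directory, while the second yields /opt/repositories, matching the comments in the configuration example.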
Different media types have specific storage behaviors:
Media Type | Storage Format | File Naming |
---|---|---|
HTML | Complete webpage with resources | Domain-based directory with timestamp |
YouTube | Video file + metadata | Video ID or title-based filename |
Git | Cloned repository + ZIP archive | username_repository format |
Images | Original image files | Original filename or hash-based name |
Documents | PDF, DOCX, etc. in original format | Original filename or derived from title |
Besides the media-specific directories, GatherHub creates and uses several special files:
- cookies.txt: Stored in the base path, contains YouTube authentication cookies
- .gitconfig: Stored in the git directory, contains Git configuration for repository cloning
- *.zip: For Git repositories, ZIP archives created for easy download through the web interface

GatherHub sets appropriate permissions for downloaded files:
- 0644 (read/write for owner, read for group/others)
- 0755 (read/write/execute for owner, read/execute for group/others)
- 0600 (read/write for owner only)

GatherHub doesn't currently enforce disk space limits, so it's important to monitor available space, especially when downloading large media like YouTube videos or Git repositories.
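Because no limit is enforced, a small external check can help with monitoring. The following is a minimal sketch using Python's standard library; the helper names and the threshold are hypothetical, and "." is a placeholder that you would replace with your base_path.

```python
import shutil


def free_space_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)


def has_enough_space(path: str, min_free_gib: float) -> bool:
    """Return True when at least min_free_gib GiB are free (hypothetical threshold)."""
    return free_space_gib(path) >= min_free_gib


# "." is a placeholder; point this at the configured base_path instead.
print(f"{free_space_gib('.'):.1f} GiB free")
```

A script like this could be run from a scheduler before queuing large downloads; it reports on the whole filesystem, not only on GatherHub's directories.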
The auto-clean feature can help manage storage by removing old job records from the database:
[auto_clean]
enabled = true
retry_failed = true
max_retries = 3
clean_after_days = 30
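The auto-clean functionality has two independent components:
- Handling of failed jobs (controlled by max_retries).
- Cleanup of completed jobs (controlled by clean_after_days).

To keep completed job records while still handling failed jobs, set enabled = true and clean_after_days = 0. A value of 0 for clean_after_days disables the cleanup of completed jobs while still allowing failed jobs to be cleaned up.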
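The rule for completed jobs can be sketched as follows. This models only what the text states (the enabled switch and the special meaning of 0); the class and function names are hypothetical, the inclusive age comparison is an assumption, and the handling of failed jobs is left out because its details are not described here.

```python
from dataclasses import dataclass


@dataclass
class AutoCleanConfig:
    """Mirror of the [auto_clean] keys, with the defaults shown above."""
    enabled: bool = True
    retry_failed: bool = True
    max_retries: int = 3
    clean_after_days: int = 30


def should_remove_completed_record(cfg: AutoCleanConfig, age_days: int) -> bool:
    """Decide whether a completed job record is old enough to be removed."""
    if not cfg.enabled:
        return False  # auto-clean is switched off entirely
    if cfg.clean_after_days == 0:
        return False  # 0 disables cleanup of completed jobs
    # Assumption: a record is removed once it is at least clean_after_days old.
    return age_days >= cfg.clean_after_days


print(should_remove_completed_record(AutoCleanConfig(), age_days=45))
print(should_remove_completed_record(AutoCleanConfig(clean_after_days=0), age_days=45))
```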
Note that auto-clean only removes job records from the database, not the actual downloaded files. Files remain on disk even after their job records are removed.
To manage disk space, you can manually delete files from the media-type directories. GatherHub uses the database to track downloads, so deleting files won't affect the system's operation, but you won't be able to access those files through the web interface anymore.
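Before deleting anything by hand, it can help to list candidates first. The sketch below only reports files whose modification time is older than a given number of days and deletes nothing; the function name and the age threshold are hypothetical choices, not part of GatherHub.

```python
import time
from pathlib import Path


def files_older_than(root: str, days: float) -> list[Path]:
    """Return files under root last modified more than the given number of days ago."""
    cutoff = time.time() - days * 86400  # 86400 seconds per day
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

For example, `files_older_than("./downloads/video", 90)` would return the video files untouched for roughly three months, which you can then review and remove manually.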
You can check storage usage using standard system commands:
# Total usage of the downloads directory
du -sh ./downloads/
# Usage by media type
du -sh ./downloads/*/
You can change storage settings in two ways:
- Edit config.toml and restart GatherHub
- Go to Settings > Storage to modify settings through the UI