Storage Settings

GatherHub stores downloaded content in a structured directory tree. This page explains how storage is configured, how files are organized, and how to customize the storage settings. Files can be stored on the same system as GatherHub or on network storage that the system can reach.

Storage Configuration

The storage configuration is defined in the [storage] section of the config.toml file. The default configuration is:

[storage]
base_path = "./downloads"

[storage.by_type]
archives = 'archives/'
audio = 'audio/'
books = 'books/'
documents = 'documents/'
git = 'git/'
html = 'html/'
images = 'images/'
maps = 'maps/'
others = 'others/'
streaming-videos = 'streaming-video/'
torrents = 'torrents/'
videos = 'video/'
web-archives = 'web-archive/'
zims = 'zim/'

Directory Structure

GatherHub organizes downloaded content by media type. The default structure is:

downloads/
├── archives/           # ZIP, TAR, and other archives
├── audio/              # MP3, FLAC, OGG, etc.
├── books/              # EPUB, MOBI, AZW, etc.
├── documents/          # DOCX, XLSX, PPT, and other documents
├── git/                # Git repositories
├── html/               # Single HTML pages
├── images/             # Images 
├── maps/               # Map data 
├── others/             # Unrecognized content
├── streaming-video/    # YouTube, Vimeo, etc.
├── torrents/           # Magnet and torrent files
├── video/              # MP4, MOV, etc. (non-streaming)
├── web-archive/        # Crawled site HTML pages 
└── zim/                # ZIM archives (Wikipedia, etc.)

Within each media type directory, files are stored with names derived from their sources.

Path Configuration

You can customize where content is stored by changing the base path and the per-type directories under storage.by_type.

Changing the Base Path

The base_path setting determines the root directory for all downloads. By default, it's ./downloads, which is relative to GatherHub's working directory. You can change this to an absolute path:

[storage]
base_path = "/mnt/external-drive/gatherhub-data"

Changing Media Type Directories

Each media type can be stored in a custom location. The values in storage.by_type are relative to the base_path:

[storage.by_type]
youtube = "videos/youtube"  # Will store in /mnt/external-drive/gatherhub-data/videos/youtube
git = "/opt/repositories"   # Will store in /opt/repositories (absolute path)

Tip: If you provide an absolute path (starting with "/"), it will be used directly, ignoring the base_path setting.
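The resolution rule described above can be sketched in a few lines of Python. The function name resolve_storage_dir is hypothetical, and the sketch assumes POSIX-style paths; it mirrors the documented behavior rather than GatherHub's actual code.

```python
from pathlib import PurePosixPath


def resolve_storage_dir(base_path: str, type_path: str) -> PurePosixPath:
    """Join a storage.by_type value onto base_path (illustrative helper).

    The "/" operator of PurePosixPath discards the left-hand side when the
    right-hand side is absolute, which matches the rule in the tip above.
    """
    return PurePosixPath(base_path) / type_path


base = "/mnt/external-drive/gatherhub-data"
print(resolve_storage_dir(base, "videos/youtube"))
# /mnt/external-drive/gatherhub-data/videos/youtube
print(resolve_storage_dir(base, "/opt/repositories"))
# /opt/repositories
```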

Storage Behavior by Media Type

Different media types have specific storage behaviors:

Media Type	Storage Format	File Naming
HTML	Complete webpage with resources	Domain-based directory with timestamp
YouTube	Video file + metadata	Video ID or title-based filename
Git	Cloned repository + ZIP archive	username_repository format
Images	Original image files	Original filename or hash-based name
Documents	PDF, DOCX, etc. in original format	Original filename or derived from title
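To make the username_repository naming in the table concrete, here is a minimal sketch that derives such a name from an HTTPS repository URL. The function git_storage_name and the example URL are hypothetical; how GatherHub actually handles other URL forms, name collisions, or unusual characters is not described on this page.

```python
from urllib.parse import urlparse


def git_storage_name(repo_url: str) -> str:
    """Build a username_repository name from an HTTPS Git URL.

    Illustrative only: assumes the URL path ends in <username>/<repository>,
    optionally followed by ".git".
    """
    parts = [p for p in urlparse(repo_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected .../<username>/<repository>, got {repo_url!r}")
    username, repository = parts[-2], parts[-1]
    repository = repository.removesuffix(".git")  # Python 3.9+
    return f"{username}_{repository}"


print(git_storage_name("https://github.com/example-user/example-repo.git"))
# example-user_example-repo
```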

Special Files

Besides the media-specific directories, GatherHub creates and uses several special files:

cookies.txt: Stored in the base path, contains YouTube authentication cookies
.gitconfig: Stored in the git directory, contains Git configuration for repository cloning
*.zip: For Git repositories, ZIP archives created for easy download through the web interface

File Permissions

GatherHub sets the following permissions on downloaded files:

Regular content files: 0644 (read/write for owner, read for group/others)
Directories: 0755 (read/write/execute for owner, read/execute for group/others)
Sensitive files (like cookies.txt): 0600 (read/write for owner only)
Executable files (like cloned scripts): preserves original executable permissions
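If you want to check an existing downloads directory against the modes listed above, a sketch along these lines may help. The names expected_mode and find_mismatches are hypothetical, and the sketch assumes a POSIX file system and that cookies.txt is the only sensitive file, since it is the only one named here.

```python
import os
import stat

# Modes listed in the section above.
FILE_MODE = 0o644
DIR_MODE = 0o755
SENSITIVE_MODE = 0o600
# Assumption: cookies.txt is the only sensitive file named in the text.
SENSITIVE_NAMES = {"cookies.txt"}


def expected_mode(name: str, is_dir: bool, current_mode: int) -> int:
    """Return the documented permission bits for one entry."""
    current = stat.S_IMODE(current_mode)
    if is_dir:
        return DIR_MODE
    if name in SENSITIVE_NAMES:
        return SENSITIVE_MODE
    if current & 0o111:
        # Executable files keep whatever permissions they arrived with.
        return current
    return FILE_MODE


def find_mismatches(root: str) -> list[tuple[str, int, int]]:
    """List (path, actual, expected) for entries that differ from the list above."""
    mismatches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode):
                continue  # skip symlinks; their own mode bits are not meaningful
            actual = stat.S_IMODE(st.st_mode)
            wanted = expected_mode(name, stat.S_ISDIR(st.st_mode), st.st_mode)
            if actual != wanted:
                mismatches.append((path, actual, wanted))
    return mismatches
```

Calling find_mismatches("./downloads") returns a list you can review before changing anything; the sketch deliberately reports differences and does not call os.chmod.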

Storage Management

Disk Space

GatherHub doesn't currently enforce disk space limits, so it's important to monitor available space, especially when downloading large media like YouTube videos or Git repositories.
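Because no limit is enforced, an external check can be useful, for example from a scheduled task. The sketch below uses Python's standard shutil.disk_usage; the function name has_free_space and the 10 GiB threshold are arbitrary assumptions for illustration.

```python
import shutil


def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True if the file system holding path has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes


# Example threshold (assumption): warn when less than 10 GiB is free.
TEN_GIB = 10 * 1024**3
if not has_free_space(".", TEN_GIB):
    print("Warning: less than 10 GiB free on the storage volume")
```

In practice you would pass your configured base_path instead of the current directory.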

Auto-Clean Feature

The auto-clean feature can help manage storage by removing old job records from the database:

[auto_clean]
enabled = true
retry_failed = true
max_retries = 3
clean_after_days = 30

The auto-clean functionality has two independent components:

Failed Job Cleanup: Removes failed jobs that have exceeded the maximum retry attempts (controlled by max_retries).
Completed Job Cleanup: Removes completed jobs older than the specified number of days (controlled by clean_after_days).

Tip: To clean up only failed jobs and keep all completed jobs, set enabled = true and clean_after_days = 0. A clean_after_days value of 0 disables cleanup of completed jobs, while failed jobs are still cleaned up.

Note that auto-clean only removes job records from the database, not the actual downloaded files. Files remain on disk even after their job records are removed.
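The two cleanup rules can be sketched as a single decision function. Everything here is illustrative: the Job fields, the status strings, and the function should_remove are hypothetical, and the sketch assumes that a failed job counts as exhausted once its retry count reaches max_retries and that enabled = false disables both rules. The retry_failed setting is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    status: str            # "completed" or "failed" (hypothetical values)
    retries: int           # retry attempts made so far
    finished_at: datetime  # when the job last finished


def should_remove(job: Job, now: datetime, *, enabled: bool,
                  max_retries: int, clean_after_days: int) -> bool:
    """Decide whether auto-clean would drop this job record (sketch only)."""
    if not enabled:
        return False  # assumption: nothing is cleaned when disabled
    if job.status == "failed":
        # Assumption: reaching max_retries counts as exhausted.
        return job.retries >= max_retries
    if job.status == "completed":
        if clean_after_days == 0:
            return False  # 0 disables completed-job cleanup
        return now - job.finished_at > timedelta(days=clean_after_days)
    return False


now = datetime(2024, 6, 1)
old_job = Job("completed", 0, datetime(2024, 1, 1))
print(should_remove(old_job, now, enabled=True, max_retries=3, clean_after_days=30))
# True
print(should_remove(old_job, now, enabled=True, max_retries=3, clean_after_days=0))
# False
```

In both cases only the database record would be affected; the downloaded files stay on disk.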

Manual Cleanup

To free disk space, you can manually delete files from the media-type directories. Because GatherHub tracks downloads in its database, deleting files does not affect the system's operation, but those files are no longer accessible through the web interface.

Checking Storage Usage

You can check storage usage using standard system commands:

# Total usage of the downloads directory
du -sh ./downloads/

# Usage by media type
du -sh ./downloads/*/

Changing Storage Settings

You can change storage settings through:

Configuration File: Edit config.toml and restart GatherHub.
Web Interface: Go to Settings > Storage to modify settings through the UI.

Warning: Changing storage paths does not automatically move existing files. If you change paths, you may need to manually move files to the new locations.
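When planning such a move, it can help to list which directories actually change between the old and new settings. The sketch below compares two configurations and returns the source and destination pairs; the function plan_moves is hypothetical, assumes POSIX-style paths, and only plans the moves without touching any files.

```python
from pathlib import PurePosixPath


def plan_moves(old_base: str, old_by_type: dict[str, str],
               new_base: str, new_by_type: dict[str, str]) -> list[tuple[str, str]]:
    """Return (old_dir, new_dir) pairs for media types whose directory changed."""
    moves = []
    for media_type in sorted(old_by_type.keys() & new_by_type.keys()):
        # Absolute by_type values replace the base path, as described earlier.
        src = PurePosixPath(old_base) / old_by_type[media_type]
        dst = PurePosixPath(new_base) / new_by_type[media_type]
        if src != dst:
            moves.append((str(src), str(dst)))
    return moves


old = {"audio": "audio/", "git": "git/"}
new = {"audio": "audio/", "git": "/opt/repositories"}
print(plan_moves("./downloads", old, "./downloads", new))
# [('downloads/git', '/opt/repositories')]
```

Each pair can then be moved by hand or with shutil.move. Doing this while GatherHub is stopped is likely the safer choice, although this page does not state a requirement either way.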