Many web pages include a significant amount of content that isn't part of the main article or document - things like navigation menus, advertisements, sidebars, footers, and other distractions. The content extraction feature in GatherHub uses advanced algorithms and flexible processing pipelines to identify and extract just the meaningful content, greatly improving readability and making the content more usable for search and analysis.
GatherHub's content extraction feature supports:

- HTML article extraction using a readability algorithm
- Text extraction from PDFs, documents, and ebooks
- OCR text extraction from comic book archives (CBZ/CBR) and standalone images
- Multiple output formats: plain text, JSON, and HTML
- Internal, external, and chain extractors that you can configure and extend

The content extraction process follows these steps:

1. A download completes and the `post_download` hook runs the `extract_content` script.
2. The file is matched to an extractor by its extension, with higher-priority extractors tried first.
3. The extracted content (and optional metadata) is written to the configured `output_dir` in each of the configured `output_formats`.
Content extraction is configured in the `config.toml` file with three main sections:
```toml
# Global extraction settings
[extraction]
enabled = true
output_dir = './downloads/extracted'
output_formats = ['text', 'json', 'html']
include_metadata = true
supported_types = ['html', 'books', 'documents']

# Individual extractors
[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10

# Hook configuration
[[event_hooks.hooks]]
event = 'post_download'
script = 'extract_content'
enabled = true
arguments = { output_formats = ["text", "json", "html"], output_dir = "./downloads/extracted" }
```
| Option | Description | Default |
|---|---|---|
| `enabled` | Enable or disable content extraction | `true` |
| `output_dir` | Directory where extracted content is saved | `"./downloads/extracted"` |
| `output_formats` | Array of output formats (`text`, `json`, `html`) | `["text", "json"]` |
| `include_metadata` | Whether to include metadata in the output | `true` |
| `supported_types` | Media types that should be processed for extraction | `["html", "books", "documents"]` |
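For example, a trimmed-down configuration that writes only plain text to a custom directory might look like the following sketch (the directory path and chosen media types are illustrative, not defaults):

```toml
[extraction]
enabled = true
output_dir = './downloads/text-only'      # illustrative path
output_formats = ['text']                 # skip JSON and HTML output
include_metadata = false
supported_types = ['html', 'documents']   # only these media types are extracted
```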
Extractors define how content is processed for specific file types:
| Option | Description | Example |
|---|---|---|
| `name` | Unique identifier for the extractor | `readability` |
| `extensions` | File extensions this extractor handles | `[".html", ".htm"]` |
| `type` | Extractor type: `internal`, `external`, or `chain` | `internal` |
| `priority` | Priority value (higher values tried first) | `10` |
| `command` | External command to execute (for `external` type) | `/usr/bin/pdftotext` |
| `arguments` | Command arguments with placeholders | `-layout "{input}" "{output}"` |
| `steps` | Processing steps for chain extractors | [See chain extractor example] |
The hook configuration specifies when extraction is automatically triggered:
| Option | Description | Default |
|---|---|---|
| `event` | The event that triggers content extraction | `'post_download'` |
| `script` | Hook script name | `'extract_content'` |
| `enabled` | Enable or disable the hook | `true` |
| `arguments` | Hook-specific configuration options | `{ output_formats = ["text", "json"] }` |
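As a sketch of how these options fit together, a hook that runs extraction after every download but only produces JSON output could look like this (the output directory value is illustrative):

```toml
[[event_hooks.hooks]]
event = 'post_download'
script = 'extract_content'
enabled = true
arguments = { output_formats = ["json"], output_dir = "./downloads/extracted" }
```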
Comic book files (CBZ and CBR) support OCR text extraction using Tesseract. This feature extracts text from comic book pages, making the content searchable and indexable.
OCR behavior is configured in the `[extraction.ocr]` section of `config.toml`:
```toml
[extraction.ocr]
enabled = true
max_pages = 15
timeout_secs = 30
```
| Option | Description | Default |
|---|---|---|
| `enabled` | Enable or disable OCR text extraction | `true` |
| `max_pages` | Maximum number of pages to process with OCR | `15` |
| `timeout_secs` | Timeout in seconds for OCR processing per page | `30` |
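For long comic books you may want to raise these limits. A sketch with illustrative values (higher limits mean longer processing times):

```toml
[extraction.ocr]
enabled = true
max_pages = 40      # OCR more pages per archive
timeout_secs = 60   # give slower pages more time before giving up
```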
OCR stops at the `max_pages` limit; pages beyond it are noted in the output but not processed.

When OCR is successful, the extracted content will include:
```text
Comic book with 24 pages:

Page 1: [No text detected]
Page 2: KNOWN ONLY AS CABLE!
... (more pages)
... and 9 more pages (OCR limited to first 15 pages)
```
OCR text extraction for comic books requires `enabled = true` in the `[extraction.ocr]` section.

In addition to comic book archives, GatherHub supports OCR text extraction from standalone image files. This feature allows you to extract text from screenshots, scanned documents, photos of text, infographics, and any other images containing readable text.
Image OCR uses the same configuration as comic book OCR in the `[extraction.ocr]` section:
```toml
[[extractors]]
name = 'image'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp', '.svg', '.heif', '.heic']
type = 'internal'
priority = 20

# OCR configuration applies to both comic books and images
[extraction.ocr]
enabled = true
max_pages = 15      # For images, this is effectively 1
timeout_secs = 30   # Timeout per image
```
When processing a screenshot or document image, the output might look like:
{ "Title": "dashboard", "PlainText": "Welcome to GatherHub, a content downloader for offline data horders.\nJob Status: Pending, Downloading, Completed, Failed\nMedia Types: audio, books, documents...", "Metadata": { "Extractor": "image+ocr", "OCR_Engine": "tesseract", "Length": "66 words" }, "WordCount": 66, "SourcePath": "./screenshots/dashboard.png", "ContentType": "image/png" }
The configuration file includes commented examples showing alternative approaches to image processing:
For direct control over Tesseract parameters:
```toml
# Example: Alternative external image extractor using tesseract directly
[[extractors]]
name = 'image-external'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp']
type = 'external'
command = '/usr/bin/tesseract'
arguments = '{input} {output} -l eng'
priority = 15
```
For improved OCR results through image preprocessing:
```toml
# Example: Chain extractor for image preprocessing + OCR
[[extractors]]
name = 'image-chain'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp']
type = 'chain'
priority = 25
steps = [
  { command = '/usr/bin/convert', args = '{input} -enhance -sharpen 0x1 {output}.enhanced.png' },
  { command = '/usr/bin/tesseract', args = '{output}.enhanced.png {output} -l eng' }
]
```
These examples demonstrate how you can:

- Call Tesseract directly as an external extractor for full control over its parameters
- Add an image preprocessing step before OCR to improve recognition quality
- Change the OCR language by adjusting the `-l eng` argument
The system supports three types of extractors:
Internal extractors are built-in code that processes specific file types.
```toml
[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10
```
Built-in extractors include:
- `readability`: HTML extraction using go-readability
- `plaintext`: Plain text file processing
- `pdf`: Basic PDF handling
- `docx`: Document handling
- `epub`: Ebook handling
- `cbz`: Comic book ZIP archives with OCR text extraction
- `cbr`: Comic book RAR archives with OCR text extraction

External extractors use command-line tools installed on your system for extraction.
```toml
[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
temp_format = 'txt'
priority = 20
```
External extractors allow you to use any command-line tool to process files:
- Use `{input}` as a placeholder for the input file path
- Use `{output}` as a placeholder for the output file path
- Use `temp_format` to define the temporary file extension if needed

Chain extractors link multiple processing steps together to handle complex formats.
```toml
[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 5
steps = [
  { command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
  { use = 'readability', input = '{output}.html' }
]
```
Chain extractors can mix external commands and built-in extractors, passing each step's output to the next, as the example above shows.
You can specify which extractor to use for each media type:
```toml
[[media_types]]
name = 'books'
patterns = ['.*\.(epub|pdf|mobi|azw|azw3|cbz|cbr|fb2|lit)$']
tool = 'aria2c'
tool_path = '/usr/bin/aria2c'
arguments = '...'
extractor = 'pdftotext'  # Specify which extractor to use
```
This lets you route each media type to a specific extractor rather than relying solely on file extension matching and priority.
The extraction system supports three output formats:
Plain text format provides the cleanest, most readable version of the content; it is well suited to reading, full-text search, and lightweight text processing.

JSON format contains the complete extracted content with all metadata in a structured format; it is well suited to programmatic processing and integration with other tools.

HTML format provides a clean, styled version of the content for web viewing, such as reading extracted articles directly in a browser.
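If you do not need all three formats, you can limit `output_formats` in the `[extraction]` section. For example, a sketch that skips HTML generation:

```toml
[extraction]
enabled = true
output_formats = ['text', 'json']   # omit 'html' to skip the styled HTML output
```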
In addition to the automatic extraction via hooks, you can manually extract content using the command line:
```bash
# Process a single file
gatherhub extract-content -input downloads/html/example.html -output downloads/extracted -format text

# Process all HTML files in a directory
gatherhub extract-content -input downloads/html -recursive -format json

# Process a file and output in all formats (uses the default formats)
gatherhub extract-content -input downloads/html/example.html -output downloads/extracted
```
| Option | Description | Default |
|---|---|---|
| `-input` | Path to input file or directory | (Required) |
| `-output` | Output directory for extracted content | `./downloads/extracted` |
| `-format` | Output format (`text`, `json`, `html`) | `text` |
| `-recursive` | Process directories recursively | `false` |
| `-help` | Show help | `false` |
You can extend the system with your own extractors without modifying the code:
For tools like Calibre's ebook-convert:
```toml
[[extractors]]
name = 'ebook-convert'
extensions = ['.epub', '.mobi']
type = 'external'
command = '/usr/bin/ebook-convert'
arguments = '"{input}" "{output}.txt"'
temp_format = 'txt'
priority = 15
```
For complex multi-step processing:
```toml
[[extractors]]
name = 'docx-to-pdf-to-text'
extensions = ['.docx']
type = 'chain'
priority = 5
steps = [
  { command = '/usr/bin/libreoffice', args = '--convert-to pdf "{input}" --outdir "{output}"' },
  { command = '/usr/bin/pdftotext', args = '"{output}.pdf" "{output}.txt"' }
]
```
You can define multiple extractors for the same file type with different priorities to create fallback chains:
```toml
# Primary PDF extractor
[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
priority = 20

# Fallback PDF extractor
[[extractors]]
name = 'alternative-pdf'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/alternate-pdf-tool'
arguments = '"{input}" -o "{output}"'
priority = 10
```
With this configuration, if the first extractor fails, the system will automatically try the second one.
The system extracts and preserves various metadata from documents, including the title, word count, content type, source path, and the extractor that produced the output (see the JSON example above).
Content extraction is valuable for many scenarios, from distraction-free reading and offline archiving to full-text search and content analysis.
If a downloaded file is never extracted, check that its media type is listed in `supported_types`.
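For example, a sketch that restricts extraction to HTML and document downloads (the list of types is illustrative; adjust it to the media types you actually configure):

```toml
[extraction]
enabled = true
supported_types = ['html', 'documents']   # other media types are downloaded but not extracted
```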