Content Extraction

The content extraction feature helps you extract clean, readable content from HTML and other document types, removing clutter, navigation elements, and ads for a better reading experience and more effective indexing.

Overview

Many web pages include a significant amount of content that isn't part of the main article or document - things like navigation menus, advertisements, sidebars, footers, and other distractions. The content extraction feature in GatherHub uses advanced algorithms and flexible processing pipelines to identify and extract just the meaningful content, greatly improving readability and making the content more usable for search and analysis.

Supported Document Types

GatherHub's content extraction feature supports:

  • HTML/HTM: Full support with go-readability for extracting article content
  • TXT: Basic processing and word count
  • PDF: Support via configurable extractors (can use multiple approaches)
  • DOCX/DOC: Support via configurable extractors
  • EPUB: Support via configurable extractors
  • CBZ/CBR: Comic book archives with OCR text extraction (requires Tesseract)
  • Images: JPG, PNG, GIF, BMP, TIFF, WebP, SVG, HEIF, HEIC with OCR text extraction (requires Tesseract)
  • Any Format: Support for any format can be added with external tools or processing chains

Extensible System: The extraction system is highly extensible and can support any document format by configuring appropriate extractors.

How It Works

The content extraction process follows these steps:

  1. When a file is downloaded (or processed manually), GatherHub identifies its format based on the file extension
  2. The system finds all matching extractors for that extension
  3. Extractors are prioritized based on their configured priority values
  4. Each extractor is tried in order until one succeeds (see the sketch after this list)
  5. The extracted content is saved in all configured output formats to the specified output directory
  6. Metadata such as title, author, word count, and publish date is preserved when available
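
For example, here is a minimal sketch of how steps 2-4 play out when two extractors both handle .pdf files: the higher-priority entry is tried first, and the built-in extractor serves as a fallback if the external tool fails (the names and paths are taken from examples elsewhere on this page and are illustrative):

[[extractors]]
name = 'pdftotext'        # external tool, tried first for .pdf files
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
priority = 20

[[extractors]]
name = 'pdf'              # built-in basic PDF handling, used only if pdftotext fails
extensions = ['.pdf']
type = 'internal'
priority = 5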

Configuration

Content extraction is configured in the config.toml file with three main sections:

# Global extraction settings
[extraction]
enabled = true
output_dir = './downloads/extracted'
output_formats = ['text', 'json', 'html']
include_metadata = true
supported_types = ['html', 'books', 'documents']

# Individual extractors
[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10

# Hook configuration
[[event_hooks.hooks]]
event = 'post_download'
script = 'extract_content'
enabled = true
arguments = { output_formats = ["text", "json", "html"], output_dir = "./downloads/extracted" }

Global Extraction Settings

Option              Description                                            Default
enabled             Enable or disable content extraction                   true
output_dir          Directory where extracted content is saved             "./downloads/extracted"
output_formats      Array of output formats (text, json, html)             ["text", "json"]
include_metadata    Whether to include metadata in the output              true
supported_types     Media types that should be processed for extraction    ["html", "books", "documents"]

Extractor Configuration

Extractors define how content is processed for specific file types:

Option      Description                                        Example
name        Unique identifier for the extractor                readability
extensions  File extensions this extractor handles             [".html", ".htm"]
type        Extractor type: internal, external, or chain       internal
priority    Priority value (higher values tried first)         10
command     External command to execute (for external type)    /usr/bin/pdftotext
arguments   Command arguments with placeholders                -layout "{input}" "{output}"
steps       Processing steps for chain extractors              [See chain extractor example]

Hook Configuration

The hook configuration specifies when extraction is automatically triggered:

Option     Description                                   Default
event      The event that triggers content extraction    'post_download'
script     Hook script name                              'extract_content'
enabled    Enable or disable the hook                    true
arguments  Hook-specific configuration options           { output_formats = ["text", "json"] }

OCR Configuration for Comic Books

Comic book files (CBZ and CBR) support OCR text extraction using Tesseract. This feature extracts text from comic book pages, making the content searchable and indexable.

Requirement: Tesseract OCR must be installed for comic book text extraction to work. See the Installation page for setup instructions.
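
For a quick start, Tesseract can typically be installed from your system's package manager; the commands below are common examples (package names assumed for Debian/Ubuntu and Homebrew):

# Debian/Ubuntu
sudo apt install tesseract-ocr

# macOS (Homebrew)
brew install tesseract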

OCR Configuration Options

OCR behavior is configured in the [extraction.ocr] section of config.toml:

[extraction.ocr]
enabled = true
max_pages = 15
timeout_secs = 30

Option        Description                                      Default
enabled       Enable or disable OCR text extraction            true
max_pages     Maximum number of pages to process with OCR      15
timeout_secs  Timeout in seconds for OCR processing per page   30

How Comic Book OCR Works

  1. The comic book archive (CBZ/CBR) is extracted to a temporary directory
  2. Image files are identified and sorted for consistent page ordering
  3. Tesseract OCR is run on each image, up to the max_pages limit (a manual equivalent is sketched after this list)
  4. Extracted text is combined with page numbers and metadata
  5. The final output includes both the extracted text and information about processing limits
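
Conceptually, the per-page OCR step is close to unpacking the archive and running Tesseract on each page image yourself. A rough manual equivalent, assuming a CBZ archive named comic.cbz containing English text (file names are illustrative):

# CBZ files are ZIP archives; unpack and OCR a single page to stdout
unzip comic.cbz -d /tmp/comic-pages
tesseract /tmp/comic-pages/page_001.png stdout -l eng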

OCR Output Example

When OCR is successful, the extracted content will include:

Comic book with 24 pages:

Page 1:
[No text detected]

Page 2:
KNOWN
ONLY AS CABLE!

... (more pages)

... and 9 more pages (OCR limited to first 15 pages)

Troubleshooting OCR

  • "OCR text extraction requires tesseract to be installed": Install Tesseract OCR using your system's package manager
  • "OCR is disabled in configuration": Set enabled = true in the [extraction.ocr] section
  • "[OCR extraction failed]": Check that the image files are valid and Tesseract can process them
  • Low text quality: Some comic book images may have low OCR accuracy due to artistic fonts or image quality
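
To confirm that Tesseract is installed, reachable on your PATH, and has the language data you need, you can run:

tesseract --version
tesseract --list-langs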

OCR for Standalone Images

In addition to comic book archives, GatherHub supports OCR text extraction from standalone image files. This feature allows you to extract text from screenshots, scanned documents, photos of text, infographics, and any other images containing readable text.

Supported Formats: JPG, JPEG, PNG, GIF, BMP, TIFF, TIF, WebP, SVG, HEIF, HEIC

Image OCR Configuration

Image OCR uses the same configuration as comic book OCR in the [extraction.ocr] section:

[[extractors]]
name = 'image'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp', '.svg', '.heif', '.heic']
type = 'internal'
priority = 20

# OCR configuration applies to both comic books and images
[extraction.ocr]
enabled = true
max_pages = 15      # For images, this is effectively 1
timeout_secs = 30   # Timeout per image

How Image OCR Works

  1. When an image file is processed, the system checks if OCR is enabled
  2. If Tesseract is available, it runs OCR on the image file directly
  3. Extracted text is returned with metadata about the extraction process
  4. The result includes the image filename as the title, along with a word count (a manual invocation is shown below)
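
Image OCR can also be triggered manually with the extract-content command described under Command Line Usage below. A sketch, assuming the image extractor above is enabled and using an example file path:

# OCR a single screenshot and write JSON output
gatherhub extract-content -input ./screenshots/dashboard.png -output downloads/extracted -format json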

Image OCR Output Example

When processing a screenshot or document image, the output might look like:

{
  "Title": "dashboard",
  "PlainText": "Welcome to GatherHub, a content downloader for offline data horders.\nJob Status: Pending, Downloading, Completed, Failed\nMedia Types: audio, books, documents...",
  "Metadata": {
    "Extractor": "image+ocr",
    "OCR_Engine": "tesseract",
    "Length": "66 words"
  },
  "WordCount": 66,
  "SourcePath": "./screenshots/dashboard.png",
  "ContentType": "image/png"
}

Use Cases for Image OCR

  • Screenshots: Extract text from application screenshots for documentation
  • Scanned Documents: Convert scanned papers or receipts to searchable text
  • Photos of Text: Extract text from photos of whiteboards, signs, or documents
  • Infographics: Extract textual content from charts and infographics
  • Memes and Social Media: Extract text from image-based content
  • Technical Diagrams: Extract labels and annotations from technical drawings

Alternative Image Extractor Examples

The configuration file includes commented examples showing alternative approaches to image processing:

External Tesseract Extractor

For direct control over Tesseract parameters:

# Example: Alternative external image extractor using tesseract directly
[[extractors]]
name = 'image-external'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp']
type = 'external'
command = '/usr/bin/tesseract'
arguments = '{input} {output} -l eng'
priority = 15

Image Enhancement Chain

For improved OCR results through image preprocessing:

# Example: Chain extractor for image preprocessing + OCR
[[extractors]]
name = 'image-chain'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp']
type = 'chain'
priority = 25
steps = [
  { command = '/usr/bin/convert', args = '{input} -enhance -sharpen 0x1 {output}.enhanced.png' },
  { command = '/usr/bin/tesseract', args = '{output}.enhanced.png {output} -l eng' }
]

These examples demonstrate how you can:

  • Customize OCR settings: Specify OCR languages with Tesseract's -l flag (see the multi-language sketch after this list)
  • Enhance image quality: Use ImageMagick to improve images before OCR
  • Create processing pipelines: Chain multiple tools for better results
  • Override defaults: Use higher priority values to prefer custom extractors
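
As one illustration, the external approach can be adapted for multi-language OCR. The sketch below assumes the corresponding Tesseract language data (here eng and fra) is installed; the extractor name is made up for this example:

[[extractors]]
name = 'image-multilang'
extensions = ['.jpg', '.jpeg', '.png', '.tiff', '.tif']
type = 'external'
command = '/usr/bin/tesseract'
arguments = '{input} {output} -l eng+fra'   # combine Tesseract languages with '+'
priority = 30   # higher than the built-in image extractor (priority 20), so it is tried first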

Extractor Types

The system supports three types of extractors:

1. Internal Extractors

Built-in code that processes specific file types.

[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10

Built-in extractors include:

  • readability: HTML extraction using go-readability
  • plaintext: Plain text file processing
  • pdf: Basic PDF handling
  • docx: Document handling
  • epub: Ebook handling
  • cbz: Comic book ZIP archives with OCR text extraction
  • cbr: Comic book RAR archives with OCR text extraction

2. External Extractors

Use external command-line tools for extraction.

[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
temp_format = 'txt'
priority = 20

External extractors allow you to use any command-line tool to process files:

  • Use {input} as a placeholder for the input file path
  • Use {output} as a placeholder for the output file path
  • Set temp_format to define the temporary file extension if needed

3. Chain Extractors

Link multiple processing steps together to handle complex formats.

[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 5
steps = [
  { command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
  { use = 'readability', input = '{output}.html' }
]

Chain extractors can:

  • Convert between formats using multiple tools
  • Execute a sequence of processing steps
  • Combine external commands with internal extractors
  • Pass the output of one step as input to the next

Media Type Integration

You can specify which extractor to use for each media type:

[[media_types]]
name = 'books'
patterns = ['.*\.(epub|pdf|mobi|azw|azw3|cbz|cbr|fb2|lit)$']
tool = 'aria2c'
tool_path = '/usr/bin/aria2c'
arguments = '...'
extractor = 'pdftotext'  # Specify which extractor to use

This allows you to:

  • Override the default extractor selection
  • Apply specific extractors to certain media types
  • Create media-type-specific processing pipelines

Output Formats

The extraction system supports three output formats:

Text Format

Plain text format provides the cleanest, most readable version of the content. This is ideal for:

  • Reading without distractions
  • Indexing by search systems
  • Analysis by text processing tools
  • Storage efficiency

JSON Format

JSON format contains the complete extracted content with all metadata in a structured format. This is ideal for:

  • Integration with other applications
  • Programmatic processing
  • Advanced filtering and search
  • Preserving all available metadata

HTML Format

HTML format provides a clean, styled version of the content for web viewing. This is ideal for:

  • Preserving document structure
  • Viewing in a web browser
  • Sharing readable content
  • Retaining basic formatting

Command Line Usage

In addition to the automatic extraction via hooks, you can manually extract content using the command line:

# Process a single file
gatherhub extract-content -input downloads/html/example.html -output downloads/extracted -format text

# Process all HTML files in a directory
gatherhub extract-content -input downloads/html -recursive -format json

# Process a file and output in all formats (uses the default formats)
gatherhub extract-content -input downloads/html/example.html -output downloads/extracted

Command Line Options

Option      Description                               Default
-input      Path to input file or directory           (Required)
-output     Output directory for extracted content    ./downloads/extracted
-format     Output format (text, json, html)          text
-recursive  Process directories recursively           false
-help       Show help                                 false

Adding Custom Extractors

You can extend the system with your own extractors without modifying the code:

Adding an External Tool Extractor

For tools like Calibre's ebook-convert:

[[extractors]]
name = 'ebook-convert'
extensions = ['.epub', '.mobi']
type = 'external'
command = '/usr/bin/ebook-convert'
arguments = '"{input}" "{output}.txt"'
temp_format = 'txt'
priority = 15

Creating a Processing Chain

For complex multi-step processing:

[[extractors]]
name = 'docx-to-pdf-to-text'
extensions = ['.docx']
type = 'chain'
priority = 5
steps = [
  { command = '/usr/bin/libreoffice', args = '--convert-to pdf "{input}" --outdir "{output}"' },
  { command = '/usr/bin/pdftotext', args = '"{output}.pdf" "{output}.txt"' }
]

Fallback Extractors

You can define multiple extractors for the same file type with different priorities to create fallback chains:

# Primary PDF extractor
[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
priority = 20

# Fallback PDF extractor
[[extractors]]
name = 'alternative-pdf'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/alternate-pdf-tool'
arguments = '"{input}" -o "{output}"'
priority = 10

With this configuration, if the first extractor fails, the system will automatically try the second one.

Extracted Metadata

The system extracts and preserves various metadata from documents (a sketch of how these fields can appear in JSON output follows the list):

  • Title: Document title or article headline
  • Word Count: Number of words in the content
  • Author/Byline: When available in the document
  • Publication Date: When available in the document
  • Last Modified Date: When available in the document
  • Source: Original source domain or site name
  • Excerpt: Brief summary or introduction when available
  • Extractor: Which extractor was used to process the content
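
As a rough illustration, these fields can appear in the JSON output along the following lines. The top-level keys mirror the image OCR example shown earlier; the Byline, PublishDate, and Excerpt entries are hypothetical and depend on what the source document actually provides:

{
  "Title": "Example Article Title",
  "Metadata": {
    "Extractor": "readability",
    "Byline": "Jane Doe",
    "PublishDate": "2024-01-15",
    "Excerpt": "A short introduction to the article..."
  },
  "WordCount": 1250,
  "SourcePath": "./downloads/html/example.html",
  "ContentType": "text/html"
}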

Use Cases

Content extraction is valuable for many scenarios:

  • Offline Reading: Extract clean, readable content for distraction-free reading
  • Research: Collect and organize article content from multiple sources
  • Archiving: Preserve the meaningful parts of web content in a compact format
  • Indexing: Create a searchable database of extracted content
  • Analysis: Process web content with text analysis tools
  • Content Feeds: Create clean RSS or syndication feeds from web content
  • Document Conversion: Convert between formats using processing chains

Troubleshooting

Debugging Tips: If extraction is not working as expected:
  • Check if the file's media type is in supported_types
  • Verify there's an extractor configured for the file extension
  • Ensure external tools are installed and accessible
  • Check the log files for error messages
  • Try running the command manually to see the output
  • For external extractors, test the command directly from the command line (see the example below)
  • For chain extractors, check each step individually
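
For example, to verify a pdftotext extractor configuration outside of GatherHub, run the same command with the placeholders filled in by hand (the file paths here are examples):

/usr/bin/pdftotext -layout downloads/documents/report.pdf /tmp/report.txt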