Content Extraction

The content extraction feature helps you extract clean, readable content from HTML and other document types, removing clutter, navigation elements, and ads for a better reading experience and more effective indexing.

Overview

Many web pages include a significant amount of content that isn't part of the main article or document - things like navigation menus, advertisements, sidebars, footers, and other distractions. The content extraction feature in GatherHub uses advanced algorithms and flexible processing pipelines to identify and extract just the meaningful content, greatly improving readability and making the content more usable for search and analysis.

Supported Document Types

GatherHub's content extraction feature supports:

  • HTML/HTM: Full support with go-readability for extracting article content
  • TXT: Basic processing and word count
  • PDF: Support via configurable extractors (can use multiple approaches)
  • DOCX/DOC: Support via configurable extractors
  • EPUB: Support via configurable extractors
  • CBZ/CBR: Comic book archives with OCR text extraction (requires Tesseract)
  • Images: JPG, PNG, GIF, BMP, TIFF, WebP, SVG, HEIF, HEIC with OCR text extraction (requires Tesseract)
  • Any Format: Support for any format can be added with external tools or processing chains

Extensible System: The extraction system is highly extensible and can support any document format by configuring appropriate extractors.

How It Works

The content extraction process follows these steps:

  1. When a file is downloaded (or processed manually), GatherHub identifies its format based on the file extension
  2. The system finds all matching extractors for that extension
  3. Extractors are prioritized based on their configured priority values
  4. Each extractor is tried in order until one succeeds (see the sketch after this list)
  5. The extracted content is saved in all configured output formats to the specified output directory
  6. Metadata such as title, author, word count, and publish date is preserved when available
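
For example, here is a minimal sketch of how steps 2-4 play out when two extractors both handle .pdf files: the higher-priority entry is tried first, and the built-in extractor serves as a fallback if the external tool fails (the names and paths are taken from examples elsewhere on this page and are illustrative):

[[extractors]]
name = 'pdftotext'        # external tool, tried first for .pdf files
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
priority = 20

[[extractors]]
name = 'pdf'              # built-in basic PDF handling, used only if pdftotext fails
extensions = ['.pdf']
type = 'internal'
priority = 5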

Configuration

Content extraction is configured in the config.toml file with three main sections:

# Global extraction settings
[extraction]
enabled = true
output_dir = './downloads/extracted'
output_formats = ['text', 'json', 'html']
include_metadata = true
supported_types = ['html', 'books', 'documents']

# Individual extractors
[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10

# Hook configuration
[[event_hooks.hooks]]
event = 'post_download'
script = 'extract_content'
enabled = true
arguments = { output_formats = ["text", "json", "html"], output_dir = "./downloads/extracted" }

Global Extraction Settings

Option              Description                                            Default
enabled             Enable or disable content extraction                   true
output_dir          Directory where extracted content is saved             "./downloads/extracted"
output_formats      Array of output formats (text, json, html)             ["text", "json"]
include_metadata    Whether to include metadata in the output              true
supported_types     Media types that should be processed for extraction    ["html", "books", "documents"]

Extractor Configuration

Extractors define how content is processed for specific file types:

Option      Description                                        Example
name        Unique identifier for the extractor                readability
extensions  File extensions this extractor handles             [".html", ".htm"]
type        Extractor type: internal, external, or chain       internal
priority    Priority value (higher values tried first)         10
command     External command to execute (for external type)    /usr/bin/pdftotext
arguments   Command arguments with placeholders                -layout "{input}" "{output}"
steps       Processing steps for chain extractors              [See chain extractor example]

Hook Configuration

The hook configuration specifies when extraction is automatically triggered:

Option     Description                                   Default
event      The event that triggers content extraction    'post_download'
script     Hook script name                              'extract_content'
enabled    Enable or disable the hook                    true
arguments  Hook-specific configuration options           { output_formats = ["text", "json"] }

OCR Configuration for Comic Books

Comic book files (CBZ and CBR) support OCR text extraction using Tesseract. This feature extracts text from comic book pages, making the content searchable and indexable.

Requirement: Tesseract OCR must be installed for comic book text extraction to work. See the Installation page for setup instructions.
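
For a quick start, Tesseract can typically be installed from your system's package manager; the commands below are common examples (package names assumed for Debian/Ubuntu and Homebrew):

# Debian/Ubuntu
sudo apt install tesseract-ocr

# macOS (Homebrew)
brew install tesseract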

OCR Configuration Options

OCR behavior is configured in the [extraction.ocr] section of config.toml:

[extraction.ocr]
enabled = true
max_pages = 15
timeout_secs = 30

Option        Description                                      Default
enabled       Enable or disable OCR text extraction            true
max_pages     Maximum number of pages to process with OCR      15
timeout_secs  Timeout in seconds for OCR processing per page   30

How Comic Book OCR Works

  1. The comic book archive (CBZ/CBR) is extracted to a temporary directory
  2. Image files are identified and sorted for consistent page ordering
  3. Tesseract OCR is run on each image, up to the max_pages limit (a manual equivalent is sketched after this list)
  4. Extracted text is combined with page numbers and metadata
  5. The final output includes both the extracted text and information about processing limits
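
Conceptually, the per-page OCR step is close to unpacking the archive and running Tesseract on each page image yourself. A rough manual equivalent, assuming a CBZ archive named comic.cbz containing English text (file names are illustrative):

# CBZ files are ZIP archives; unpack and OCR a single page to stdout
unzip comic.cbz -d /tmp/comic-pages
tesseract /tmp/comic-pages/page_001.png stdout -l eng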

OCR Output Example

When OCR is successful, the extracted content will include:

Comic book with 24 pages:

Page 1:
[No text detected]

Page 2:
KNOWN
ONLY AS CABLE!

... (more pages)

... and 9 more pages (OCR limited to first 15 pages)

Troubleshooting OCR

  • "OCR text extraction requires tesseract to be installed": Install Tesseract OCR using your system's package manager
  • "OCR is disabled in configuration": Set enabled = true in the [extraction.ocr] section
  • "[OCR extraction failed]": Check that the image files are valid and Tesseract can process them
  • Low text quality: Some comic book images may have low OCR accuracy due to artistic fonts or image quality
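
To confirm that Tesseract is installed, reachable on your PATH, and has the language data you need, you can run:

tesseract --version
tesseract --list-langs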

OCR for Standalone Images

In addition to comic book archives, GatherHub supports OCR text extraction from standalone image files. This feature allows you to extract text from screenshots, scanned documents, photos of text, infographics, and any other images containing readable text.

Supported Formats: JPG, JPEG, PNG, GIF, BMP, TIFF, TIF, WebP, SVG, HEIF, HEIC

Image OCR Configuration

Image OCR uses the same configuration as comic book OCR in the [extraction.ocr] section:

[[extractors]]
name = 'image'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp', '.svg', '.heif', '.heic']
type = 'internal'
priority = 20

# OCR configuration applies to both comic books and images
[extraction.ocr]
enabled = true
max_pages = 15      # For images, this is effectively 1
timeout_secs = 30   # Timeout per image

How Image OCR Works

  1. When an image file is processed, the system checks if OCR is enabled
  2. If Tesseract is available, it runs OCR on the image file directly
  3. Extracted text is returned with metadata about the extraction process
  4. The result includes the image filename as the title, along with a word count (a manual invocation is shown below)
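
Image OCR can also be triggered manually with the extract-content command described under Command Line Usage below. A sketch, assuming the image extractor above is enabled and using an example file path:

# OCR a single screenshot and write JSON output
gatherhub extract-content -input ./screenshots/dashboard.png -output downloads/extracted -format json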

Image OCR Output Example

When processing a screenshot or document image, the output might look like:

{
  "Title": "dashboard",
  "PlainText": "Welcome to GatherHub, a content downloader for offline data horders.\nJob Status: Pending, Downloading, Completed, Failed\nMedia Types: audio, books, documents...",
  "Metadata": {
    "Extractor": "image+ocr",
    "OCR_Engine": "tesseract",
    "Length": "66 words"
  },
  "WordCount": 66,
  "SourcePath": "./screenshots/dashboard.png",
  "ContentType": "image/png"
}

Use Cases for Image OCR

  • Screenshots: Extract text from application screenshots for documentation
  • Scanned Documents: Convert scanned papers or receipts to searchable text
  • Photos of Text: Extract text from photos of whiteboards, signs, or documents
  • Infographics: Extract textual content from charts and infographics
  • Memes and Social Media: Extract text from image-based content
  • Technical Diagrams: Extract labels and annotations from technical drawings

Alternative Image Extractor Examples

The configuration file includes commented examples showing alternative approaches to image processing:

External Tesseract Extractor

For direct control over Tesseract parameters:

# Example: Alternative external image extractor using tesseract directly
[[extractors]]
name = 'image-external'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp']
type = 'external'
command = '/usr/bin/tesseract'
arguments = '{input} {output} -l eng'
priority = 15

Image Enhancement Chain

For improved OCR results through image preprocessing:

# Example: Chain extractor for image preprocessing + OCR
[[extractors]]
name = 'image-chain'
extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif', '.webp']
type = 'chain'
priority = 25
steps = [
  { command = '/usr/bin/convert', args = '{input} -enhance -sharpen 0x1 {output}.enhanced.png' },
  { command = '/usr/bin/tesseract', args = '{output}.enhanced.png {output} -l eng' }
]

These examples demonstrate how you can:

  • Customize OCR settings: Specify OCR languages with Tesseract's -l flag (see the multi-language sketch after this list)
  • Enhance image quality: Use ImageMagick to improve images before OCR
  • Create processing pipelines: Chain multiple tools for better results
  • Override defaults: Use higher priority values to prefer custom extractors
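
As one illustration, the external approach can be adapted for multi-language OCR. The sketch below assumes the corresponding Tesseract language data (here eng and fra) is installed; the extractor name is made up for this example:

[[extractors]]
name = 'image-multilang'
extensions = ['.jpg', '.jpeg', '.png', '.tiff', '.tif']
type = 'external'
command = '/usr/bin/tesseract'
arguments = '{input} {output} -l eng+fra'   # combine Tesseract languages with '+'
priority = 30   # higher than the built-in image extractor (priority 20), so it is tried first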

Extractor Types

The system supports three types of extractors:

1. Internal Extractors

Built-in code that processes specific file types.

[[extractors]]
name = 'readability'
extensions = ['.html', '.htm']
type = 'internal'
priority = 10

Built-in extractors include:

  • readability: HTML extraction using go-readability
  • plaintext: Plain text file processing
  • pdf: Basic PDF handling
  • docx: Document handling
  • epub: Ebook handling
  • cbz: Comic book ZIP archives with OCR text extraction
  • cbr: Comic book RAR archives with OCR text extraction

2. External Extractors

Use external command-line tools for extraction.

[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
temp_format = 'txt'
priority = 20

External extractors allow you to use any command-line tool to process files:

  • Use {input} as a placeholder for the input file path
  • Use {output} as a placeholder for the output file path
  • Set temp_format to define the temporary file extension if needed

3. Chain Extractors

Link multiple processing steps together to handle complex formats.

[[extractors]]
name = 'pdf-to-html-to-readability'
extensions = ['.pdf']
type = 'chain'
priority = 5
steps = [
  { command = '/usr/bin/pdf2htmlEX', args = '"{input}" "{output}.html"' },
  { use = 'readability', input = '{output}.html' }
]

Chain extractors can:

  • Convert between formats using multiple tools
  • Execute a sequence of processing steps
  • Combine external commands with internal extractors
  • Pass the output of one step as input to the next

Media Type Integration

You can specify which extractor to use for each media type:

[[media_types]]
name = 'books'
patterns = ['.*\.(epub|pdf|mobi|azw|azw3|cbz|cbr|fb2|lit)$']
tool = 'aria2c'
tool_path = '/usr/bin/aria2c'
arguments = '...'
extractor = 'pdftotext'  # Specify which extractor to use

This allows you to:

  • Override the default extractor selection
  • Apply specific extractors to certain media types
  • Create media-type-specific processing pipelines

Output Formats

The extraction system supports three output formats:

Text Format

Plain text format provides the cleanest, most readable version of the content. This is ideal for:

  • Reading without distractions
  • Indexing by search systems
  • Analysis by text processing tools
  • Storage efficiency

JSON Format

JSON format contains the complete extracted content with all metadata in a structured format. This is ideal for:

  • Integration with other applications
  • Programmatic processing
  • Advanced filtering and search
  • Preserving all available metadata

HTML Format

HTML format provides a clean, styled version of the content for web viewing. This is ideal for:

  • Preserving document structure
  • Viewing in a web browser
  • Sharing readable content
  • Retaining basic formatting

Command Line Usage

In addition to the automatic extraction via hooks, you can manually extract content using the command line:

# Process a single file
gatherhub extract-content -input downloads/html/example.html -output downloads/extracted -format text

# Process all HTML files in a directory
gatherhub extract-content -input downloads/html -recursive -format json

# Process a file and output in all formats (uses the default formats)
gatherhub extract-content -input downloads/html/example.html -output downloads/extracted

Command Line Options

Option      Description                               Default
-input      Path to input file or directory           (Required)
-output     Output directory for extracted content    ./downloads/extracted
-format     Output format (text, json, html)          text
-recursive  Process directories recursively           false
-help       Show help                                 false

Adding Custom Extractors

You can extend the system with your own extractors without modifying the code:

Adding an External Tool Extractor

For tools like Calibre's ebook-convert:

[[extractors]]
name = 'ebook-convert'
extensions = ['.epub', '.mobi']
type = 'external'
command = '/usr/bin/ebook-convert'
arguments = '"{input}" "{output}.txt"'
temp_format = 'txt'
priority = 15

Creating a Processing Chain

For complex multi-step processing:

[[extractors]]
name = 'docx-to-pdf-to-text'
extensions = ['.docx']
type = 'chain'
priority = 5
steps = [
  { command = '/usr/bin/libreoffice', args = '--convert-to pdf "{input}" --outdir "{output}"' },
  { command = '/usr/bin/pdftotext', args = '"{output}.pdf" "{output}.txt"' }
]

Fallback Extractors

You can define multiple extractors for the same file type with different priorities to create fallback chains:

# Primary PDF extractor
[[extractors]]
name = 'pdftotext'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/pdftotext'
arguments = '-layout "{input}" "{output}"'
priority = 20

# Fallback PDF extractor
[[extractors]]
name = 'alternative-pdf'
extensions = ['.pdf']
type = 'external'
command = '/usr/bin/alternate-pdf-tool'
arguments = '"{input}" -o "{output}"'
priority = 10

With this configuration, if the first extractor fails, the system will automatically try the second one.

Extracted Metadata

The system extracts and preserves various metadata from documents (a sketch of how these fields can appear in JSON output follows the list):

  • Title: Document title or article headline
  • Word Count: Number of words in the content
  • Author/Byline: When available in the document
  • Publication Date: When available in the document
  • Last Modified Date: When available in the document
  • Source: Original source domain or site name
  • Excerpt: Brief summary or introduction when available
  • Extractor: Which extractor was used to process the content
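
As a rough illustration, these fields can appear in the JSON output along the following lines. The top-level keys mirror the image OCR example shown earlier; the Byline, PublishDate, and Excerpt entries are hypothetical and depend on what the source document actually provides:

{
  "Title": "Example Article Title",
  "Metadata": {
    "Extractor": "readability",
    "Byline": "Jane Doe",
    "PublishDate": "2024-01-15",
    "Excerpt": "A short introduction to the article..."
  },
  "WordCount": 1250,
  "SourcePath": "./downloads/html/example.html",
  "ContentType": "text/html"
}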

Use Cases

Content extraction is valuable for many scenarios:

  • Offline Reading: Extract clean, readable content for distraction-free reading
  • Research: Collect and organize article content from multiple sources
  • Archiving: Preserve the meaningful parts of web content in a compact format
  • Indexing: Create a searchable database of extracted content
  • Analysis: Process web content with text analysis tools
  • Content Feeds: Create clean RSS or syndication feeds from web content
  • Document Conversion: Convert between formats using processing chains

Troubleshooting

Debugging Tips: If extraction is not working as expected:
  • Check if the file's media type is in supported_types
  • Verify there's an extractor configured for the file extension
  • Ensure external tools are installed and accessible
  • Check the log files for error messages
  • Try running the command manually to see the output
  • For external extractors, test the command directly from the command line (see the example below)
  • For chain extractors, check each step individually
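
For example, to verify a pdftotext extractor configuration outside of GatherHub, run the same command with the placeholders filled in by hand (the file paths here are examples):

/usr/bin/pdftotext -layout downloads/documents/report.pdf /tmp/report.txt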