FIND-20260324-024 — extractous — Fast Rust unstructured data extraction (25x faster than unstructured-io)

extractous is a Rust-native library for extracting text and metadata from unstructured documents (PDF, Word, HTML, images; 15+ formats). The core is written in Rust (69% of the codebase) for memory safety and performance. Apache Tika, integrated via GraalVM, extends format coverage, and Tesseract OCR handles images and scanned documents. Reported benchmarks put it at 25x faster than the popular unstructured-io Python library, using 11x less memory. Python bindings are available; JavaScript/TypeScript bindings are planned. Licensed Apache-2.0. The project was featured by @tom_doerr on X today and has 1.7k GitHub stars. Last commit: December 2024 (15 months ago). Maintenance is stale, but the library appears functionally stable.

Directly relevant to PDF Engine (text and metadata extraction from uploaded documents), DocStore (content indexing for search), and Form Engine (extracting structured data from scanned or uploaded form images). It could replace or complement any Apache Tika Java dependency currently used for document parsing, with a significantly lower memory footprint, which matters for ODS's constrained GCP e2-standard-4 VPS fleet. The Rust-native core aligns with the ODS stack preference for Rust microservices. The Python bindings also enable use in data platform scripts and Metabase/ClickHouse ingestion pipelines.
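
A minimal sketch of how the Python bindings might slot into a DocStore-style indexing step, with the extraction backend kept swappable so it can sit alongside an existing Tika-based parser. The names build_record, IndexRecord, and extractous_backend are hypothetical and not part of any ODS codebase. The extractous call assumes the Extractor.extract_file_to_string method from the project's README; the return shape has differed between releases, so both shapes are handled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# A backend takes a file path and returns (text, metadata).
ExtractFn = Callable[[str], Tuple[str, Dict[str, object]]]


def extractous_backend(path: str) -> Tuple[str, Dict[str, object]]:
    """Extract text and metadata with extractous (imported lazily)."""
    # Third-party dependency: pip install extractous
    from extractous import Extractor

    result = Extractor().extract_file_to_string(path)
    # Assumption: recent releases return (text, metadata), while older
    # ones returned only the text, so both shapes are accepted here.
    if isinstance(result, tuple):
        text, metadata = result
        return text, dict(metadata)
    return result, {}


@dataclass
class IndexRecord:
    """Hypothetical shape of one search-index entry."""
    path: str
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def build_record(path: str, extract: ExtractFn = extractous_backend) -> IndexRecord:
    """Run the chosen backend and normalise its output for indexing."""
    text, metadata = extract(path)
    # Collapse runs of whitespace so the indexed text stays compact.
    return IndexRecord(path=path, text=" ".join(text.split()), metadata=metadata)
```

Calling build_record("report.pdf") would use extractous, while passing extract=some_other_backend would route the same file through a different parser. That makes a side-by-side comparison with the current Tika path possible before committing to a replacement.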

Source

ODS Impact

Security Review

Tags