Extract text from a PDF in Rust

Read all text content from a PDF document. PDFluent preserves reading order and handles multi-column layouts, right-to-left scripts, and CID fonts.

rust
use pdfluent::prelude::*;

fn main() -> Result<()> {
    let doc = PdfDocument::open("document.pdf")?;

    for page in doc.pages() {
        let text = page.text()?;
        println!("--- Page {} ---", page.number());
        println!("{}", text);
    }
    Ok(())
}

Install: cargo add pdfluent@1.0.0-beta.5

Step by step

1. Open the document

Load the PDF. Text extraction works page by page, so memory usage stays low even for large documents.

rust
use pdfluent::prelude::*;

let doc = PdfDocument::open("contract.pdf")?;

2. Extract text from a single page

Access a page by its 1-based index and call text(). The method returns a plain String with words separated by spaces and paragraphs separated by newlines.

rust
let page = doc.page(1)?;
let text = page.text()?;
println!("{}", text);
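
Because the returned string separates paragraphs with newlines, it can be post-processed with the standard library alone. The sketch below is a minimal illustration that assumes the text has already been obtained from page.text(); the helper names paragraphs and word_count are hypothetical and not part of PDFluent.

```rust
/// Split extracted page text into non-empty paragraphs.
/// Assumes paragraphs are newline-separated, as described above.
fn paragraphs(text: &str) -> Vec<&str> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// Count whitespace-separated words in a block of text.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}
```

Calling paragraphs(&text) on the string from the step above yields one slice per paragraph, borrowed from the original text without copying.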

3. Extract text from all pages

Iterate over doc.pages() to process every page. Each call to text() is independent.

rust
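let full_text: String = doc
    .pages()
    // unwrap_or_default() substitutes an empty string for any page that fails to extract
    .map(|p| p.text().unwrap_or_default())
    .collect::<Vec<_>>()
    // separate pages with a blank line
    .join("\n\n");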
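
The unwrap_or_default() call above turns any extraction error into an empty string, which can hide failures. If errors should surface instead, one option is to collect the per-page results into a single Result. The sketch below shows that pattern using only the standard library; join_pages is a hypothetical helper, and its input stands in for doc.pages().map(|p| p.text()).

```rust
/// Join per-page extraction results, stopping at the first error.
/// `results` stands in for `doc.pages().map(|p| p.text())`.
fn join_pages<E>(
    results: impl IntoIterator<Item = Result<String, E>>,
) -> Result<String, E> {
    // Collecting into Result<Vec<_>, E> short-circuits on the first Err.
    let pages = results
        .into_iter()
        .collect::<Result<Vec<String>, E>>()?;
    Ok(pages.join("\n\n"))
}
```

The choice between the two approaches depends on the job: silent defaults suit best-effort indexing, while propagating the error suits pipelines where a missing page must not go unnoticed.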

4. Extract text with layout positions

Use doc.text_with_layout() to get a Vec<TextBlock> at the document level. Each block carries the text, the page number, and the bounding box in PDF points (bottom-left origin).

rust
for block in doc.text_with_layout()? {
    println!(
        "[page {}] [{:.1},{:.1}] {:?}",
        block.page, block.x, block.y, block.text,
    );
}
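
The bottom-left origin matters when block positions are drawn onto a rendered page image, where the origin is usually top-left. The sketch below shows the conversion, assuming block.x and block.y denote the lower-left corner of the box. The BoxPt struct, its width and height fields, and the page height argument are hypothetical stand-ins, since only block.x and block.y appear in the example above.

```rust
/// A bounding box in PDF points, origin at the bottom-left of the page.
/// Hypothetical stand-in for the positional fields of a TextBlock.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BoxPt {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Convert to pixel coordinates with a top-left origin.
/// One default PDF point is 1/72 inch, so the scale factor is dpi / 72.
/// Returns (left, top, width, height) in pixels.
fn to_top_left_pixels(b: BoxPt, page_height_pt: f64, dpi: f64) -> (f64, f64, f64, f64) {
    let scale = dpi / 72.0;
    // The top edge sits at y + height above the page bottom; flip it.
    let top_pt = page_height_pt - (b.y + b.height);
    (b.x * scale, top_pt * scale, b.width * scale, b.height * scale)
}
```

For a US Letter page (792 points tall) rendered at 144 DPI, a box at x = 72, y = 720 with a height of 12 points lands 120 pixels below the top edge of the image.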

Notes and tips

  • PDFluent decodes ToUnicode CMaps and Type1/TrueType encodings automatically.
  • Scanned PDFs with no embedded text return empty strings. Use an OCR step before extraction if needed.
  • Right-to-left text (Arabic, Hebrew) is returned in logical order, not visual order.
  • Ligatures and composed characters are decomposed to their Unicode equivalents where a mapping exists.
  • Page indexing is 1-based throughout the SDK (RFC 0001 §1).
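
The note on scanned PDFs suggests a simple pre-check: pages whose extracted text is empty or whitespace-only are candidates for OCR. The sketch below assumes the page texts have already been collected in document order; pages_needing_ocr is a hypothetical helper, and it returns 1-based page numbers to match the SDK convention.

```rust
/// Return the 1-based numbers of pages whose extracted text is blank.
/// `page_texts` holds one string per page, in document order.
fn pages_needing_ocr(page_texts: &[String]) -> Vec<usize> {
    page_texts
        .iter()
        .enumerate()
        .filter(|(_, text)| text.trim().is_empty())
        .map(|(i, _)| i + 1) // enumerate() is 0-based; page numbers are 1-based
        .collect()
}
```

A genuinely blank page also yields empty text, so the result is best treated as a list of candidates rather than a definitive list of scanned pages.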

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free in safe code, the usual sources of segfaults in PDF parsers.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions