
Extract text page by page from a PDF in Rust

Read the text content of each page as a plain string or as structured spans with font and position data.

rust
use pdfluent::prelude::*;

fn main() -> Result<()> {
    let doc = PdfDocument::open("document.pdf")?;

    for page in doc.pages() {
        let text = page.text()?;
        println!("--- Page {} ---", page.number());
        println!("{}", text);
    }
    Ok(())
}
Install: cargo add pdfluent@1.0.0-beta.5

Step by step

1

Open the document

Open the PDF with PdfDocument::open. Text is then extracted one page at a time.

rust
use pdfluent::prelude::*;

let doc = PdfDocument::open("document.pdf")?;
2

Iterate pages and extract text

doc.pages() returns an iterator of Page<'_>. Each Page has a text() method that returns Result<String>.

rust
for page in doc.pages() {
    let text = page.text()?;
    // text.len() would count bytes; chars().count() counts Unicode scalar values.
    println!("page {}: {} chars", page.number(), text.chars().count());
}
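
Building on the loop above, the per-page strings can also be used to spot pages that yielded no text at all, which usually points to a scanned page. The following is a minimal sketch, not part of PDFluent: `pages_without_text` is a hypothetical helper, and it assumes the strings were collected in page order and that pages are numbered from 1.

```rust
/// Hypothetical helper (not a PDFluent API): returns the 1-based positions
/// of pages whose extracted text is empty or whitespace-only.
/// `page_texts` is assumed to hold one string per page, in page order.
fn pages_without_text(page_texts: &[String]) -> Vec<usize> {
    page_texts
        .iter()
        .enumerate()
        // Treat whitespace-only output the same as an empty string.
        .filter(|(_, text)| text.trim().is_empty())
        // Convert the 0-based slice index to a 1-based page position.
        .map(|(index, _)| index + 1)
        .collect()
}
```

Pages reported by this helper are candidates for OCR rather than text extraction.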
3

Collect into a single String

For downstream processing, join the per-page strings with a blank line between pages.

rust
let combined: String = doc
    .pages()
    // unwrap_or_default() turns a failed extraction into an empty string.
    .map(|p| p.text().unwrap_or_default())
    .collect::<Vec<_>>()
    .join("\n\n");
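
Because `unwrap_or_default()` hides extraction failures, a pipeline that must not lose pages silently may prefer to stop at the first error. The sketch below shows one way to do that with standard Rust only. `join_pages` is a hypothetical helper, written generically over the error type so that it does not depend on any particular PDFluent signature.

```rust
/// Hypothetical helper (not a PDFluent API): joins per-page extraction
/// results into one string, with a blank line between pages. Returns the
/// first error encountered instead of substituting an empty page.
fn join_pages<E>(
    pages: impl IntoIterator<Item = Result<String, E>>,
) -> Result<String, E> {
    // Collecting an iterator of `Result`s into `Result<Vec<_>, _>`
    // short-circuits on the first `Err`.
    let texts = pages.into_iter().collect::<Result<Vec<String>, E>>()?;
    Ok(texts.join("\n\n"))
}
```

Assuming `page.text()` returns a `Result<String>` as described above, a call could look like `join_pages(doc.pages().map(|p| p.text()))?`.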

Notes and tips

  • Text extraction follows the PDF content stream order, which may differ from visual reading order in multi-column layouts. Use extract_spans() and sort by rect position for precise column order.
  • Characters with custom encoding or Type3 fonts may not map cleanly to Unicode. PDFluent uses ToUnicode maps where available.
  • Encrypted PDFs must be opened with Document::open_with_password before text extraction.
  • For scanned PDFs without a text layer, text() returns an empty string. You need OCR for image-based documents.
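
The column-ordering tip can be sketched as follows. This is an illustration under stated assumptions, not PDFluent's own API: the `Span` struct is a hypothetical stand-in for whatever `extract_spans()` returns, the column start positions are supplied by the caller, and coordinates are assumed to be in PDF user space, where a larger `y` is higher on the page.

```rust
/// Hypothetical stand-in for a text span; the real type returned by
/// `extract_spans()` may use different field names.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    x: f32, // left edge of the span's rect
    y: f32, // vertical position; assumed PDF user space (larger = higher)
}

/// Sorts spans into reading order for a layout whose columns start at the
/// given x positions (ascending). Columns are read left to right, and each
/// column is read top to bottom.
fn sort_reading_order(spans: &mut [Span], column_starts: &[f32]) {
    // Index of the last column whose start is at or left of `x`.
    // Spans left of the first column are assigned to column 0.
    let column_of = |x: f32| {
        column_starts
            .iter()
            .filter(|&&start| x >= start)
            .count()
            .saturating_sub(1)
    };
    spans.sort_by(|a, b| {
        column_of(a.x)
            .cmp(&column_of(b.x))
            .then(b.y.total_cmp(&a.y)) // higher y first (top of page)
            .then(a.x.total_cmp(&b.x)) // left to right within a line
    });
}
```

For a two-column page with columns starting at x = 50 and x = 300, calling `sort_reading_order(&mut spans, &[50.0, 300.0])` places every left-column span before any right-column span. The sketch compares `y` values exactly, so spans on the same line with slightly different baselines would need to be grouped with a tolerance first.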

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions