Why is the extracted text in the wrong order?

PDF content streams do not guarantee reading order. Glyphs can be drawn in any order. Use extract_spans() and sort spans by y descending then x ascending to approximate reading order.
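
The sort described above can be sketched as follows. This is a minimal illustration, not PDFluent's implementation: `Span` is a hypothetical stand-in for whatever extract_spans() returns, and its field names are assumptions. It relies on PDF user space having its origin at the bottom-left, so a larger y is higher on the page.

```rust
/// Hypothetical stand-in for a positioned text span.
/// PDFluent's real TextSpan type may expose different fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub x: f32, // left edge in PDF user space
    pub y: f32, // baseline; y grows upward in PDF user space
    pub text: String,
}

/// Sorts spans top-to-bottom (y descending), then left-to-right (x ascending).
pub fn sort_reading_order(spans: &mut [Span]) {
    spans.sort_by(|a, b| {
        // total_cmp gives a total order over floats, so the sort never
        // sees an inconsistent comparison even if a NaN slips in.
        b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x))
    });
}
```

Spans on the same visual line often have slightly different baselines (superscripts, mixed fonts). In that case it may help to round y to a line-height bucket before comparing, at the cost of occasional errors at bucket boundaries.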

How do I extract text from a specific region of a page?

Use page.extract_text_in_rect(rect) to limit extraction to a bounding rectangle. This is useful for extracting table cells or header regions.
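
To make the idea of region-limited extraction concrete, the sketch below filters positioned text by a bounding rectangle. It is only an illustration of the concept: the `Rect` type, its field names, and the `(x, y, text)` tuples are assumptions, and PDFluent's own extract_text_in_rect may decide membership differently (for example by glyph overlap rather than by origin).

```rust
/// Hypothetical rectangle in PDF user space (origin at bottom-left).
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

impl Rect {
    /// True if the point lies inside the rectangle, edges included.
    pub fn contains(&self, x: f32, y: f32) -> bool {
        x >= self.x0 && x <= self.x1 && y >= self.y0 && y <= self.y1
    }
}

/// Keeps the text of every (x, y, text) entry whose origin is inside `rect`,
/// preserving the input order.
pub fn texts_in_rect<'a>(spans: &'a [(f32, f32, &'a str)], rect: Rect) -> Vec<&'a str> {
    spans
        .iter()
        .filter(|s| rect.contains(s.0, s.1))
        .map(|s| s.2)
        .collect()
}
```

For a US Letter page (612 x 792 points), a header region could be described as `Rect { x0: 0.0, y0: 740.0, x1: 612.0, y1: 792.0 }`.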

Does extraction include text in annotations?

By default, text() only processes page content streams. Pass text_with_layout::include_annotations(true) to include annotation appearance streams.

How do I handle ligatures like fi or fl?

PDFluent decomposes ligatures to their constituent characters using Unicode NFKC normalization when building the text string. The raw glyph name is preserved in the TextSpan if you need it.
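
To show what that decomposition does to a string, here is a small standard-library-only sketch. It is not PDFluent's code and it is not full NFKC: it hard-codes only the common Latin ligature code points (U+FB00 to U+FB04 and U+FB06) with the same expansions NFKC gives them. The function name `expand_ligatures` is made up for this example; a real pipeline would more likely use a full normalization library such as the unicode-normalization crate.

```rust
/// Expands common Latin ligature code points to their constituent letters,
/// matching what NFKC normalization produces for these characters.
/// All other characters are passed through unchanged.
pub fn expand_ligatures(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            '\u{FB06}' => out.push_str("st"),
            other => out.push(other),
        }
    }
    out
}
```

This is mainly useful when post-processing text from another source that kept the ligature code points, so that searches for words like "office" or "final" still match.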


Extract text page by page from a PDF in Rust

Read the text content of each page as a plain string or as structured spans with font and position data.

rust

use pdfluent::PdfDocument;

fn main() -> pdfluent::Result<()> {
    let doc = PdfDocument::open("file.pdf")?;
    for page in doc.pages() {
        println!("{}", page.text()?);
    }
    Ok(())
}

Install: cargo add pdfluent@1.0.0-beta.17

Step by step

Open the document

Open the PDF. Text is extracted one page at a time, so you can process each page as soon as it is read.

rust

use pdfluent::prelude::*;

let doc = PdfDocument::open("document.pdf")?;

Iterate pages and extract text

doc.pages() returns an iterator of Page<'_>. Each Page has a text() method that returns Result<String>.

rust

for page in doc.pages() {
    let text = page.text()?;
    println!("page {}: {} chars", page.number(), text.len());
}

Collect into a single String

For downstream processing, join the per-page strings with page separators.

rust

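// Join the text of every page, separated by a blank line.
// Note: unwrap_or_default() turns a page that fails to extract into an
// empty string, so extraction errors are silently dropped here.
let combined: String = doc
    .pages()
    .map(|p| p.text().unwrap_or_default())
    .collect::<Vec<_>>()
    .join("\n\n");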
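
If a failed page should stop processing rather than become an empty string, the per-page results can be collected into a single Result first. The helper below is a sketch with a made-up name (`join_pages`); it is generic over the error type, so it does not depend on anything in PDFluent beyond text() returning a Result<String>.

```rust
/// Joins per-page text with a blank line between pages.
/// Returns the first error encountered instead of skipping the page.
pub fn join_pages<E>(
    pages: impl IntoIterator<Item = Result<String, E>>,
) -> Result<String, E> {
    // Collecting into Result<Vec<_>, _> short-circuits on the first Err.
    let texts: Vec<String> = pages.into_iter().collect::<Result<_, _>>()?;
    Ok(texts.join("\n\n"))
}
```

With the API shown in this guide, the call would presumably look like `join_pages(doc.pages().map(|p| p.text()))?`.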

Notes and tips

Text extraction follows the PDF content stream order, which may differ from visual reading order in multi-column layouts. Use extract_spans() and sort by rect position for precise column order.
Characters with custom encoding or Type3 fonts may not map cleanly to Unicode. PDFluent uses ToUnicode maps where available.
Encrypted PDFs must be opened with PdfDocument::open_with_password before text extraction.
For scanned PDFs without a text layer, text() returns an empty string. You need OCR for image-based documents.
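
As a follow-up to the last note, one simple way to decide which pages to send to OCR is to look for pages whose extracted text is empty. The sketch below works on strings you have already collected; the function name `pages_needing_ocr` is hypothetical, and treating whitespace-only text as empty is an assumption that goes slightly beyond what the note states.

```rust
/// Returns the 1-based positions of pages whose extracted text is empty
/// or contains only whitespace. Positions refer to the order of the input
/// slice, not to any page label stored in the PDF.
pub fn pages_needing_ocr(page_texts: &[String]) -> Vec<usize> {
    page_texts
        .iter()
        .enumerate()
        .filter(|(_, text)| text.trim().is_empty())
        .map(|(index, _)| index + 1)
        .collect()
}
```

An empty result does not prove the text is good: a scanned page can carry a poor-quality hidden text layer, so spot-check the output when accuracy matters.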

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions
