Extract text with bounding box positions in Rust

Get each word or character with its x, y, width, and height on the page. Useful for building search, redaction, or document analysis tools.

rust
use pdfluent::prelude::*;

fn main() -> Result<()> {
    let doc = PdfDocument::open("document.pdf")?;

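    // One entry per extracted text block, across all pages.
    for block in doc.text_with_layout()? {
        // Print the page number, the bounding box in PDF points, and the text.
        println!(
            "[page {}] x={:.1} y={:.1} w={:.1} h={:.1} {:?}",
            block.page, block.x, block.y, block.width, block.height, block.text,
        );
    }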
    Ok(())
}

Install: cargo add pdfluent@1.0.0-beta.5

Step by step

1. Open the document

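Load the PDF from a file path with PdfDocument::open. The ? operator propagates any error raised while opening the file.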

rust
use pdfluent::prelude::*;

let doc = PdfDocument::open("document.pdf")?;

2. Call text_with_layout

text_with_layout returns a single Vec<TextBlock> covering the whole document. Each TextBlock carries the text, its 1-based page number, and bounding-box coordinates in PDF points (bottom-left origin).

rust
let blocks = doc.text_with_layout()?;
println!("{} text blocks", blocks.len());
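
Because the result is one flat list for the whole document, per-page processing usually starts by grouping blocks by page. The following is a minimal sketch of that step. It uses a stand-in TextBlock struct that mirrors only the fields this guide names; the field types and the group_by_page helper are assumptions made for illustration, not part of the PDFluent API.

```rust
use std::collections::BTreeMap;

/// Stand-in for the SDK's `TextBlock` (hypothetical: field types are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct TextBlock {
    pub page: usize, // 1-based page number
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
    pub text: String,
}

/// Group a document-wide list of blocks by page number.
/// A BTreeMap iterates its keys in ascending order, so pages come out sorted.
/// Blocks within a page keep the order they had in the input.
pub fn group_by_page(blocks: Vec<TextBlock>) -> BTreeMap<usize, Vec<TextBlock>> {
    let mut pages: BTreeMap<usize, Vec<TextBlock>> = BTreeMap::new();
    for block in blocks {
        pages.entry(block.page).or_default().push(block);
    }
    pages
}
```

With the real SDK, the same loop should apply to the Vec returned by doc.text_with_layout()?, provided the page field is an integer type that can be used as a map key.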

3. Access per-block fields

Each block exposes page, x, y, width, height, and text as fields. The example below prints the position and text of every block on page 1.

rust
for block in doc.text_with_layout()? {
    if block.page == 1 {
        println!("[{:.1},{:.1}] {:?}", block.x, block.y, block.text);
    }
}
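
For search or redaction, the per-block fields are typically turned into a list of page numbers and rectangles to highlight or cover. The following is a rough sketch of that idea. TextBlock here is a stand-in with assumed field types, and Rect and find_term are hypothetical helpers introduced for this example, not SDK items.

```rust
/// Stand-in for the SDK's `TextBlock` (hypothetical: field types are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct TextBlock {
    pub page: usize, // 1-based page number
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
    pub text: String,
}

/// A bounding box in PDF points (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Return (page, bounding box) for every block whose text contains `term`,
/// ignoring case. An empty term matches nothing.
pub fn find_term(blocks: &[TextBlock], term: &str) -> Vec<(usize, Rect)> {
    if term.is_empty() {
        return Vec::new();
    }
    let needle = term.to_lowercase();
    blocks
        .iter()
        .filter(|b| b.text.to_lowercase().contains(&needle))
        .map(|b| {
            let rect = Rect { x: b.x, y: b.y, width: b.width, height: b.height };
            (b.page, rect)
        })
        .collect()
}
```

This matches only inside a single block, so a phrase that spans several blocks is not found. A fuller search would first join neighboring blocks on the same line.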

Notes and tips

  • Coordinates use the PDF coordinate system: origin at the bottom-left, y increases upward.
  • For screen rendering where y starts at the top, compute screen_y = page_height_pts - (block.y + block.height), where page_height_pts is the page height in points.
  • Word grouping is heuristic. Very close characters that share a text run are merged into one word entry.
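
The coordinate flip described in the notes above can be sketched as a small helper. PdfRect, ScreenRect, and to_screen are hypothetical names introduced for this example. The sketch assumes that y is the bottom edge of the box, that the page is not rotated, and that its origin is at (0, 0). The scale parameter is pixels per point, for example dpi / 72.0, since one PDF point is 1/72 inch.

```rust
/// A box in PDF space: origin at the bottom-left, y increases upward, units are points.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PdfRect {
    pub x: f32,
    pub y: f32, // assumed to be the bottom edge of the box
    pub width: f32,
    pub height: f32,
}

/// A box in screen space: origin at the top-left, y increases downward, units are pixels.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ScreenRect {
    pub left: f32,
    pub top: f32,
    pub width: f32,
    pub height: f32,
}

/// Flip the y axis and scale from points to pixels.
/// The top edge in PDF space is y + height, so its distance from the top of
/// the page is page_height_pts - (y + height).
pub fn to_screen(rect: PdfRect, page_height_pts: f32, scale: f32) -> ScreenRect {
    ScreenRect {
        left: rect.x * scale,
        top: (page_height_pts - (rect.y + rect.height)) * scale,
        width: rect.width * scale,
        height: rect.height * scale,
    }
}
```

For a US Letter page (792 points tall), a box with y = 700 and height = 12 has its top edge 80 points below the top of the page. At a scale of 2.0, that is 160 pixels.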

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model rules out buffer overflows and use-after-free in safe code, which removes a common source of crashes in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions