PDF text extraction with a 97.5% pass rate. Preserve layout, reading order, and font metadata.
// The original import list was garbled. `ReadingOrder` and the location of
// `TextBlock` are assumed names; check the pdfluent docs for the exact items.
use pdfluent::{Sdk, extract::{ReadingOrder, TextBlock, TextOptions}};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let sdk = Sdk::init_with_license("license.json")?;
    let doc = sdk.open("contract.pdf")?;

    // Request layout-aware text with font and position metadata.
    let opts = TextOptions::builder()
        .reading_order(ReadingOrder::Spatial)
        .include_font_info(true)
        .include_coordinates(true)
        .preserve_tables(true)
        .build();
    let result = doc.text(opts)?;

    for page in result.pages() {
        println!("\n=== Page {} ===", page.number());
        for block in page.blocks() {
            match block {
                TextBlock::Paragraph(p) => {
                    println!(" [{}pt {}] {}", p.font_size(), p.font_name(), p.text());
                }
                TextBlock::Table(t) => {
                    for row in t.rows() {
                        println!(" | {} |", row.cells().join(" | "));
                    }
                }
                TextBlock::Heading(h) => {
                    println!(" ## [{}] {}", h.level(), h.text());
                }
                // Other block kinds (for example lists), if the enum has them.
                _ => {}
            }
        }
    }
    Ok(())
}

Run cargo add pdfluent@1.0.0-beta.5 to get started.
Extract text with character-level accuracy. Preserve reading order, paragraph boundaries, and column layout for complex multi-column documents.
Text extraction passes 97.5% of test cases from the PDF 1.7 specification corpus. Failures are reported with exact coordinates for debugging.
Detects multi-column layouts, recognizes table structure, and keeps list formatting. Extract text that makes sense in context, not just raw character streams.
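To make the multi-column idea concrete, here is a naive reading-order heuristic: group blocks into columns by horizontal gaps, then read each column top to bottom. It is only a sketch and not how pdfluent works internally. `Block`, `reading_order`, and the `min_gap` parameter are hypothetical names, and `y_top` is assumed to be measured downward from the top of the page.

```rust
/// Hypothetical stand-in for an extracted block; not a pdfluent type.
/// `x` is the left edge and `y_top` the distance from the top of the page, in points.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    text: String,
    x: f64,
    y_top: f64,
    width: f64,
}

/// Orders blocks column by column, left to right, then top to bottom.
/// A new column starts when a block's left edge lies at least `min_gap`
/// points to the right of everything placed in the current column.
fn reading_order(mut blocks: Vec<Block>, min_gap: f64) -> Vec<Block> {
    // Sort by left edge so members of the same column become adjacent.
    blocks.sort_by(|a, b| a.x.total_cmp(&b.x));

    let mut columns: Vec<Vec<Block>> = Vec::new();
    let mut right_edge = f64::NEG_INFINITY;
    for block in blocks {
        if columns.is_empty() || block.x - right_edge >= min_gap {
            columns.push(Vec::new());
            right_edge = f64::NEG_INFINITY;
        }
        right_edge = right_edge.max(block.x + block.width);
        columns.last_mut().unwrap().push(block);
    }

    // Within each column, read from the top of the page downward.
    for column in &mut columns {
        column.sort_by(|a, b| a.y_top.total_cmp(&b.y_top));
    }
    columns.into_iter().flatten().collect()
}
```

A heading that spans the full page width would bridge the gap and merge every column into one, which is one reason real layout analysis needs more than this heuristic.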
Get font names, sizes, colors, and styles for each text run. Useful for document analysis, redaction detection, and content classification.
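As one example of content classification from font metadata, the sketch below guesses which text runs are headings by comparing each run's size with the dominant body size. It assumes nothing about the SDK beyond "each run has text and a font size". `Run`, `body_font_size`, `heading_candidates`, and the `ratio` threshold are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one extracted text run; not a pdfluent type.
#[derive(Debug, Clone)]
struct Run {
    text: String,
    font_size: f32,
}

/// Returns the font size that covers the most characters, taken as the body size.
fn body_font_size(runs: &[Run]) -> Option<f32> {
    // Key by size in tenths of a point so the sizes can be hashed.
    let mut chars_per_size: HashMap<u32, usize> = HashMap::new();
    for run in runs {
        let key = (run.font_size * 10.0).round() as u32;
        *chars_per_size.entry(key).or_insert(0) += run.text.chars().count();
    }
    chars_per_size
        .into_iter()
        // Break ties on the size itself so the result is deterministic.
        .max_by_key(|&(size, count)| (count, size))
        .map(|(size, _)| size as f32 / 10.0)
}

/// Returns the runs whose font size is at least `ratio` times the body size.
fn heading_candidates(runs: &[Run], ratio: f32) -> Vec<&Run> {
    match body_font_size(runs) {
        Some(body) => runs.iter().filter(|r| r.font_size >= body * ratio).collect(),
        None => Vec::new(),
    }
}
```

Size alone is a weak signal; in practice it would likely be combined with font weight, color, and position on the page.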
Every text block includes x, y, width, height coordinates. Map extracted text back to the original PDF for highlighting or annotation.
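Highlighting usually means converting from PDF user space (by default, origin at the bottom-left, y increasing upward, units of 1/72 inch) to a top-left screen space. The sketch below shows that conversion under the assumption that x and y name the lower-left corner of the block; the SDK's actual convention may differ. `BBox` and `to_screen_rect` are hypothetical names, not part of pdfluent.

```rust
/// Hypothetical bounding box; not a pdfluent type.
/// In PDF space, (x, y) is assumed to be the lower-left corner, in points.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Converts a PDF-space box to screen pixels with a top-left origin.
/// `page_height_pt` is the page height in points; `scale` is pixels per point.
fn to_screen_rect(pdf: BBox, page_height_pt: f64, scale: f64) -> BBox {
    // The top edge of the box in PDF space is y + height; flip it about the page height.
    let top_from_page_top = page_height_pt - (pdf.y + pdf.height);
    BBox {
        x: pdf.x * scale,
        y: top_from_page_top * scale,
        width: pdf.width * scale,
        height: pdf.height * scale,
    }
}
```

For a US Letter page (792 pt tall) rendered at 2 pixels per point, a 100 x 12 pt block at (72, 700) maps to a 200 x 24 px rectangle at (144, 160).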
Get the complete page structure: paragraphs, headings, tables, lists, and inline elements. Structured output for indexing or transformation.
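One possible transformation of structured output is rendering it as Markdown for indexing. The sketch below uses a simplified stand-in enum rather than the SDK's own types, so `Block` and `to_markdown` are hypothetical, and the first table row is assumed to be the header.

```rust
/// Simplified, hypothetical page structure; not a pdfluent type.
enum Block {
    Heading { level: usize, text: String },
    Paragraph(String),
    Table(Vec<Vec<String>>),
}

/// Renders blocks as Markdown, treating the first table row as the header.
fn to_markdown(blocks: &[Block]) -> String {
    let mut out = String::new();
    for block in blocks {
        match block {
            Block::Heading { level, text } => {
                // Markdown supports heading levels 1 through 6.
                out.push_str(&"#".repeat((*level).clamp(1, 6)));
                out.push(' ');
                out.push_str(text);
                out.push_str("\n\n");
            }
            Block::Paragraph(text) => {
                out.push_str(text);
                out.push_str("\n\n");
            }
            Block::Table(rows) => {
                for (i, row) in rows.iter().enumerate() {
                    out.push_str(&format!("| {} |\n", row.join(" | ")));
                    if i == 0 {
                        // Separator line that marks the first row as the header.
                        let rule = vec!["---"; row.len()].join(" | ");
                        out.push_str(&format!("| {} |\n", rule));
                    }
                }
                out.push('\n');
            }
        }
    }
    out
}
```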