Detect whether a PDF is scanned or contains selectable text

Before running text extraction, check whether the PDF was digitally created or is a scan of a physical document.

rust
// 1.0-compatible heuristic: count pages with near-empty extracted text.
// The full raster-analysis approach is deferred to 1.1.
use pdfluent::prelude::*;

fn main() -> Result<()> {
    let doc = PdfDocument::open("maybe_scan.pdf")?;
    let total = doc.page_count();
    let text_pages = doc
        .pages()
        .filter(|p| p.text().map(|t| t.trim().len() > 20).unwrap_or(false))
        .count();
    let is_likely_scan = text_pages * 100 / total.max(1) < 10;
    println!("likely scanned: {is_likely_scan}");
    Ok(())
}

Install: cargo add pdfluent@1.0.0-beta.5

Step by step

1. Add PDFluent to Cargo.toml

No additional features are required. Page inspection is part of the base crate.

toml
# Cargo.toml
[dependencies]
pdfluent = "1.0.0-beta.5"

2. Open the document and iterate pages

Use doc.pages() to get an iterator over all pages. Each Page exposes content-stream analysis methods, such as content_type() in the example below.

rust
use pdfluent::PdfDocument;

let doc = PdfDocument::open("document.pdf")?;

for (i, page) in doc.pages().enumerate() {
    println!("Page {}: {:?}", i + 1, page.content_type());
}

3. Check for selectable text and raster images

has_selectable_text() returns true if the page content stream contains any text operators. has_raster_images() returns true if the page contains XObject images.

rust
for page in doc.pages() {
    let has_text   = page.has_selectable_text();
    let has_images = page.has_raster_images();

    if !has_text && has_images {
        println!("This page appears to be a scan.");
    }
}
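
The two flags give four possible combinations, and only one of them is the classic scan pattern. The sketch below spells out all four. `PageFlags`, `PageKind`, and `classify_page` are hypothetical names used for illustration and are not part of PDFluent; in practice the flags would be filled from has_selectable_text() and has_raster_images().

```rust
/// Hypothetical per-page flags (not a PDFluent type). Fill these from
/// whatever page-inspection calls are available.
#[derive(Debug, Clone, Copy)]
struct PageFlags {
    has_text: bool,
    has_images: bool,
}

/// Hypothetical classification of a single page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageKind {
    /// Text operators only: a digitally created page.
    Digital,
    /// Text and images: a digital page with figures, or a scan with an OCR layer.
    TextAndImages,
    /// Images only: the usual scan pattern.
    LikelyScan,
    /// Neither: a blank page or vector-only content.
    NoTextNoImages,
}

fn classify_page(flags: PageFlags) -> PageKind {
    match (flags.has_text, flags.has_images) {
        (true, false) => PageKind::Digital,
        (true, true) => PageKind::TextAndImages,
        (false, true) => PageKind::LikelyScan,
        (false, false) => PageKind::NoTextNoImages,
    }
}

fn main() {
    // Example input: one digital page followed by one image-only page.
    let pages = [
        PageFlags { has_text: true, has_images: false },
        PageFlags { has_text: false, has_images: true },
    ];
    for (i, flags) in pages.iter().enumerate() {
        println!("Page {}: {:?}", i + 1, classify_page(*flags));
    }
}
```

Keeping `TextAndImages` separate from `Digital` matters because the two flags alone cannot tell a digital page with figures from a scan that carries an OCR text layer.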

4. Get a document-level scan score

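Count the pages that have no selectable text and divide by the total page count. A ratio above 80% strongly indicates that the document is scanned, or is mostly scanned with a few digital pages mixed in.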

rust
// max(1) avoids a division by zero (NaN ratio) on a document with no pages.
let total = doc.page_count().max(1) as f32;
let no_text = doc.pages()
    .filter(|p| !p.has_selectable_text())
    .count() as f32;

let scan_ratio = no_text / total;
println!("Scan ratio: {:.0}%", scan_ratio * 100.0);

if scan_ratio > 0.8 {
    println!("Likely a scanned document. Consider running OCR.");
}
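
The ratio can be wrapped in a small function that also handles an empty document explicitly. This is a minimal sketch that works on a plain slice of per-page booleans, so it runs without a PDF. `scan_ratio`, `verdict`, and `DocVerdict` are hypothetical names. The 0.8 threshold comes from the step above, while treating any ratio above zero as "mixed" is an assumption made for illustration.

```rust
/// Hypothetical document-level verdict (not a PDFluent type).
#[derive(Debug, PartialEq)]
enum DocVerdict {
    Empty,
    Digital,
    Mixed,
    LikelyScanned,
}

/// Fraction of pages without selectable text.
/// `page_has_text[i]` is true when page `i` has selectable text.
/// Returns `None` for a document with no pages.
fn scan_ratio(page_has_text: &[bool]) -> Option<f32> {
    if page_has_text.is_empty() {
        return None;
    }
    let no_text = page_has_text.iter().filter(|&&has_text| !has_text).count();
    Some(no_text as f32 / page_has_text.len() as f32)
}

fn verdict(page_has_text: &[bool]) -> DocVerdict {
    match scan_ratio(page_has_text) {
        None => DocVerdict::Empty,
        // Threshold from the guide: more than 80% of pages lack text.
        Some(r) if r > 0.8 => DocVerdict::LikelyScanned,
        // Assumption: any page without text makes the document "mixed".
        Some(r) if r > 0.0 => DocVerdict::Mixed,
        Some(_) => DocVerdict::Digital,
    }
}

fn main() {
    // Nine pages without text, one page with text.
    let mut pages = vec![false; 9];
    pages.push(true);
    println!("{:?}", verdict(&pages));
}
```

Returning `Option<f32>` keeps the "no pages" case out of the arithmetic, so callers cannot mistake an empty file for a fully digital one.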

5. Check whether text is hidden (OCR layer)

Some scanned PDFs have a hidden text layer added by OCR software. Use has_invisible_text() to detect this.

rust
for (i, page) in doc.pages().enumerate() {
    if page.has_invisible_text() {
        println!(
            "Page {} has an OCR text layer (invisible text).",
            i + 1
        );
    }
}
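
For background, OCR tools commonly write the hidden layer with text rendering mode 3 (the `3 Tr` operator), which the PDF specification defines as text that is neither filled nor stroked. The sketch below shows that idea on an already-decoded content stream passed in as a string. It is an illustration only, not PDFluent's implementation: `contains_invisible_text` and `strip_literal_strings` are hypothetical helpers, and the tokenizer is deliberately naive (it ignores inline image data and comments).

```rust
/// Drops PDF literal strings `( ... )`, honoring nested parentheses and
/// backslash escapes, so their contents are not mistaken for operators.
fn strip_literal_strings(content: &str) -> String {
    let mut out = String::with_capacity(content.len());
    let mut depth = 0usize;
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        match c {
            // Inside a string, a backslash escapes the next character.
            '\\' if depth > 0 => {
                chars.next();
            }
            '(' => depth += 1,
            ')' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    out.push(' '); // keep neighboring tokens separated
                }
            }
            _ if depth > 0 => {} // skip string contents
            _ => out.push(c),
        }
    }
    out
}

/// Returns true if a text-showing operator runs while rendering mode 3 is active.
fn contains_invisible_text(content: &str) -> bool {
    let cleaned = strip_literal_strings(content);
    let tokens: Vec<&str> = cleaned
        .split(|c: char| c.is_whitespace() || c == '[' || c == ']')
        .filter(|t| !t.is_empty())
        .collect();

    let mut mode: i32 = 0; // the default rendering mode is 0 (fill)
    let mut saved: Vec<i32> = Vec::new(); // q / Q save and restore graphics state
    for (i, tok) in tokens.iter().enumerate() {
        match *tok {
            "q" => saved.push(mode),
            "Q" => {
                if let Some(m) = saved.pop() {
                    mode = m;
                }
            }
            "Tr" => {
                // The operand precedes the operator: `3 Tr`.
                let operand = i.checked_sub(1).and_then(|j| tokens[j].parse::<i32>().ok());
                if let Some(m) = operand {
                    mode = m;
                }
            }
            // Text-showing operators: Tj, TJ, ' and ".
            "Tj" | "TJ" | "'" | "\"" if mode == 3 => return true,
            _ => {}
        }
    }
    false
}

fn main() {
    let ocr_layer = "BT /F1 12 Tf 3 Tr (Hello) Tj ET";
    let visible = "BT /F1 12 Tf (Hello) Tj ET";
    println!("ocr layer: {}", contains_invisible_text(ocr_layer));
    println!("visible:   {}", contains_invisible_text(visible));
}
```

Real content streams are usually compressed and must be decoded first, so a sketch like this is useful for understanding the signal rather than as a replacement for has_invisible_text().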

Notes and tips

  • A page with a background image and no text operators is the most common scan pattern. Checking for it has a low false-positive rate.
  • PDFs created by scanning software like Adobe Scan often include a hidden OCR text layer. has_invisible_text() detects this.
  • Vector PDFs with no images and no text (diagrams, flowcharts) return false for both flags. Use has_vector_content() for those.
  • Text covered by a white rectangle may still be reported as selectable text, because these checks look at the content stream rather than the rendered pixels. Catching that case requires rasterizing the page and analyzing the pixels.

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions