What is the difference between a scanned PDF and a digital PDF?

A digital PDF was created directly from a software application. It contains text operators in the page content stream. A scanned PDF is a photograph or scan of a paper document stored as a raster image inside the PDF. There is no selectable text unless OCR was applied.

Does PDFluent perform OCR?

PDFluent does not include an OCR engine. It can detect that a page is a scan and extract any existing hidden text layer. For full OCR, pipe the rasterized page output to an external OCR library such as Tesseract.

Can a single PDF have both scanned and digital pages?

Yes, this is common in mixed-content documents. The page-by-page approach in this guide handles that correctly by checking each page independently.

How do I rasterize a scanned PDF page to run OCR on it?

Call page.render(dpi) to get a RasterImage. This returns the pixel data at the requested DPI. Pass it to a Tesseract Rust binding such as tesseract-rs for OCR.
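
When choosing a DPI, it helps to know how large the resulting image will be. The sketch below is not part of the PDFluent API; `pixel_size` is a hypothetical helper. It relies only on the fact that one PDF point is 1/72 inch by default, and it assumes the page size in points is already known.

```rust
/// Convert a page size in PDF points (1 point = 1/72 inch) to pixel
/// dimensions at the given DPI, rounding to the nearest whole pixel.
/// Hypothetical helper for illustration; not a PDFluent function.
fn pixel_size(width_pt: f64, height_pt: f64, dpi: f64) -> (u32, u32) {
    let scale = dpi / 72.0;
    (
        (width_pt * scale).round() as u32,
        (height_pt * scale).round() as u32,
    )
}

fn main() {
    // US Letter is 612 x 792 points.
    let (w, h) = pixel_size(612.0, 792.0, 300.0);
    println!("{w} x {h} pixels"); // prints "2550 x 3300 pixels"

    // Rough size of an uncompressed 8-bit RGB buffer at that resolution.
    let bytes = w as u64 * h as u64 * 3;
    println!("about {:.1} MiB uncompressed", bytes as f64 / (1024.0 * 1024.0));
}
```

Higher DPI values generally give OCR engines more detail to work with, at the cost of memory and processing time, so it can be worth estimating the buffer size before rendering many pages.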

Detect whether a PDF is scanned or contains selectable text

Before running text extraction, check whether the PDF was digitally created or is a scan of a physical document.

rust

use pdfluent::PdfDocument;

fn main() -> pdfluent::Result<()> {
    let doc = PdfDocument::open("file.pdf")?;
    let text = doc.extract_text()?;
    if text.trim().len() < 20 {
        println!("likely scanned (no text layer)");
    }
    Ok(())
}

Install: cargo add pdfluent@1.0.0-beta.17

Step by step

Add PDFluent to Cargo.toml

No additional features are required. Page inspection is part of the base crate.

toml

# Cargo.toml
[dependencies]
pdfluent = "1.0.0-beta.8"

Open the document and iterate pages

Use doc.page_count() to get the number of pages and doc.page(i) to load each one by index. Calling text() on a page returns the text extracted from that page.

rust

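use pdfluent::PdfDocument;

fn main() -> pdfluent::Result<()> {
    let doc = PdfDocument::open("document.pdf")?;

    // Visit every page by index and report whether it yields any text.
    for i in 0..doc.page_count() {
        let has_text = !doc.page(i)?.text()?.trim().is_empty();
        println!("Page {}: selectable text = {}", i + 1, has_text);
    }
    Ok(())
}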

Check for selectable text and raster images

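has_selectable_text() returns true if the page content stream contains any text operators, and has_raster_images() returns true if the page contains XObject images. The example below applies a simpler check: it extracts each page's text with text() and treats an empty result as a sign of a scan.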

rust

for i in 0..doc.page_count() {
    let text = doc.page(i)?.text()?;
    if text.trim().is_empty() {
        println!("Page {} appears to be a scan (no selectable text).", i + 1);
    }
}
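
An empty-string check treats a page as digital even if it contains only a stray page number or a few OCR artifacts. One possible refinement, sketched here with a hypothetical helper `has_meaningful_text`, is to require a minimum number of non-whitespace characters, in the spirit of the 20-character threshold used in the opening example. The function works on plain strings, so it does not depend on any PDFluent call; the threshold value is an assumption to tune for your documents.

```rust
/// Hypothetical helper: true when `text` contains at least `min_chars`
/// non-whitespace characters. Counts characters, not bytes, so
/// multi-byte text is not over-counted.
fn has_meaningful_text(text: &str, min_chars: usize) -> bool {
    text.chars()
        .filter(|c| !c.is_whitespace())
        .take(min_chars) // stop early once the threshold is reached
        .count()
        >= min_chars
}

fn main() {
    // Stand-ins for per-page extracted text.
    let pages = ["", "  \n 3 \n", "Invoice 2024-001\nTotal due: 120.00 EUR"];
    for (i, text) in pages.iter().enumerate() {
        println!(
            "Page {}: meaningful text = {}",
            i + 1,
            has_meaningful_text(text, 20)
        );
    }
}
```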

Get a document-level scan score

Count the pages without text and divide by the total page count. A ratio above 80% is a strong indicator that the document is a scan or a mix of scanned and digital pages.

rust

let total = doc.page_count();
let mut no_text = 0usize;
for i in 0..total {
    if doc.page(i)?.text()?.trim().is_empty() {
        no_text += 1;
    }
}

// Guard against an empty document to avoid dividing by zero.
let scan_ratio = if total == 0 {
    0.0
} else {
    no_text as f32 / total as f32
};
println!("Scan ratio: {:.0}%", scan_ratio * 100.0);

if scan_ratio > 0.8 {
    println!("Likely a scanned document. Consider running OCR.");
}
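
The same calculation can be factored into a function that is easy to unit test without a PDF on disk. The sketch below takes already-extracted page texts as input; `scan_ratio`, `classify`, and the `DocKind` labels are hypothetical names, and the split into three categories is an assumption layered on top of the 80% threshold from this guide.

```rust
#[derive(Debug, PartialEq)]
enum DocKind {
    Digital,       // every page has text
    Mixed,         // some pages lack text, but not more than 80%
    MostlyScanned, // more than 80% of pages lack text
}

/// Fraction of pages whose text is empty or whitespace-only.
/// Returns None for a document with no pages, so there is no division by zero.
fn scan_ratio(page_texts: &[&str]) -> Option<f32> {
    if page_texts.is_empty() {
        return None;
    }
    let no_text = page_texts.iter().filter(|t| t.trim().is_empty()).count();
    Some(no_text as f32 / page_texts.len() as f32)
}

/// Map a scan ratio to a coarse label (thresholds are illustrative).
fn classify(ratio: f32) -> DocKind {
    if ratio > 0.8 {
        DocKind::MostlyScanned
    } else if ratio > 0.0 {
        DocKind::Mixed
    } else {
        DocKind::Digital
    }
}

fn main() {
    // Stand-ins for per-page extracted text: three of four pages are blank.
    let pages = ["Chapter 1", "", "   ", ""];
    match scan_ratio(&pages) {
        Some(r) => println!("ratio {:.2} -> {:?}", r, classify(r)),
        None => println!("document has no pages"),
    }
}
```

For a mixed result, per-page handling is usually more useful than a single document-level decision: run OCR only on the pages that lack text.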

Check whether text is hidden (OCR layer)

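Some scanned PDFs have a hidden text layer added by OCR software. Use has_invisible_text() to detect this. The example below covers the opposite case: when extract_text() returns nothing, the document has no text layer at all, hidden or visible, and needs OCR before any text can be extracted.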

rust

// An image-only PDF yields little or no extractable text
let extracted = doc.extract_text()?;
if extracted.trim().is_empty() {
    println!("No extractable text layer - the document is image-only.");
}

Notes and tips

A page with a background image and no text operators is the most common scan pattern. Checking for that pattern has a low false-positive rate.
PDFs created by scanning software like Adobe Scan often include a hidden OCR text layer. has_invisible_text() detects this.
Vector PDFs with no images and no text (diagrams, flowcharts) return false for both has_selectable_text() and has_raster_images(). Use has_vector_content() for those.
Text that is covered by a white rectangle may still be detected as selectable text. Pixel-level analysis requires rasterizing the page.
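
The notes above can be read as a small decision table. The sketch below encodes it over plain booleans; `PageFlags`, `PageKind`, and `classify_page` are hypothetical names, and the fields only mirror the flags mentioned in this guide. It assumes a hidden OCR layer still counts as text for the `has_text` flag, which is an assumption rather than documented behavior.

```rust
#[derive(Debug, PartialEq)]
enum PageKind {
    Digital,     // visible, selectable text
    ScanWithOcr, // raster image with a hidden text layer on top
    Scan,        // raster image and no text at all
    VectorOnly,  // drawings only: no text, no raster image
    Empty,       // nothing detectable on the page
}

/// Hypothetical per-page flags, filled in by whatever inspection you use.
struct PageFlags {
    has_text: bool,
    text_is_invisible: bool,
    has_raster_images: bool,
    has_vector_content: bool,
}

fn classify_page(f: &PageFlags) -> PageKind {
    match (
        f.has_text,
        f.text_is_invisible,
        f.has_raster_images,
        f.has_vector_content,
    ) {
        // Hidden text over an image is the typical OCR-processed scan.
        (true, true, true, _) => PageKind::ScanWithOcr,
        (true, _, _, _) => PageKind::Digital,
        (false, _, true, _) => PageKind::Scan,
        (false, _, false, true) => PageKind::VectorOnly,
        (false, _, false, false) => PageKind::Empty,
    }
}

fn main() {
    let page = PageFlags {
        has_text: false,
        text_is_invisible: false,
        has_raster_images: true,
        has_vector_content: false,
    };
    println!("{:?}", classify_page(&page)); // prints "Scan"
}
```

Only the `Scan` case strictly needs OCR; a `ScanWithOcr` page already carries extractable text, although its quality depends on the software that produced it.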

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions
