Advanced

How to Make a Scanned PDF Searchable With OCR

April 2026 · 6 min read

What is a scanned PDF

A scanned PDF is a document where the pages are images - photographs of paper - rather than text stored as data. When you scan a paper document on a flatbed or sheet-fed scanner, the scanner captures a bitmap image of the page. The PDF wraps that image. The text you see is pixels; there is no underlying text layer.

The consequence is that you cannot search a scanned PDF for a word, select text to copy it, or have a screen reader interpret the content. From the perspective of any software that works with text, the document is opaque - just an image.

Many documents arrive in this form. Old archives scanned for digitization, paper contracts photographed on a phone and converted to PDF, and faxes that have been printed and re-scanned are all image-only PDFs.

How OCR works

Optical Character Recognition (OCR) analyses image pixels and infers which characters are present from their visual shape. The OCR engine has been trained on millions of character samples in many fonts and sizes. It identifies regions of the image that look like text, segments those regions into lines, words, and characters, and matches each character to the most likely letter, number, or symbol.
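The matching step can be illustrated with a deliberately tiny sketch. It is not how Tesseract works internally (modern engines use trained neural networks rather than fixed templates); it only shows the idea of picking the most likely character for a glyph. The 3x5 templates and the function names are made up for this example.

```python
# Toy 3x5 glyph templates; '#' marks an ink pixel. Purely illustrative.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}


def pixel_distance(a, b):
    """Count the pixels at which two equally sized glyph bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify(glyph):
    """Return the template character with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


# A "7" with one stray ink pixel, as a noisy scan might produce.
noisy_seven = ["###", "..#", ".##", ".#.", ".#."]
print(classify(noisy_seven))  # 7
```

The noisy glyph differs from the "7" template by one pixel and from the others by several, so it is still read as a 7. Blur and low contrast increase those distances for every candidate, which is why poor scans produce more misreads.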

The output of OCR is a text layer that is embedded behind the visible image in the PDF. The page looks the same - the image is still there - but now there is also text data that search engines, copy functions, and screen readers can access. This is called a searchable PDF or a PDF with a hidden text layer.
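One common way to store a hidden text layer is PDF text rendering mode 3, which draws text with neither fill nor stroke, so it is invisible but still selectable. The sketch below builds the content-stream operators for one line of recognized text. Whether PDFsuite writes its text layer exactly this way is an assumption; the font resource name F1 and the helper names are hypothetical.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Build content-stream operators that place `text` invisibly at (x, y).

    `font_name` must match a font declared in the page resources
    (F1 here is a hypothetical name).
    """
    return "\n".join([
        "BT",                                  # begin text object
        f"/{font_name} {font_size:g} Tf",      # select font and size
        "3 Tr",                                # rendering mode 3: invisible
        f"{x:g} {y:g} Td",                     # move to the text position
        f"({escape_pdf_string(text)}) Tj",     # show the string
        "ET",                                  # end text object
    ])


print(invisible_text_ops("Invoice (copy)", 72, 720, 11))
```

This handles only simple ASCII text. Real OCR output also needs correct font encoding for non-ASCII characters and per-word positioning so that selection highlights line up with the image.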

OCR accuracy depends on the quality of the source image. A clean, straight scan at 300 DPI with good contrast typically produces near-perfect results on printed text. A blurry phone photo, a photocopy of a photocopy, or handwritten text produces lower accuracy. Most typed documents OCR well; most handwriting does not.
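To see what 300 DPI means in practice, the arithmetic below converts page size and type size into pixels. It is a small illustration with made-up function names; the only facts it relies on are that DPI is pixels per inch and that one typographic point is 1/72 of an inch.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size scanned at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def text_height_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of text set at `point_size` points."""
    return point_size * dpi / 72


print(scan_pixels(8.5, 11, 300))   # (2550, 3300) for a US Letter page
print(text_height_px(12, 300))     # 50.0
print(text_height_px(12, 150))     # 25.0
```

Halving the resolution halves the number of pixels available for each character in both directions, which leaves the engine far less shape detail to work with.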

How to OCR with PDFsuite

Open /tools/ocr and upload your scanned PDF. The tool uses Tesseract.js, a WebAssembly build of the open-source Tesseract OCR engine. Everything runs in your browser - the scanned document never leaves your device.

Select the document language. OCR accuracy improves significantly when the engine knows what language to expect. Tesseract supports over 100 languages. If your document is in English, the default is correct. For multilingual documents, select the primary language.

Click Run OCR. The tool processes each page sequentially. For a 10-page scan, this typically takes 30-60 seconds depending on your device. A progress bar shows which page is being processed. When complete, the searchable PDF downloads automatically.
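The page-by-page flow can be sketched as a simple loop that recognizes one page at a time and reports progress after each. This is an outline under assumptions, not PDFsuite's code: fake_recognize stands in for a real OCR call, and the time estimate simply scales the 30-60 seconds per 10 pages quoted above linearly.

```python
from typing import Callable, Iterable, Iterator


def ocr_pages(pages: Iterable[bytes],
              recognize: Callable[[bytes], str]) -> Iterator[tuple[int, str]]:
    """Run `recognize` on one page at a time, yielding (page_number, text)."""
    for number, image in enumerate(pages, start=1):
        yield number, recognize(image)


def estimate_seconds(page_count: int) -> tuple[int, int]:
    """Rough (low, high) duration, assuming 3-6 seconds per page."""
    return page_count * 3, page_count * 6


def fake_recognize(image: bytes) -> str:
    # Stand-in for a real OCR call; it just decodes the bytes.
    return image.decode("ascii")


pages = [b"page one", b"page two", b"page three"]
results = []
for number, text in ocr_pages(pages, fake_recognize):
    results.append(text)
    print(f"Processed page {number} of {len(pages)}")

print(estimate_seconds(10))  # (30, 60)
```

Processing sequentially keeps memory use bounded to one page image at a time, which matters when everything runs inside a browser tab.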

After OCR - verifying accuracy

Open the output PDF and use Ctrl+F (or Cmd+F) to search for a word you know is on a specific page. If search finds it, OCR succeeded at a basic level. Try several words from different parts of the document, including any that appear in different fonts or sizes.

Select a paragraph of text and copy it to a text editor. Compare the pasted text with the visible document. Any OCR errors will be apparent - transposed letters, incorrect punctuation, numbers misread as letters (the classic 0/O and 1/l confusions).
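If you have a known-good transcription of a passage, the comparison can be automated. The sketch below uses Python's standard difflib module to score the similarity between the expected text and the pasted OCR text, and to list single-character substitutions such as the 0/O and 1/l confusions. The sample strings and function names are invented for illustration.

```python
import difflib

# Pairs the article calls out as classic OCR confusions, in both directions.
CLASSIC = {("0", "O"), ("O", "0"), ("1", "l"), ("l", "1")}


def similarity(expected: str, ocr_text: str) -> float:
    """Similarity between 0.0 and 1.0; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, expected, ocr_text, autojunk=False).ratio()


def substitutions(expected: str, ocr_text: str) -> list[tuple[str, str]]:
    """List (expected_char, ocr_char) pairs where characters were swapped."""
    matcher = difflib.SequenceMatcher(None, expected, ocr_text, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map cleanly character to character.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(expected[i1:i2], ocr_text[j1:j2]))
    return pairs


expected = "Invoice 1001 total"
pasted = "Invoice l0O1 total"

swaps = substitutions(expected, pasted)
print(f"similarity: {similarity(expected, pasted):.2f}")
print("substitutions:", swaps)
print("classic confusions:", [p for p in swaps if p in CLASSIC])
```

A score close to 1.0 with only a handful of substitutions suggests the text layer is good enough for search. Digit-heavy content such as invoice numbers deserves a closer look, because a single swapped character there changes the meaning.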

For documents where OCR accuracy matters - contracts you will search by clause, medical records you need to reference by term - review the text layer carefully for errors. Correcting the extracted text in a text editor and re-embedding it is possible but involved. For casual searchability, minor OCR errors are usually acceptable.

OCR for languages and special characters

Tesseract handles Latin-script languages excellently - Western European languages, including accented characters, work well. Arabic, Hebrew, Chinese, Japanese, and Korean are supported but may show lower accuracy than Latin-script languages, particularly for complex character sets.
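Tesseract identifies each language model by a short code such as eng or chi_sim. The lookup below maps a few of the languages mentioned here to their Tesseract codes and flags the non-Latin scripts that this article suggests reviewing more carefully. Whether PDFsuite exposes these codes directly is an assumption, and the function names are hypothetical.

```python
# Tesseract language codes for a few languages named in the article.
LANG_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Arabic": "ara",
    "Hebrew": "heb",
    "Chinese (Simplified)": "chi_sim",
    "Japanese": "jpn",
    "Korean": "kor",
}

# Scripts the article says may show lower accuracy than Latin script.
NON_LATIN = {"ara", "heb", "chi_sim", "jpn", "kor"}


def language_code(name: str = "English") -> str:
    """Return the Tesseract code for a language name, defaulting to English."""
    try:
        return LANG_CODES[name]
    except KeyError:
        raise ValueError(f"no language code known for {name!r}") from None


def may_need_extra_review(code: str) -> bool:
    """True for scripts where OCR output is worth checking more carefully."""
    return code in NON_LATIN


print(language_code("German"), may_need_extra_review(language_code("German")))
print(language_code("Japanese"), may_need_extra_review(language_code("Japanese")))
```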

Mathematical notation, chemical formulas, and non-standard symbols are challenging for general OCR engines. Tesseract is trained primarily on prose text. If your document contains extensive formulas or notation, expect more errors in those sections.

Handwriting is largely beyond what Tesseract can handle reliably, because it is trained on printed type. Some cloud OCR services (Google Cloud Vision, Amazon Textract) handle handwriting better. For handwritten documents, browser-based OCR is not the right tool.

OCR and file size

Adding an OCR text layer increases the PDF file size, typically by 10-30% depending on the amount of text. The image data is unchanged; the text layer adds a relatively small amount of data on top of it.
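As a quick worked example, the 10-30% range translates into output sizes like this. The function name is made up, and the range is the rule of thumb given above rather than a guarantee for any particular file.

```python
def size_after_ocr(original_bytes: int) -> tuple[int, int]:
    """Estimated (low, high) size in bytes after adding a text layer."""
    low = original_bytes * 110 // 100   # 10% larger
    high = original_bytes * 130 // 100  # 30% larger
    return low, high


low, high = size_after_ocr(4_000_000)
print(f"{low / 1e6:.1f} MB to {high / 1e6:.1f} MB")  # 4.4 MB to 5.2 MB
```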

If you need to reduce file size after OCR, run the result through /tools/compress. Image compression reduces the quality of the scan image but leaves the text layer intact, so the document stays searchable at any compression level.

For archival purposes, keep both the original scanned PDF and the OCR-processed version. The original is the definitive visual record. The OCR version is the working searchable copy.
