What this tool does

Copying text from a PDF and pasting it somewhere else is one of the most reliably terrible experiences in computing. The pasted text arrives broken: every line a separate paragraph, words split with hyphens at line ends, page numbers and headers interleaved with the actual content, smart quotes everywhere.

This tool targets all of those problems in one pass. Paste the broken text, choose which cleanups to apply, and get back text that reads like the original document, minus the page furniture.

The five main fixes

Rejoin broken paragraphs

PDFs are built around fixed-size pages, so text is stored line by line as it was laid out. When you copy text, each visual line becomes a separate text line, even if those lines are part of the same paragraph. The result is text where a single paragraph might arrive as 8 separate lines, each roughly 70 to 80 characters long.

This tool joins those lines back into paragraphs. The heuristic: if a line doesn't end in sentence-ending punctuation (period, exclamation mark, question mark, colon, or closing quote), it's probably continuing onto the next line, so the two are joined. Lines that do end in sentence-ending punctuation are treated as paragraph breaks.

This isn't perfect. A sentence that happens to end exactly at a line end will split its paragraph in two, and headings that don't end in periods will sometimes get joined to the following paragraph. The tool tries to detect short, title-cased lines that look like headings and keeps them separate, but you may need manual cleanup for documents with unusual structure.
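The joining rule can be sketched in a few lines. This is a minimal Python illustration, not the tool's actual code (which runs as JavaScript in your browser). The function names, the 60-character heading limit, the "every word capitalized" heading test, and the choice to treat blank lines as paragraph breaks are all assumptions made for the example.

```python
# Line endings the heuristic treats as closing a paragraph:
# period, exclamation mark, question mark, colon, and closing quotes.
TERMINATORS = (".", "!", "?", ":", '"', "'", "\u201d", "\u2019")


def looks_like_heading(line: str, max_len: int = 60) -> bool:
    """Assumed heading test: short, no end punctuation, every word capitalized."""
    words = line.split()
    if not words or len(line) > max_len or line.endswith(TERMINATORS):
        return False
    # Words starting with a digit or symbol do not disqualify a heading.
    return all(w[0].isupper() or not w[0].isalpha() for w in words)


def rejoin_paragraphs(text: str) -> str:
    """Join visual lines into paragraphs separated by one blank line."""
    paragraphs = []
    current = []

    def flush():
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()                  # assumption: a blank line ends a paragraph
            continue
        if looks_like_heading(line):
            flush()
            paragraphs.append(line)  # keep the heading on its own
            continue
        current.append(line)
        if line.endswith(TERMINATORS):
            flush()                  # sentence-ending punctuation closes the paragraph

    flush()
    return "\n\n".join(paragraphs)
```

Calling rejoin_paragraphs on a pasted excerpt returns the paragraphs separated by blank lines. A stricter or looser heading test changes how many headings get swallowed into body text, which is why that part is the most likely to need tuning.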

Fix word-break hyphens

PDFs often hyphenate words at line breaks: "exam-" at the end of one line, "ple" at the start of the next. When you paste, you get "exam-\nple" instead of "example". This tool detects that pattern (letter, hyphen, newline, letter) and rejoins the word.

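Hyphens inside a line are left alone: "mother-in-law" stays intact because none of its hyphens sits at a line break. Only letter-hyphen-newline-letter sequences are rejoined. The flip side is that a compound that happens to wrap at its own hyphen ("well-" at the end of one line, "known" at the start of the next) comes out as "wellknown" and needs a manual fix.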
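The letter, hyphen, newline, letter pattern maps directly onto a regular expression. The sketch below is a Python illustration under assumed names, not the tool's own browser code. It also accepts Windows-style line endings, which is an assumption added for the example.

```python
import re

# A letter, a hyphen, a line break, then another letter.
# [^\W\d_] matches any Unicode letter (a word character that is
# neither a digit nor an underscore).
_BROKEN_WORD = re.compile(r"([^\W\d_])-\r?\n([^\W\d_])")


def fix_hyphen_breaks(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    # Keep the two letters, drop the hyphen and the line break between them.
    return _BROKEN_WORD.sub(r"\1\2", text)
```

Because the pattern requires a letter on both sides, a numeric range split across lines (such as "10-" followed by "12") is left untouched, as is any hyphen that is not immediately followed by a line break.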

Remove page numbers

Standalone lines that contain only a number, "Page N", "Page N of M", or "- N -" are removed. These are page numbers from headers or footers that get swept up in the copy-paste but aren't part of the content.
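The four page-number shapes listed above can be matched with one anchored regular expression. This is a hedged Python sketch with hypothetical names, not the tool's actual implementation. Matching "page" case-insensitively and tolerating surrounding whitespace are assumptions.

```python
import re

# A whole line that is only: a number, "Page N", "Page N of M", or "- N -".
_PAGE_LINE = re.compile(
    r"^\s*(?:\d+|page\s+\d+(?:\s+of\s+\d+)?|-\s*\d+\s*-)\s*$",
    re.IGNORECASE,
)


def remove_page_numbers(text: str) -> str:
    """Drop lines that consist solely of a page-number marker."""
    kept = [line for line in text.splitlines() if not _PAGE_LINE.match(line)]
    return "\n".join(kept)
```

Anchoring the pattern to the whole line is what keeps numbers inside ordinary sentences safe: only a line with nothing else on it is dropped.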

Remove repeating headers and footers

If the same line appears 3 or more times in the text, it's almost certainly a running header or footer (the document title, the chapter name, the author name) that repeats on every page. This tool removes all instances of any short line that appears 3+ times.

The threshold (3 repetitions) is chosen because it's unusual for legitimate content to repeat an identical short line that often, while a running header repeats on every page you copy. An excerpt that spans fewer than three pages won't trigger the filter. Long lines (over 100 characters) aren't subject to it either, since they're probably real content.
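The repetition filter amounts to counting identical lines and dropping the short ones that reach the threshold. Below is a small Python sketch of that idea, with hypothetical names and defaults taken from the description above (3 repetitions, 100 characters). Ignoring blank lines and comparing lines after trimming whitespace are assumptions.

```python
from collections import Counter


def remove_repeated_lines(text: str, min_repeats: int = 3, max_len: int = 100) -> str:
    """Remove every copy of a short line that appears min_repeats or more times."""
    lines = text.splitlines()
    # Count non-blank lines, ignoring leading and trailing whitespace.
    counts = Counter(line.strip() for line in lines if line.strip())

    def is_page_furniture(line: str) -> bool:
        key = line.strip()
        return bool(key) and len(key) <= max_len and counts[key] >= min_repeats

    return "\n".join(line for line in lines if not is_page_furniture(line))
```

Blank lines have to be excluded from the count, otherwise the paragraph gaps themselves would be the most "repeated" line in any document.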

Normalize smart quotes

Many PDFs use typographic punctuation: curly quotes, em dashes, ellipsis characters. The tool can normalize these back to ASCII equivalents (straight quotes, double hyphens, three periods). This is optional: if you want to preserve typographic punctuation, uncheck this option.
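Normalization of this kind is a straight character-for-character lookup. The Python sketch below covers only the characters named above. The function and table names are hypothetical, and the tool's real mapping may include more characters than these.

```python
# Typographic character -> ASCII replacement.
_ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote (also used as an apostrophe)
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2014": "--",   # em dash becomes a double hyphen
    "\u2026": "...",  # ellipsis character becomes three periods
})


def normalize_punctuation(text: str) -> str:
    """Replace curly quotes, em dashes and ellipsis characters with ASCII."""
    return text.translate(_ASCII_PUNCTUATION)
```

A translation table handles every replacement in a single pass over the text, and characters that are not in the table pass through unchanged.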

Common use cases

Quoting from academic papers — paste a paragraph from a PDF and get reflowed prose ready to embed
Migrating PDF documentation to a wiki or CMS — convert structured PDFs into editable text
Pulling quotes from books or journals — extract clean text for citations or reviews
Cleaning legal or financial documents — court rulings, contracts, and SEC filings are often PDF-only
Preparing content for AI tools — LLMs work much better with reflowed text than line-broken PDF copy
Building research summaries — paste multiple PDF excerpts and get usable prose

What this can't fix

This tool works on the text you've already copied. If the PDF is a scanned image (no extractable text), copy-paste won't produce any text and this tool can't help — you need OCR first.

Multi-column PDFs are particularly tricky. When you copy across columns, the text often interleaves badly — line 1 of column 1, line 1 of column 2, line 2 of column 1, line 2 of column 2, etc. This tool can't reconstruct the original column order from interleaved input. For multi-column PDFs, copy one column at a time.

Tables in PDFs usually paste as a jumble of cell values with the row and column structure mangled or lost. This tool isn't a table extractor. Use a dedicated PDF-to-CSV tool for that.

Iterative cleaning

For especially messy PDFs, run the output through this tool a second time. The first pass typically catches most issues. A second pass can clean up edge cases that the first pass missed or introduced (e.g., a paragraph that got joined incorrectly might be split by toggling options on the second pass).

You can also pipe the output through our whitespace cleaner for further spacing normalization, or our invisible character finder to catch any odd Unicode characters that PDFs sometimes preserve.

Privacy

The text you paste is processed by JavaScript running in your browser. Nothing is uploaded, and your content never leaves your machine.