Linux PDF Tools

Tools for Manipulating / Changing / Editing / Mending PDF Files

K2pdfopt: a PDF Reflow Tool

K2pdfopt optimizes PDF/DJVU files for mobile e-readers (e.g. the Kindle) and smartphones. It works well on multi-column PDF/DJVU files and can re-flow text even on scanned PDF files. It can also be used as a general PDF copying/cropping/re-sizing/OCR-ing tool. It can generate native or bitmapped PDF output, with an optional OCR layer.

pdfarranger: merge, split and re-arrange pages from PDF documents

PDF Arranger is a small application which allows one to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

PDF Arranger was formerly known as PDF-Shuffler.

pdfchain: graphical user interface for the PDF Tool Kit

The package includes features designed to handle PDF files in an easy way. It can merge and split documents, add backgrounds or stamps, and add attachments. There are some tools for extended needs, too.

pdfposter: scale and tile PDF images/pages to print on multiple pages

Pdfposter can be used to create a large poster by building it from multiple pages and/or printing it on large media. It expects as input a PDF file that normally prints on a single page. The output is again a PDF file, possibly containing multiple pages which together build the poster. The input page is scaled to obtain the desired size.

This is much like what poster does for PostScript files, but for PDF, since poster sometimes does not like files converted from PDF. :-) Indeed, pdfposter was inspired by poster. For more information, refer to the manpage or visit the project homepage.
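
As a minimal sketch of the tiling described above, the function below asks pdfposter for a poster two A4 sheets wide and two high. The function name and the file names in the usage line are placeholders, and the box syntax (2x2a4) should be checked against the manpage of your installed version.

```shell
# Hypothetical helper: make_poster INPUT.pdf OUTPUT.pdf
# -p sets the poster size, here as a 2x2 grid of A4 sheets;
# the input page is scaled up to fill that area.
make_poster() {
    pdfposter -p 2x2a4 "$1" "$2"
}
```

Usage: make_poster drawing.pdf poster.pdf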

pdfproctools: PDF Processing Tools

This package contains tools for PDF file processing.

SetPDFMetadata updates the metadata of a PDF file. In particular, it can be used to add outlines (bookmarks) to a document. Furthermore, it can set the document properties (e.g. author, title, keywords, creator, producer).

PDFEmbedFonts embeds all referenced fonts into a PDF file. Optionally, it can also linearize the PDF file for online publication ("fast web view", "optimized").

pdfresurrect: tool for extracting/scrubbing versioning data from PDF documents

PDFResurrect is a tool for analyzing and manipulating revisions to PDF documents (sometimes known as Adobe Acrobat files). The PDF format allows for previous changes to be retained in a revised version of the document, thereby keeping a running history of revisions to the document.

This tool extracts all previous revisions while also producing a summary of changes between revisions. It can also "scrub" or write data over the original instances of PDF objects that have been modified or deleted, in an effort to disguise information from previous versions that might not be intended for anyone else to read.

pdfsam: PDF Split and Merge

PDF Split and Merge is a very simple, easy to use, free, open source utility to split and merge pdf files. It has a simple graphical interface to let the user choose pdf files, split or merge them.

pdfsandwich: Tool to generate "sandwich" OCR pdf files

pdfsandwich generates "sandwich" OCR pdf files: pdf files which contain only images (no text) are processed by optical character recognition (OCR), and the recognized text is added to each page invisibly "behind" the images. pdfsandwich is a command line tool intended for OCR of scanned books or journals.

It is able to recognize the page layout even for multicolumn text.

Essentially, pdfsandwich is a wrapper script which calls the following binaries: convert, unpaper, gs (only for pdf resizing), hocr2pdf (for tesseract < 3.03), and tesseract.
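
A typical run might look like the sketch below. The function name is a placeholder; the language code is passed through to tesseract, and the result is assumed to be written next to the input as <name>_ocr.pdf unless an output file is given with -o.

```shell
# Hypothetical helper: sandwich_pdf INPUT.pdf [LANG]
# LANG is a tesseract language code and defaults to eng.
# For scan.pdf the expected output file is scan_ocr.pdf.
sandwich_pdf() {
    pdfsandwich -lang "${2:-eng}" "$1"
}
```

Usage: sandwich_pdf scan.pdf deu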

pdftk-java: pdftk port to java - a tool for manipulating PDF documents

If PDF is electronic paper, then PDFtk is an electronic stapler-remover, hole-punch, binder, secret-decoder-ring, and X-Ray-glasses. PDFtk is a simple tool for doing everyday things with PDF documents. Keep one in the top drawer of your desktop and use it to:

  • Merge PDF documents
  • Split PDF pages into a new document
  • Decrypt input as necessary (password required)
  • Encrypt output as desired
  • Fill PDF Forms with FDF Data and/or Flatten Forms
  • Apply a Background Watermark
  • Report on PDF metrics, including metadata and bookmarks
  • Update PDF Metadata
  • Attach Files to PDF Pages or the PDF Document
  • Unpack PDF Attachments
  • Burst a PDF document into single pages
  • Uncompress and re-compress page streams
  • Repair corrupted PDF (where possible)
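
Three of the operations listed above (merge, split off a page range, burst) can be sketched as small shell functions. The function and file names are placeholders; the pdftk operations used are cat, burst and output.

```shell
# Hypothetical helpers around pdftk; file names are placeholders.

# Merge several PDFs into one: merge_pdfs OUTPUT.pdf INPUT.pdf...
merge_pdfs() {
    out=$1
    shift
    pdftk "$@" cat output "$out"
}

# Copy a page range (e.g. 1-5) into a new file:
# extract_pages INPUT.pdf RANGE OUTPUT.pdf
extract_pages() {
    pdftk "$1" cat "$2" output "$3"
}

# Burst a document into one file per page
# (page_01.pdf, page_02.pdf, ...): burst_pdf INPUT.pdf
burst_pdf() {
    pdftk "$1" burst output page_%02d.pdf
}
```

Usage: merge_pdfs merged.pdf part1.pdf part2.pdf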

qpdf: tools for transforming and inspecting PDF files

QPDF is a program that can be used to linearize (web-optimize), encrypt (password-protect), decrypt, and inspect PDF files from the command-line. It does these and other structural, content-preserving transformations on PDF files, reading a PDF file as input and creating a new one as output. It also provides many useful capabilities to developers of PDF-producing software or for people who just want to look at the innards of a PDF file to learn more about how they work.

QPDF understands PDF files that use compressed object streams (supported by newer PDF applications) and can convert such files into those that can be read with older viewers. It can also be used for checking PDF files for structural errors, inspecting stream contents, or extracting objects from PDF files. QPDF is not PDF content creation or viewing software -- it does not have the capability to create PDF files from scratch or to display PDF files.
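
The transformations mentioned above map onto command-line options roughly as in this sketch. The function and file names are placeholders; qpdf always writes a new output file rather than modifying the input.

```shell
# Hypothetical helpers around qpdf; file names are placeholders.

# Linearize (web-optimize): linearize_pdf INPUT.pdf OUTPUT.pdf
linearize_pdf() {
    qpdf --linearize "$1" "$2"
}

# Check for structural errors; the exit status is non-zero
# when errors or warnings are found: check_pdf INPUT.pdf
check_pdf() {
    qpdf --check "$1"
}

# Rewrite without compressed object streams so that older
# viewers can read the file: legacy_pdf INPUT.pdf OUTPUT.pdf
legacy_pdf() {
    qpdf --object-streams=disable "$1" "$2"
}
```

Usage: check_pdf report.pdf && linearize_pdf report.pdf report-web.pdf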

poppler-utils

Poppler is a PDF rendering library based on Xpdf PDF viewer.

This package contains command line utilities (based on Poppler) for getting information about PDF documents, converting them to other formats, or manipulating them:

  • pdfdetach: lists or extracts embedded files (attachments)
  • pdffonts: font analyzer
  • pdfimages: image extractor
  • pdfinfo: document information
  • pdfseparate: page extraction tool
  • pdfsig: verifies digital signatures
  • pdftocairo: PDF to PNG/JPEG/PDF/PS/EPS/SVG converter using Cairo
  • pdftohtml: PDF to HTML converter
  • pdftoppm: PDF to PPM/PNG/JPEG image converter
  • pdftops: PDF to PostScript (PS) converter
  • pdftotext: text extraction
  • pdfunite: document merging tool
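
A few of the utilities listed above, sketched as shell functions. The function names, file names and the 150 DPI resolution are placeholders chosen for illustration.

```shell
# Hypothetical helpers around poppler-utils; file names are placeholders.

# Print document information (title, page count, page size, ...).
show_info() {
    pdfinfo "$1"
}

# Extract text while keeping the physical page layout:
# pdf_to_text INPUT.pdf OUTPUT.txt
pdf_to_text() {
    pdftotext -layout "$1" "$2"
}

# Write pages FIRST..LAST to page-N.pdf: split_pages INPUT.pdf FIRST LAST
split_pages() {
    pdfseparate -f "$2" -l "$3" "$1" page-%d.pdf
}

# Merge several PDFs: join_pdfs OUTPUT.pdf INPUT.pdf...
join_pdfs() {
    out=$1
    shift
    pdfunite "$@" "$out"
}

# Render every page to a PNG at 150 DPI: render_png INPUT.pdf PREFIX
render_png() {
    pdftoppm -png -r 150 "$1" "$2"
}
```

Usage: split_pages book.pdf 5 20 && join_pdfs excerpt.pdf page-*.pdf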

krop: tool to crop PDF files

Krop is a simple graphical tool to crop the pages of PDF files. A unique feature of krop is its ability to automatically split pages into subpages to fit the limited screen size of devices such as eReaders. This is particularly useful if your eReader does not support convenient scrolling.

Some settings that work:

  • select a rectangle to frame the page inside
  • don't check: Use GhostScript to optimize
  • do check: include pages without selections
  • which pages to include: your selection, such as 5-20
  • Selections apply to: all pages

pdfcrop

pdfcrop automatically detects and removes excess white space from PDF margins.

It is part of the TeX Live suite:

  • Debian/Ubuntu: sudo apt install texlive-extra-utils
  • Fedora: sudo dnf install texlive-pdftools

Usage to remove margins:

pdfcrop input.pdf output.pdf

Usage to set specific margin sizes (in bp, where 72 bp = 1 inch):

pdfcrop --margins "10 20 10 20" input.pdf output.pdf
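
Since margins are given in bp, a metric margin has to be converted first: 1 inch = 25.4 mm = 72 bp, so bp = mm × 72 / 25.4. The sketch below does that conversion with awk and rounds to a whole bp. Both function names are placeholders, and the same margin is applied to all four sides.

```shell
# Hypothetical helper: convert millimetres to bp, rounded to an integer.
mm_to_bp() {
    awk -v mm="$1" 'BEGIN { printf "%.0f\n", mm * 72 / 25.4 }'
}

# Hypothetical helper: crop_with_margin_mm INPUT.pdf OUTPUT.pdf MM
# Crops to the detected content and leaves MM millimetres on each side.
crop_with_margin_mm() {
    bp=$(mm_to_bp "$3")
    pdfcrop --margins "$bp $bp $bp $bp" "$1" "$2"
}
```

Usage: crop_with_margin_mm input.pdf output.pdf 10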

PDF Parsers / Analysers

pdfcrack: PDF files password cracker

PDFCrack is a simple tool for recovering passwords from pdf-documents.

pdfgrep: search in pdf files for strings matching a regular expression

Pdfgrep is a tool to search text in PDF files. It works similarly to `grep'.

Features:

  • search for regular expressions.
  • support for some important grep options, including:

    • filename output.
    • page number output.
    • optional case insensitivity.
    • count occurrences.
    • and the most important feature: color output!
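
The grep-style options listed above combine as in this sketch. The function names and the search pattern in the usage line are placeholders.

```shell
# Hypothetical helpers around pdfgrep; names are placeholders.

# Case-insensitive search that prints file name and page number
# for every match: search_pdf PATTERN FILE.pdf...
search_pdf() {
    pattern=$1
    shift
    pdfgrep -i -n -H "$pattern" "$@"
}

# Print only the number of matches per file:
# count_matches PATTERN FILE.pdf...
count_matches() {
    pattern=$1
    shift
    pdfgrep -c "$pattern" "$@"
}
```

Usage: search_pdf 'linear(ize|ise)d' manual.pdf notes.pdf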

pdfminer: PDF parser and analyser (encoding data)

PDFMiner is a tool for extracting information from PDF documents, which focuses entirely on getting and analyzing text data.

PDF Viewers

mupdf

MuPDF is a lightweight PDF viewer and toolkit written in portable C. It also reads XPS, OpenXPS and ePub documents.

The renderer in MuPDF is tailored for high quality anti-aliased graphics. It renders text with metrics and spacing accurate to within fractions of a pixel for the highest fidelity in reproducing the look of a printed page on screen.

ViewPDF: Portable Document Format (PDF) viewer for GNUstep

ViewPDF is an application to view and navigate in PDF documents.

Key Features:

  • Zoom
  • Keyboard shortcuts for fast navigation

xpdf: Motif-based PDF reader using the Poppler library

xpdf is a light-weight open source viewer for Portable Document Format (PDF) files (also called 'Adobe Acrobat' or 'Acrobat' files). This is just the xpdf viewer client; various command-line pdf tools are now provided via the poppler-utils package.

Debian's xpdf is a fork of Xpdf version 3, modified to use the Poppler PDF rendering library but keeping the Motif toolkit, and nowadays maintained as the xpopple project.

zathura-pdf-poppler:

Some of the features are:

  • bookmarking pages
  • printing the whole document or specific pages
  • following links
  • searching in the document
  • browsing the document index
  • SyncTex forward and backward synchronization

qpdfview: tabbed document viewer

qpdfview is a simple tabbed document viewer which uses the Poppler library for PDF rendering and CUPS for printing and provides a clear and simple Qt graphical user interface. Support for the DjVu and PostScript formats can be added via plugins.

Current features include:

  • Outline, properties and thumbnail panes
  • Scale, rotate and fit
  • Fullscreen and presentation views
  • Continuous and multi-page layouts
  • Search for text (PDF and DjVu only)
  • Configurable toolbars
  • SyncTeX support (PDF only)
  • Partial annotation support (PDF only, Poppler version 0.20.1 or newer)
  • Partial form support (PDF only)
  • Persistent per-file settings
  • Support for DjVu and PostScript documents via plugins

Libraries for Making or Editing PDF Files

Haru (libhpdf*.*): a C library for generating pdf files

Haru is a free, cross platform, open-source C library for generating PDF files. It supports the following features:

  • Generation of PDF files with lines, text and images.
  • Outlines, text and link annotations.
  • Document compression using deflate-decode.
  • Embedded PNG and Jpeg images.
  • Embedded Type1 and TrueType fonts.
  • Creation of encrypted PDF files.
  • Usage of various character sets (ISO8859-1~16, MSCP1250~8, KOI8-R).
  • Support for CJK fonts and encodings.

libqpdf*: runtime library for PDF transformation/inspection software

This package contains the qpdf runtime libraries required to run programs that link with the qpdf library.

libqt*pdf*: Qt n PDF library

The Qt PDF module contains classes and functions for rendering PDF documents.

ocrmypdf: add an OCR text layer to PDF files

OCRmyPDF generates a searchable PDF/A file from a regular PDF containing only images, allowing it to be searched.

It uses the Tesseract OCR engine and so supports all the languages that Tesseract does.

Some other main features:

  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a lossless operation without rendering vector information
  • Keeps file size about the same
  • If requested, deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Provides debug mode to enable easy verification of the OCR results
  • Processes pages in parallel when more than one CPU core is available
  • Battle-tested on thousands of PDFs, a test suite and continuous integration.
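
A run that uses the deskew, clean and parallel-processing features listed above might look like this sketch. The function name, the default language and the job count of 4 are placeholders; the --clean step relies on unpaper being installed.

```shell
# Hypothetical helper: ocr_pdf INPUT.pdf OUTPUT.pdf [LANG]
# LANG is a Tesseract language code and defaults to eng.
# --deskew straightens pages and --clean tidies them before OCR;
# --jobs sets how many pages are processed in parallel.
ocr_pdf() {
    ocrmypdf -l "${3:-eng}" --deskew --clean --jobs 4 "$1" "$2"
}
```

Usage: ocr_pdf scanned.pdf searchable.pdf deu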

PDF.js

PDF.js is a general-purpose, web standards-based platform for parsing and rendering PDFs.