AD Research Wiki

FrontPage

Revision 15 as of 2023-01-17 14:10:31

Contents

  1. Points of Interest
  2. HOWTOs
    1. Create a PDF with human-readable objects + content streams
    2. Create a PDF with specified crop box
    3. Create a PDF without page numbering

Points of Interest

The following list shows common steps that pdftotext executes to extract text from a PDF file, together with the function and line where the corresponding code is located. The stated locations refer to commit 065dca3 and may have changed since.

  • Opening and reading the PDF file
       PDFDoc::PDFDoc(), line 144ff

  • Parsing the PDF version number from the PDF file header
       PDFDoc::checkHeader(), line 350

  • Parsing startxref
       PDFDoc::getStartXRef(), line 1999ff

  • Parsing the xref table and the trailer dictionary
       XRef::readXRefTable(), line 535
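
To make the steps above more concrete, here is a minimal Python sketch of what parsing the header, startxref, and the xref table involves. It is not the pdftotext code: the function names are made up for illustration, and it assumes a PDF with a classic plain-text xref table (no xref streams, no incremental updates).

```python
import re


def parse_header_version(data: bytes) -> str:
    """Return the version from the '%PDF-x.y' header near the start of the file."""
    m = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    if m is None:
        raise ValueError("no %PDF- header found")
    return m.group(1).decode("ascii")


def parse_startxref(data: bytes) -> int:
    """Return the byte offset stated after the last 'startxref' keyword."""
    tail = data[-1024:]  # the keyword sits close to the end of the file
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)", tail[idx:])
    if m is None:
        raise ValueError("startxref is not followed by an offset")
    return int(m.group(1))


def parse_xref_table(data: bytes, offset: int) -> dict:
    """Parse one classic xref section starting at `offset`.

    Returns {object number: (byte offset, generation, in_use)}.
    """
    keyword = re.compile(rb"xref\s+")
    subsection = re.compile(rb"(\d+) (\d+)\s+")            # first object, count
    entry = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s+")     # offset, gen, n/f

    m = keyword.match(data, offset)
    if m is None:
        raise ValueError("no xref keyword at the given offset")
    pos = m.end()

    entries = {}
    while True:
        head = subsection.match(data, pos)
        if head is None:  # typically the 'trailer' keyword comes next
            break
        first, count = int(head.group(1)), int(head.group(2))
        pos = head.end()
        for i in range(count):
            e = entry.match(data, pos)
            if e is None:
                raise ValueError("malformed xref entry")
            entries[first + i] = (
                int(e.group(1)),
                int(e.group(2)),
                e.group(3) == b"n",
            )
            pos = e.end()
    return entries
```

Given the bytes of a whole file in data, parse_xref_table(data, parse_startxref(data)) would return the table that the trailer refers to, under the assumptions stated above.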

HOWTOs

Create a PDF with human-readable objects + content streams

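Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}). Both commands are pdfTeX primitives: the first disables object streams, the second disables stream compression.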

   \pdfobjcompresslevel=0
   \pdfcompresslevel=0

Create a PDF with specified crop box

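Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}). The four numbers are the x and y coordinates of the lower-left and the upper-right corner of the box, in PDF units (1/72 inch by default):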

   \pdfpageattr{
     /CropBox [50 50 100 100]
   }
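
One way to check the result is to search the generated file for the /CropBox entry. The following Python sketch does that with a regular expression. find_crop_boxes is a hypothetical helper, and it assumes that the page objects are stored uncompressed (see the previous HOWTO), since entries inside compressed object streams are not visible to a plain byte search.

```python
import re


def find_crop_boxes(data: bytes) -> list:
    """Return every /CropBox array found in the raw PDF bytes as a tuple of floats."""
    pattern = re.compile(rb"/CropBox\s*\[\s*([-+0-9.\s]+?)\s*\]")
    boxes = []
    for m in pattern.finditer(data):
        # The array holds four numbers: llx lly urx ury
        boxes.append(tuple(float(x) for x in m.group(1).split()))
    return boxes
```

A call could look like find_crop_boxes(open("paper.pdf", "rb").read()), where paper.pdf is a placeholder for your own file.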

Create a PDF without page numbering

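Put the following in the preamble of your TeX file to suppress the page numbers on all pages:

   \pagestyle{empty}

To suppress the page number on a single page only, put \thispagestyle{empty} on that page instead. Note that \maketitle sets the style of the title page back to plain, so a title page needs \thispagestyle{empty} right after \maketitle.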
