AD Research Wiki


Revision 28 as of 2023-01-17 16:09:11
AD Research Wiki:
  • Projects
  • pdftotext++

Contents

  1. Points of Interest
  2. HOWTOs
    1. Create a PDF with human-readable objects + content streams
    2. Create a PDF with specified crop box
    3. Create a PDF without page numbering

Points of Interest

The following list gives the main steps that pdftotext executes to extract text from a PDF file, together with the function (and line) where the corresponding code is located. The stated line numbers refer to commit 065dca3 and may have changed since then.

  • Opening and reading the PDF file
       PDFDoc::PDFDoc(), line 144ff

  • Parsing the PDF version number from the PDF file header
       PDFDoc::checkHeader(), line 350

  • Parsing startxref
       PDFDoc::getStartXRef(), line 1999ff

  • Parsing the xref table and the trailer dictionary
       XRef::readXRefTable(), line 535

  • Parsing the document catalog
       Catalog::Catalog(), line 76

  • Extracting the PDF subtype
       PDFDoc::extractPDFSubtype(), line 516

  • Parsing a page (i.e., a page object node from the page tree)
       Catalog::cachePageTree(), line 315

  • Parsing the commands of a page's content stream
       Gfx::display(), line 644

  • Executing the commands of a page's content stream
       Gfx::go(), line 650

    • Executing a BT command (begin text)
         Gfx::opBeginText(), line 3585

    • Executing a Tj command (showing text)
         Gfx::opShowText(), line 3709

    • Executing a TJ command (showing text with individual glyph positioning)
         Gfx::opShowSpaceText(), line 3779

    • Executing a Tf command (setting font)
         Gfx::opSetFont(), line 3610

    • Executing an ET command (end text)
         Gfx::opEndText(), line 3595
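
The first steps in the list above (reading the version from the file header and locating startxref) can be tried out independently of pdftotext. The following is a minimal Python sketch, not the pdftotext implementation: the function names parse_pdf_version and parse_startxref and the sample bytes are made up for illustration, and the 1024-byte search windows are an assumption.

```python
import re

SEARCH_WINDOW = 1024  # assumed size of the region searched at each end of the file


def parse_pdf_version(data: bytes) -> str:
    """Return the version string from the '%PDF-x.y' file header."""
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:SEARCH_WINDOW])
    if match is None:
        raise ValueError("no %PDF- header found")
    return match.group(1).decode("ascii")


def parse_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = data[-SEARCH_WINDOW:]
    pos = tail.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", tail[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


# Hand-built, simplified PDF skeleton (hypothetical sample data).
body = b"%PDF-1.5\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)  # the xref table starts right after the body
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)

version = parse_pdf_version(sample)
offset = parse_startxref(sample)
print(version, offset, sample[offset:offset + 4])
```

The offset returned by parse_startxref points at the xref keyword, which is where parsing of the xref table and the trailer dictionary would continue.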

HOWTOs

Create a PDF with human-readable objects + content streams

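Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}) and compile with pdfTeX, e.g. pdflatex. The first setting disables object streams, so every object is written out as plain text; the second disables the compression of streams, including the page content streams.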

   \pdfobjcompresslevel=0
   \pdfcompresslevel=0
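
To check whether the settings took effect, one can look for leftover compression markers in the generated file. The following Python snippet is a rough heuristic sketch, assuming that object streams show up as /ObjStm and compressed streams as /FlateDecode; the function name compression_markers and the sample bytes are hypothetical. Included images or PDFs may carry their own filters, so a non-zero count does not necessarily mean that the settings were ignored.

```python
def compression_markers(data: bytes) -> dict:
    """Count byte patterns that hint at compressed objects or streams."""
    return {
        "object_streams": data.count(b"/ObjStm"),
        "flate_streams": data.count(b"/FlateDecode"),
    }


# Hypothetical fragments standing in for real pdfTeX output.
compressed = b"5 0 obj\n<< /Type /ObjStm /Filter /FlateDecode /Length 42 >>\nstream\n"
readable = b"5 0 obj\n<< /Length 23 >>\nstream\nBT /F1 10 Tf (Hi) Tj ET\nendstream\n"

print(compression_markers(compressed))
print(compression_markers(readable))
```

For a real file, pass the bytes read from disk, e.g. compression_markers(open("paper.pdf", "rb").read()), where paper.pdf is a placeholder name.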

Create a PDF with specified crop box

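Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}). The four numbers are the x and y coordinates of the lower-left corner, followed by those of the upper-right corner, in PDF units (1/72 inch):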

   \pdfpageattr{
     /CropBox [50 50 100 100]
   }
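
To confirm that the crop box ended up in the page objects, the values can be read back from the PDF. The following Python snippet is a minimal sketch that only works on PDFs with human-readable objects (see the previous HOWTO); the function name find_crop_boxes and the sample page object are hypothetical.

```python
import re


def find_crop_boxes(data: bytes) -> list:
    """Return every /CropBox rectangle found in uncompressed PDF bytes."""
    boxes = []
    for raw in re.findall(rb"/CropBox\s*\[([^\]]*)\]", data):
        boxes.append(tuple(float(x) for x in raw.decode("ascii").split()))
    return boxes


# Hypothetical page object as it might appear in an uncompressed PDF.
page_object = b"<< /Type /Page /MediaBox [0 0 612 792] /CropBox [50 50 100 100] >>"
print(find_crop_boxes(page_object))
```

Since \pdfpageattr applies to every page, a multi-page document should yield one identical rectangle per page.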

Create a PDF without page numbering

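Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):

   \pagestyle{empty}

To remove the page number from a single page only (for example the title page, where \maketitle switches to the plain style), put \thispagestyle{empty} on that page instead.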
