AD Research Wiki

pdftotext++

Contents

  1. Points of Interest
  2. HOWTOs
    1. Create a PDF with human-readable objects + content streams
    2. Create a PDF with a specified crop box
    3. Create a PDF without page numbering

Points of Interest

The following list shows the main steps that pdftotext executes to extract text from a PDF file, together with the function (and line number) where each step is implemented. The stated locations refer to commit 065dca3 and may have changed since then.

  • Opening and reading the PDF file
       PDFDoc::PDFDoc(), line 144ff

  • Parsing the PDF version number from the PDF file header
       PDFDoc::checkHeader(), line 350

  • Parsing startxref
       PDFDoc::getStartXRef(), line 1999ff

  • Parsing the xref table and the trailer dictionary
       XRef::readXRefTable(), line 535

  • Parsing the document catalog
       Catalog::Catalog(), line 76

  • Extracting the PDF subtype
       PDFDoc::extractPDFSubtype(), line 516

  • Parsing a page (i.e., a page object node from the page tree)
       Catalog::cachePageTree(), line 315

  • Parsing the commands of a page's content stream
       Gfx::display(), line 644

  • Executing the commands of a page's content stream
       Gfx::go(), line 650

    • Executing a BT command (begin text)
         Gfx::opBeginText(), line 3585

    • Executing a Tj command (showing text)
         Gfx::opShowText(), line 3709

    • Executing a TJ command (showing text with individual glyph positioning)
         Gfx::opShowSpaceText(), line 3779

    • Executing a ' command (moving to next line and showing text)
         Gfx::opMoveShowText(), line 3727

    • Executing a " command (moving to next line, setting the word and character spacing, and showing text)
         Gfx::opMoveSetShowText(), line 3751

    • Executing a Tf command (setting font)
         Gfx::opSetFont(), line 3610

    • Executing an ET command (end text)
         Gfx::opEndText(), line 3595

  • Combining diacritical marks
       TextPage::addChar(), line 2699

  • Detecting words
       TextPage::addChar(), line 2648

  • Detecting text blocks
       TextPage::coalesce(), line 3017

  • Detecting text lines in each text block
       TextBlock::coalesce(), line 1725

  • Detecting tables
       TextPage::coalesce(), line 3487ff

  • Detecting reading order
       TextPage::coalesce(), line 3720ff

  • Detecting text flows
       TextPage::coalesce(), line 3753ff
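
To make the first two parsing steps in the list more concrete, here is a minimal sketch in Python. It is not poppler's code: the function names (parse_pdf_version, parse_startxref, build_sample_pdf) and the 1024-byte search window are assumptions chosen for illustration. It only shows what "parsing the version from the header" and "parsing startxref" amount to on the byte level.

```python
import re


def parse_pdf_version(data: bytes) -> str:
    """Return the version string from the '%PDF-x.y' file header."""
    # Assumption for this sketch: the header lies within the first 1024 bytes.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    if match is None:
        raise ValueError("no PDF header found")
    return match.group(1).decode("ascii")


def parse_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def build_sample_pdf() -> bytes:
    """Build a tiny, page-less PDF in memory (hypothetical test input)."""
    out = b"%PDF-1.5\n"
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # byte offset of each object, for the xref table
        out += obj
    xref_offset = len(out)
    out += b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return out


if __name__ == "__main__":
    sample = build_sample_pdf()
    offset = parse_startxref(sample)
    print(parse_pdf_version(sample), offset, sample[offset:offset + 4])
```

The offset returned by parse_startxref points at the xref keyword, which is where the next step in the list (parsing the xref table and the trailer dictionary) would start reading.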

HOWTOs

Create a PDF with human-readable objects + content streams

Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):

    \pdfobjcompresslevel=0
    \pdfcompresslevel=0
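
One way to check that the resulting PDF is indeed human-readable is to scan its raw bytes for stream filters and object streams. The following is a minimal sketch in Python under that assumption; the function names find_stream_filters and has_object_streams are hypothetical, and the regular expressions are a heuristic, not a PDF parser.

```python
import re


def find_stream_filters(data: bytes) -> set:
    """Collect the names of all /Filter entries found in the raw PDF bytes."""
    filters = set()
    # Handles '/Filter /FlateDecode' and '/Filter [/ASCII85Decode /FlateDecode]'.
    for match in re.finditer(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)", data):
        for name in re.findall(rb"/([A-Za-z0-9]+)", match.group(1)):
            filters.add(name.decode("ascii"))
    return filters


def has_object_streams(data: bytes) -> bool:
    """Return True if the file contains at least one object stream."""
    return re.search(rb"/Type\s*/ObjStm", data) is not None


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_pdf.py paper.pdf
    with open(sys.argv[1], "rb") as f:
        pdf_bytes = f.read()
    print("filters:", sorted(find_stream_filters(pdf_bytes)))
    print("object streams:", has_object_streams(pdf_bytes))
```

With both settings at 0, one would expect no FlateDecode filter and no object streams to be reported. Embedded images may still carry their own filters (for example DCTDecode for JPEG data).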

Create a PDF with a specified crop box

Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):

    \pdfpageattr{
      /CropBox [50 50 100 100]
    }
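
To verify that the crop box ended up in the page objects, the raw PDF bytes can be searched for /CropBox entries. The following is a minimal sketch in Python; the function name find_crop_boxes is hypothetical, and the approach assumes the page objects are stored uncompressed (see the HOWTO on human-readable objects), since entries inside compressed object streams are not visible to a plain byte search.

```python
import re

# A PDF number: optional sign, optional integer part, optional fraction.
_NUMBER = rb"[-+]?\d*\.?\d+"
_CROP_BOX = re.compile(
    rb"/CropBox\s*\[\s*(" + _NUMBER + rb")\s+(" + _NUMBER + rb")\s+("
    + _NUMBER + rb")\s+(" + _NUMBER + rb")\s*\]"
)


def find_crop_boxes(data: bytes) -> list:
    """Return every /CropBox found as a tuple (llx, lly, urx, ury) in points."""
    boxes = []
    for match in _CROP_BOX.finditer(data):
        boxes.append(tuple(float(v.decode("ascii")) for v in match.groups()))
    return boxes


if __name__ == "__main__":
    page = b"<< /Type /Page /MediaBox [0 0 612 792] /CropBox [50 50 100 100] >>"
    for llx, lly, urx, ury in find_crop_boxes(page):
        print("crop box:", (llx, lly, urx, ury), "size:", urx - llx, "x", ury - lly)
```

The four numbers are the lower-left and upper-right corners in PDF points, so [50 50 100 100] describes a region of 50 x 50 points.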

Create a PDF without page numbering

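Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):

    \pagestyle{empty}

This removes the page numbers from all pages. Note that \thispagestyle{empty} affects only the page on which it is issued, so use it in the document body to suppress the number on a single page (for example, directly after \maketitle, which resets the title page to the plain style in the standard classes).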
