<<TableOfContents(3)>>

= Points of Interest =

The following list shows the main steps that ''pdftotext'' performs to extract text from a PDF file, together with the location of the corresponding code. The locations refer to commit [[https://github.com/freedesktop/poppler/tree/065dca3816db3979dfacdc2f8592abed2ff6859a|065dca3]] and may have changed since then.

 * '''Opening and reading the PDF file''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L144|PDFDoc::PDFDoc(), line 144ff]]
 * '''Parsing the PDF version number from the PDF file header''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L350|PDFDoc::checkHeader(), line 350]]
 * '''Parsing startxref''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L1999|PDFDoc::getStartXRef(), line 1999ff]]
 * '''Parsing the xref table and the trailer dictionary''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/XRef.cc#L535|XRef::readXRefTable(), line 535]]
 * '''Parsing the document catalog''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Catalog.cc#L76|Catalog::Catalog(), line 76]]
 * '''Extracting the PDF subtype''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L516|PDFDoc::extractPDFSubtype(), line 516]]
 * '''Parsing a page (i.e., a page object node from the page tree)''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Catalog.cc#L315|Catalog::cachePageTree(), line 315]]
 * '''Parsing the commands of a page's content stream''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L644|Gfx::display(), line 644]]
 * '''Executing the commands of a page's content stream''' <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L650|Gfx::go(), line 650]]
  * Executing a `BT` command (begin text) <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3585|Gfx::opBeginText(), line 3585]]
  * Executing a `Tj` command (show text) <<BR>> &nbsp;&nbsp;&nbsp;[[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3709|Gfx::opShowText(), line 3709]]


= HOWTOs =

== Create a PDF with human-readable objects and content streams ==

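Put the following in the preamble of your TeX file (between `\documentclass{}` and `\begin{document}`). The first line keeps objects out of compressed object streams, and the second line disables the compression of stream data such as page content streams: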

{{{#!highlight tex
\pdfobjcompresslevel=0 
\pdfcompresslevel=0
}}}
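
For reference, a minimal complete document using these settings might look as follows. This is only a sketch: it assumes compilation with pdfLaTeX (both primitives are pdfTeX-specific), and the document class and body text are placeholders.

{{{#!highlight tex
\documentclass{article}

% Do not pack objects into compressed object streams.
\pdfobjcompresslevel=0
% Do not compress stream data, so content streams stay readable.
\pdfcompresslevel=0

\begin{document}
Hello World
\end{document}
}}}

Opening the resulting PDF in a text editor should then show the page's content stream as plain operators between `BT` and `ET`, such as `Tj` or `TJ`.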

== Create a PDF with specified crop box ==

Put the following in the preamble of your TeX file (between `\documentclass{}` and `\begin{document}`):

{{{#!highlight tex
\pdfpageattr{
  /CropBox [50 50 100 100]
}
}}}
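
The four numbers of `/CropBox` are the lower-left and upper-right corners of the visible region, `[llx lly urx ury]`, given in default user space units (1/72 inch). The sketch below assumes an A4 page (roughly 595 by 842 units) compiled with pdfLaTeX, with the origin at the lower-left corner of the page, and hides 72 units (one inch) on every side. The concrete values are illustrative only.

{{{#!highlight tex
\documentclass[a4paper]{article}

% A4 is roughly 595 x 842 units; keep a 1-inch (72 unit) border out of view.
% llx = 72, lly = 72, urx = 595 - 72 = 523, ury = 842 - 72 = 770
\pdfpageattr{
  /CropBox [72 72 523 770]
}

\begin{document}
Hello World
\end{document}
}}}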

== Create a PDF without page numbering ==

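Put the following in the preamble of your TeX file (between `\documentclass{}` and `\begin{document}`). It removes headers and footers, including the page number, from all pages:

{{{#!highlight tex
\pagestyle{empty}
}}}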
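
Note that `\pagestyle{empty}` applies to every page, whereas `\thispagestyle{empty}` affects only the single page on which it takes effect. A two-page sketch illustrating this, assuming the standard `article` class and placeholder body text:

{{{#!highlight tex
\documentclass{article}

% Suppress headers and footers (and thus page numbers) on all pages.
\pagestyle{empty}

\begin{document}
First page, no page number.
\newpage
Second page, no page number either.
% Without \pagestyle{empty} above, a \thispagestyle{empty} placed here
% would hide the number on this page only.
\end{document}
}}}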