Points of Interest

The following is a list of the common steps that pdftotext executes to extract text from a PDF file, along with where the corresponding code is located. Note that the stated locations refer to poppler commit 065dca3 (065dca3816db3979dfacdc2f8592abed2ff6859a) and may have changed since.

 * Opening and reading the PDF file
   poppler/PDFDoc.cc: PDFDoc::PDFDoc(), line 144ff
 * Parsing the PDF version number from the PDF file header
   poppler/PDFDoc.cc: PDFDoc::checkHeader(), line 350
 * Parsing startxref
   poppler/PDFDoc.cc: PDFDoc::getStartXRef(), line 1999ff
 * Parsing the xref table and the trailer dictionary
   poppler/XRef.cc: XRef::readXRefTable(), line 535
 * Parsing the document catalog
   poppler/Catalog.cc: Catalog::Catalog(), line 76
 * Extracting the PDF subtype
   poppler/PDFDoc.cc: PDFDoc::extractPDFSubtype(), line 516
 * Parsing a page (i.e., a page object node from the page tree)
   poppler/Catalog.cc: Catalog::cachePageTree(), line 315
 * Parsing the commands of a page's content stream
   poppler/Gfx.cc: Gfx::display(), line 644
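
To get a feeling for what the first steps in the list operate on, the following is a minimal Python sketch that reads the version number from the file header and the offset after the startxref keyword. It is not poppler's implementation: the function names parse_header_version and parse_startxref, the 1024-byte search window for the header, and the hand-built sample file are all assumptions made for illustration.

```python
import re


def parse_header_version(data: bytes) -> str:
    """Return the version from a '%PDF-x.y' header, e.g. '1.5'."""
    # Assumption: only the first 1024 bytes are searched for the header.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    if match is None:
        raise ValueError("no %PDF- header found")
    return match.group(1).decode("ascii")


def parse_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


# Hypothetical, hand-built sample: one object, a classic xref table, a trailer.
body = b"%PDF-1.5\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)  # the xref table starts right after the body
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)

if __name__ == "__main__":
    print(parse_header_version(sample))
    offset = parse_startxref(sample)
    # The bytes at the reported offset should be the 'xref' keyword.
    print(offset, sample[offset:offset + 4])
```

The offset returned by parse_startxref is where a reader would continue with the xref table and the trailer dictionary, which is the next step in the list above.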

HOWTOs

Create a PDF with human-readable objects and content streams

Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}). Both are pdfTeX primitives, so compile the file with pdflatex:

   \pdfobjcompresslevel=0
   \pdfcompresslevel=0
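
To check whether the two settings took effect, a quick byte-level heuristic can help. The sketch below counts two markers in the raw file: /FlateDecode (a compressed stream) and /ObjStm (an object stream). The function name compression_markers is an assumption, and a plain byte search is not a real PDF parse, so treat the result as a hint only. Embedded images may still carry their own filters.

```python
def compression_markers(data: bytes) -> dict:
    """Count byte-level hints of compression in a PDF file.

    Heuristic only: the names are searched as raw bytes, not parsed.
    """
    return {
        "flate_streams": data.count(b"/FlateDecode"),
        "object_streams": data.count(b"/ObjStm"),
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_compression.py file.pdf
    with open(sys.argv[1], "rb") as handle:
        print(compression_markers(handle.read()))
```

If both counts are zero, the objects and content streams should be readable in a text editor.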

Create a PDF with specified crop box

Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):

   \pdfpageattr{
     /CropBox [50 50 100 100]
   }
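
The four numbers of a /CropBox are the lower-left and upper-right corners of the visible region, so [50 50 100 100] yields a region of 50 x 50 units. To confirm that the entry ended up in the page objects, the following sketch searches the raw bytes of an uncompressed PDF for /CropBox arrays. The function name find_crop_boxes is an assumption, and the regular expression only works when the page objects are not stored in compressed object streams.

```python
import re


def find_crop_boxes(data: bytes) -> list:
    """Return every '/CropBox [a b c d]' array found as a tuple of four floats."""
    boxes = []
    for match in re.finditer(rb"/CropBox\s*\[([^\]]*)\]", data):
        try:
            numbers = tuple(float(tok.decode("ascii")) for tok in match.group(1).split())
        except ValueError:
            continue  # skip arrays with non-numeric entries (e.g. indirect references)
        if len(numbers) == 4:
            boxes.append(numbers)
    return boxes


if __name__ == "__main__":
    page = b"<< /Type /Page /CropBox [50 50 100 100] /Parent 3 0 R >>"
    for llx, lly, urx, ury in find_crop_boxes(page):
        print("visible region:", urx - llx, "x", ury - lly)
```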

Create a PDF without page numbering

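Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):

   \pagestyle{empty}

Note that \thispagestyle{empty} affects only a single page. Use it inside the document body if only one page should be left unnumbered.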
