Points of Interest
The following list shows the common steps that pdftotext executes to extract text from a PDF file, and where the corresponding code is located in the Poppler source. Note that the stated locations refer to commit 065dca3816db3979dfacdc2f8592abed2ff6859a of https://github.com/freedesktop/poppler and may have changed since.
 * Opening and reading the PDF file: PDFDoc::PDFDoc(), PDFDoc.cc, line 144ff
 * Parsing the PDF version number from the PDF file header: PDFDoc::checkHeader(), PDFDoc.cc, line 350
 * Parsing startxref: PDFDoc::getStartXRef(), PDFDoc.cc, line 1999ff
 * Parsing the xref table and the trailer dictionary: XRef::readXRefTable(), XRef.cc, line 535
 * Parsing the document catalog: Catalog::Catalog(), Catalog.cc, line 76
 * Extracting the PDF subtype: PDFDoc::extractPDFSubtype(), PDFDoc.cc, line 516
 * Parsing a page (i.e., a page object node from the page tree): Catalog::cachePageTree(), Catalog.cc, line 315
 * Parsing the commands of a page's content stream: Gfx::display(), Gfx.cc, line 644
 * Executing the commands of a page's content stream: Gfx::go(), Gfx.cc, line 650
   * Executing a `BT` command (begin text): Gfx::opBeginText(), Gfx.cc, line 3585
   * Executing a `Tj` command (showing text): Gfx::opShowText(), Gfx.cc, line 3709
   * Executing a `TJ` command (showing text with individual glyph positioning): Gfx::opShowSpaceText(), Gfx.cc, line 3779
   * Executing a `'` command (moving to next line and showing text): Gfx::opMoveShowText(), Gfx.cc, line 3727
   * Executing a `Tf` command (setting font): Gfx::opSetFont(), Gfx.cc, line 3610
   * Executing an `ET` command (end text): Gfx::opEndText(), Gfx.cc, line 3595
HOWTOs
Create a PDF with human-readable objects and content streams
Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):
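A minimal sketch of such a preamble, assuming the document is compiled with pdfLaTeX (both commands are pdfTeX primitives and are not available under other engines). Setting both compression levels to zero keeps the page content streams and the individual PDF objects uncompressed, so they can be inspected in a text editor; embedded font programs and images may still appear as binary data.

```latex
% Assumption: the document is compiled with pdflatex.
\pdfcompresslevel=0     % write streams (e.g. page content streams) uncompressed
\pdfobjcompresslevel=0  % do not pack objects into compressed object streams
```

With these settings, operators such as `BT`, `Tf`, `Tj` and `ET` should be visible as plain text inside the content streams of the resulting file.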
Create a PDF with specified crop box
Put the following in the preamble of your TeX file (between \documentclass{} and \begin{document}):
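A minimal sketch, again assuming pdfLaTeX: the pdfTeX token register `\pdfpageattr` adds entries to the dictionary of every page object, which can be used to set a `/CropBox`. The coordinates below are illustrative values only, not taken from this page; adjust them to the box you need.

```latex
% Assumption: the document is compiled with pdflatex.
% /CropBox is given as [llx lly urx ury] in PDF units (1/72 inch),
% i.e. lower-left and upper-right corners of the visible region.
% The numbers are placeholder values chosen for illustration.
\pdfpageattr{/CropBox [50 50 400 600]}
```

Packages that also write to `\pdfpageattr` may overwrite or extend this setting, so it is worth checking the page dictionary in the generated PDF.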
Create a PDF without page numbering
Put the following in the preamble of your TeX file:
\thispagestyle{empty}  % affects the current page only; use \pagestyle{empty} to suppress page numbers on every page