= Points of Interest =
The following is a list of common steps executed by ''pdftotext'' to extract text from a PDF file, together with the file in which the corresponding code is located. Note that the stated locations refer to commit [[https://github.com/freedesktop/poppler/tree/065dca3816db3979dfacdc2f8592abed2ff6859a|065dca3]] and may have changed by now.

 * '''Opening and reading the PDF file''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L144|PDFDoc::PDFDoc(), line 144ff]]
 * '''Parsing the PDF version number from the PDF file header''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L350|PDFDoc::checkHeader(), line 350]]
 * '''Parsing startxref''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L1999|PDFDoc::getStartXRef(), line 1999ff]]
 * '''Parsing the xref table and the trailer dictionary''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/XRef.cc#L535|XRef::readXRefTable(), line 535]]
 * '''Parsing the document catalog''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Catalog.cc#L76|Catalog::Catalog(), line 76]]
 * '''Extracting the PDF subtype''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/PDFDoc.cc#L516|PDFDoc::extractPDFSubtype(), line 516]]
 * '''Parsing a page (i.e., a page object node from the page tree)''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Catalog.cc#L315|Catalog::cachePageTree(), line 315]]
 * '''Parsing the commands of a page's content stream''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L644|Gfx::display(), line 644]]
 * '''Executing the commands of a page's content stream''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L650|Gfx::go(), line 650]]
   * Executing a `BT` command (begin text) <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3585|Gfx::opBeginText(), line 3585]]
   * Executing a `Tj` command (showing text) <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3709|Gfx::opShowText(), line 3709]]
   * Executing a `TJ` command (showing text with individual glyph positioning) <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3779|Gfx::opShowSpaceText(), line 3779]]
   * Executing a `'` command (moving to the next line and showing text) <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3727|Gfx::opMoveShowText(), line 3727]]
   * Executing a `"` command (moving to the next line, setting the word and character spacing, and showing text) <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3751|Gfx::opMoveSetShowText(), line 3751]]
   * Executing a `Tf` command (setting font) <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3610|Gfx::opSetFont(), line 3610]]
   * Executing an `ET` command (end text) <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/Gfx.cc#L3595|Gfx::opEndText(), line 3595]]
 * '''Combining diacritical marks''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/TextOutputDev.cc#L2699|TextPage::addChar(), line 2699]]
 * '''Detecting words''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/TextOutputDev.cc#L2648|TextPage::addChar(), line 2648]]
 * '''Detecting text blocks''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/TextOutputDev.cc#L3017|TextPage::coalesce(), line 3017]]
 * '''Detecting text lines in each text block''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/TextOutputDev.cc#L1725|TextBlock::coalesce(), line 1725]]
 * '''Detecting tables''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/TextOutputDev.cc#L3487|TextPage::coalesce(), line 3487ff]]
 * '''Detecting reading order''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/TextOutputDev.cc#L3720|TextPage::coalesce(), line 3720ff]]
 * '''Detecting text flows''' <<BR>> [[https://github.com/freedesktop/poppler/blob/065dca3816db3979dfacdc2f8592abed2ff6859a/poppler/TextOutputDev.cc#L3753|TextPage::coalesce(), line 3753ff]]

= HOWTOs =

== Create a PDF with human-readable objects and content streams ==
Put the following in the preamble of your TeX file (between `\documentclass{}` and `\begin{document}`):

{{{#!highlight tex
\pdfobjcompresslevel=0
\pdfcompresslevel=0
}}}

== Create a PDF with a specified crop box ==
Put the following in the preamble of your TeX file (between `\documentclass{}` and `\begin{document}`):

{{{#!highlight tex
\pdfpageattr{ /CropBox [50 50 100 100] }
}}}

== Create a PDF without page numbering ==
Put the following in the preamble of your TeX file (between `\documentclass{}` and `\begin{document}`):

{{{#!highlight tex
\thispagestyle{empty}
}}}

Note that `\thispagestyle{empty}` affects only the current page (the first page when placed in the preamble). To suppress page numbers on all pages, use `\pagestyle{empty}` instead.
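
== Minimal test document combining the settings above ==
The snippets above can be combined into a single file for experimenting with ''pdftotext''. The following is only a sketch: it assumes the file is compiled with `pdflatex` (the `\pdf...` commands are pdfTeX primitives and may need adaptation for other engines), the document class and body text are placeholders, and the crop box values are simply the ones used above. It uses `\pagestyle{empty}` rather than `\thispagestyle{empty}` so that page numbers are suppressed on every page, not only the first.

{{{#!highlight tex
\documentclass{article}

% Write PDF objects and content streams uncompressed, so that the
% resulting file can be inspected in a text editor.
\pdfobjcompresslevel=0
\pdfcompresslevel=0

% Add a /CropBox entry to every page object (values in PDF points,
% given as [llx lly urx ury]; illustrative only).
\pdfpageattr{ /CropBox [50 50 100 100] }

% Suppress page numbers on all pages.
\pagestyle{empty}

\begin{document}
Hello, pdftotext!  % placeholder body text
\end{document}
}}}

Opening the resulting PDF in a text editor should show the `/CropBox` entry in the page object and the text-showing operators (`BT`, `Tf`, `TJ` or `Tj`, `ET`) in the page's content stream in plain text.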