Reading a Specific Page from a PDF File in Python
Document processing is one of the most common use cases for the Python programming language. Python can handle many kinds of files, such as database files, media files, and encrypted files, to name a few. This article shows how to read a specific page from a PDF (Portable Document Format) file in Python.
Method 1: Using the PyMuPDF Library to Read a Page in Python
This method uses the PyMuPDF library to process the PDF and PIL (Python Imaging Library, distributed as the Pillow package) to display the result. To install both libraries, run the following command in the operating system shell:
pip install pymupdf pillow
Note: The PyMuPDF library is imported under the name fitz, using the following command.
import fitz
Reading a page from a PDF requires opening the file and then displaying the contents of only one of its pages. Rendering a single page effectively produces an image of that page, so in this method the page is read and displayed as an image.
The following example demonstrates the above process:
import fitz
from PIL import Image
# Path of the PDF file
input_file = r"test.pdf"
# Opening the PDF file and creating a handle for it
file_handle = fitz.open(input_file)
# Loading a single page by its index
# The index within the square brackets is zero-based, so 0 is the first page
page = file_handle[0]
# Obtaining the pixelmap of the page
page_img = page.get_pixmap()
# Saving the pixelmap into a png image file
page_img.save('PDF_page.png')
# Reading the PNG image file using pillow
img = Image.open('PDF_page.png')
# Displaying the png image file using an image viewer
img.show()
Explanation:
First, the PDF file is opened and the resulting document object is stored. Then the first page of the PDF (with index 0) is loaded by indexing the document like a list. The pixel map of this page (an array of pixels) is obtained using the get_pixmap function and stored in a variable. This pixmap is then saved as a PNG image file, which is opened using the open function of PIL's Image module. Finally, the image is displayed using the show function.
Note: The first open function (fitz.open) opens the PDF file, and the second (Image.open) opens the PNG image file. The two functions belong to different libraries and serve different purposes.
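The steps above can be wrapped in a small helper. The following is a minimal sketch, not part of the original example: the function names to_zero_based and render_page are hypothetical, and it assumes PyMuPDF and Pillow are installed. It accepts a one-based page number, checks it against the document's page count, and builds the Pillow image directly from the pixmap's samples instead of going through a temporary PNG file.

```python
def to_zero_based(page_number, page_count):
    """Convert a one-based page number to a zero-based index, with a range check."""
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page {page_number} is out of range (document has {page_count} pages)"
        )
    return page_number - 1


def render_page(pdf_path, page_number):
    """Render one page of a PDF as a Pillow image (hypothetical helper)."""
    # Imported here so the index helper above works without these libraries
    import fitz
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        index = to_zero_based(page_number, doc.page_count)
        # alpha=False keeps the pixmap at three bytes (RGB) per pixel
        pix = doc[index].get_pixmap(alpha=False)
        # Build the image in memory from the raw pixel samples
        return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
```

For example, render_page("test.pdf", 1).show() would display the first page of test.pdf, assuming that file exists in the working directory.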
Method 2: Using the PyPDF2 Library to Read a Page in Python
The second method uses the PyPDF2 library, which can be installed by running the following command:
pip install PyPDF2
PyPDF2 can read, write, and create PDF files. For the task at hand, its text extraction function is used to get the text from one page of the PDF file and display it. Unlike Method 1, this produces the text of the page rather than an image. The code looks like this:
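# Importing the required module
import PyPDF2
# Path of the PDF file
input_file = r"test.pdf"
# Zero-based index of the page to read (4 is the fifth page)
page = 4
# Creating a pdf file object
pdfFileObj = open(input_file, 'rb')
# Creating a pdf reader object
# (PyPDF2 releases before 3.0 named this class PdfFileReader)
pdfReader = PyPDF2.PdfReader(pdfFileObj)
# Creating a page object (formerly pdfReader.getPage(page))
pageObj = pdfReader.pages[page]
# Extracting text from the page (formerly pageObj.extractText())
data = pageObj.extract_text()
# Closing the pdf file object
pdfFileObj.close()
print(data)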
Output:
He started this Journey with just one
thought- every geek should have
access to a never ending range of
academic resources and with a lot
of hardwork and determination,
GeeksforGeeks was born.
Through this platform, he has
successfully enriched the minds of
students with knowledge which has
led to a boost in their careers. But
most importantly, GeeksforGeeks
will always help students stay in
touch with their Geeky side!
EXPERT ADVICE
CEO and Founder of
GeeksforGeeks
I understand that many
students who come to us are
either fans of the sciences or
have been pushed into this
feild by their parents.
And I just want you to
know that no matter
where life takes you, we
at GeeksforGeeks hope
to have made this
journey easier for
you.Mr. Sandeep Jain
3
Explanation:
First, the path to the input PDF file and the zero-based page index are defined in separate variables. The PDF file is then opened in binary mode and its file object is stored in a variable. This file object is passed to the PDF reader class (PdfFileReader in PyPDF2 releases before 3.0, PdfReader in later ones), which creates a PDF reader object from it. The page at the index given by the page variable is then retrieved and stored as a page object. The text is extracted from that page, and the file object is closed. Finally, the extracted text is printed.
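The same steps can also be written so that the file is closed automatically and an out-of-range index produces a clear error. This is a minimal sketch under stated assumptions: the helper names extract_page_text and read_pdf_page are hypothetical, and it assumes a PyPDF2 release that provides PdfReader, pages, and extract_text (releases 3.0 and later accept only these names).

```python
def extract_page_text(reader, page_index):
    """Return the text of one page from a PdfReader-like object.

    page_index is zero-based, as in the example above.
    """
    page_count = len(reader.pages)
    if not 0 <= page_index < page_count:
        raise IndexError(
            f"page index {page_index} is out of range (0 to {page_count - 1})"
        )
    # extract_text() may return an empty result for pages without a text layer
    return reader.pages[page_index].extract_text() or ""


def read_pdf_page(pdf_path, page_index):
    """Open a PDF, extract the text of one page, and close the file."""
    # Imported here so extract_page_text can be used without PyPDF2
    from PyPDF2 import PdfReader

    # The with statement closes the file even if extraction fails
    with open(pdf_path, "rb") as pdf_file:
        return extract_page_text(PdfReader(pdf_file), page_index)
```

For example, print(read_pdf_page("test.pdf", 4)) would print the text of the fifth page of test.pdf. Keeping the range check in a separate function means it can be exercised with any object that exposes a pages sequence, without needing a real PDF file.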