Reading specific page from PDF file in Python


Document processing is one of the most common use cases for the Python programming language. Python can handle many kinds of files, such as database files, media files, and encrypted files, to name a few. This article shows how to read a specific page from a PDF (Portable Document Format) file in Python.

Method 1: Using the PyMuPDF Library to Read a Page in Python

This method uses the PyMuPDF library for PDF processing, together with PIL (the Python Imaging Library, available today as the Pillow package) to display the result. To install the PyMuPDF library, run the following command in the operating system shell:

pip install pymupdf

Note. The PyMuPDF library is imported under the name fitz, using the following statement.

import fitz

Reading a page from a PDF requires loading the file and then displaying the contents of only one of its pages. A single rendered page is essentially an image, so the page will be read from the PDF file and displayed as an image.


The following example demonstrates the above process:


import fitz
from PIL import Image

# Path of the PDF file
input_file = r"test.pdf"

# Opening the PDF file and creating a handle for it
file_handle = fitz.open(input_file)

# Pages are indexed from zero, so the index within the
# square brackets is the page number minus one (0 = first page)
page = file_handle[0]

# Obtaining the pixel map of the page
page_img = page.get_pixmap()

# Saving the pixel map into a PNG image file
page_img.save('PDF_page.png')

# Reading the PNG image file using Pillow
img = Image.open('PDF_page.png')

# Displaying the PNG image file using an image viewer
img.show()

Explanation:

First, the PDF file is opened and the resulting document object is stored. Then the first page of the PDF (index 0) is loaded using list-style indexing. The pixel map of this page (an array of pixels) is obtained with the get_pixmap method and stored in a variable. This pixel map is saved as a PNG image file, which is then opened with the open function of PIL's Image module. Finally, the image is displayed using the show method.


Note. The first open function (fitz.open) opens the PDF file, and the second (Image.open) opens the PNG image file. The two functions belong to different libraries and serve different purposes.
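The script above hard-codes the page index and assumes that the page exists. A minimal sketch of a reusable variant follows, assuming PyMuPDF is installed. The helper names to_index and save_page_as_png, the 1-based page_number argument, and the zoom factor are illustrative assumptions, not part of the original example. The sketch converts a human-style page number to the zero-based index PyMuPDF expects, rejects out-of-range values, and renders at a higher resolution through a scaling matrix.

```python
def to_index(page_number, page_count):
    """Convert a 1-based page number to a 0-based index, validating the range."""
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page_number must be between 1 and {page_count}, got {page_number}"
        )
    return page_number - 1


def save_page_as_png(pdf_path, page_number, out_path, zoom=2.0):
    """Render one page of a PDF to a PNG file and return the output path."""
    import fitz  # PyMuPDF; imported here so to_index works without it

    doc = fitz.open(pdf_path)
    try:
        # Validate the requested page against the document's page count
        index = to_index(page_number, doc.page_count)
        page = doc[index]
        # Matrix(zoom, zoom) scales both axes; 2.0 doubles the default resolution
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(out_path)
    finally:
        # Close the document even if validation or rendering fails
        doc.close()
    return out_path
```

For example, save_page_as_png("test.pdf", 1, "PDF_page.png") would write the first page, while a page number of 0 or one past the last page raises ValueError before anything is rendered.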


Method 2: Reading a Specific Page from a PDF Using PyPDF2

The second example uses the PyPDF2 library, which can be installed by running the following command:


pip install PyPDF2

PyPDF2 can read, write, and create PDF files. For the task at hand, its text extraction method is used to get the text from one page of the PDF file and display it. The code looks like this:

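# Importing the required module
import PyPDF2

input_file = r"test.pdf"

# Zero-based index of the page to read (4 is the fifth page)
page = 4

# Creating a PDF file object
pdfFileObj = open(input_file, 'rb')

# Creating a PDF reader object
# (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# Creating a page object (replaces the removed getPage method)
pageObj = pdfReader.pages[page]

# Extracting text from the page (replaces the removed extractText method)
data = pageObj.extract_text()

# Closing the PDF file object
pdfFileObj.close()

print(data)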

Output:

He started this Journey with just one
thought- every geek should have
access to a never ending range of
academic resources and with a lot
of hardwork and determination,
GeeksforGeeks was born.
Through this platform, he has
successfully enriched the minds of
students with knowledge which has
led to a boost in their careers. But
most importantly, GeeksforGeeks
will always help students stay in
touch with their Geeky side!
EXPERT ADVICE
CEO and Founder of
GeeksforGeeks
                  I understand that many
students who come to us are
either fans of the sciences or
have been pushed into this
feild by their parents.
And I just want you to
know that no matter
where life takes you, we
at GeeksforGeeks hope
to have made this
journey easier for
you.Mr. Sandeep Jain
3

Explanation:

First, the path to the input PDF file and the page index are defined in separate variables. The PDF file is then opened in binary mode and its file object is stored in a variable. This file object is passed to the PdfReader class (PdfFileReader in older PyPDF2 releases), which creates a PDF reader object. The page at the index held in the page variable is then retrieved and stored in a variable. Because page indices start at zero, page = 4 selects the fifth page. The text is extracted from that page and the file object is closed. Finally, the extracted text is displayed.
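The same steps can also be written with a with block, so that the file is closed even if extraction fails. The sketch below assumes a PyPDF2 release that provides PdfReader, reader.pages, and extract_text(). The function names page_text and read_page_text are illustrative and not part of the original example. Splitting the work in two keeps the bounds check separate from the file handling.

```python
def page_text(reader, index):
    """Return the text of the page at a 0-based index of a PdfReader-like object."""
    page_count = len(reader.pages)
    if not 0 <= index < page_count:
        raise IndexError(
            f"page index {index} is out of range for a {page_count}-page document"
        )
    return reader.pages[index].extract_text()


def read_page_text(pdf_path, index):
    """Open a PDF, extract the text of one page, and close the file."""
    import PyPDF2  # imported here so page_text can be used without PyPDF2

    # The with block closes the file even if reading or extraction raises
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return page_text(reader, index)
```

Calling read_page_text("test.pdf", 4) would return the text of the fifth page, matching page = 4 in the example above.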
