Extracting Text From Images Using PyTesseract
As a developer, you might want to extract textual information from an image. Using Python, we can write a program that extracts such text from any given image. Python is one of the most popular languages among developers, and its human-readable syntax makes it easy to learn.
In this guide, we will write a Python script that extracts images from a PDF file, scans them for text, transcribes it, and saves it to a text file. We will use the Python-tesseract (pytesseract) library to recognize text in images.
Table of contents
- Table of contents
- Prerequisites
- Setting up tesseract OCR
- Adding the project dependencies
- Create a Python tesseract script
- Extract images
- Extract text information
- Test the app
- Conclusion
Prerequisites
To follow along with this article, ensure that you have Python installed and running on your computer.
Also, ensure you have some basic understanding of Python.
Setting up tesseract OCR
Optical Character Recognition (OCR) is a technology that recognizes text in images. It can be used to convert typed, handwritten, or printed text into machine-readable text.
To use OCR, you need to install and configure tesseract on your computer.
First, download the Tesseract OCR executables here. While installing this executable, make sure you copy the tesseract installation path and add it to your system environment variables.
Once the process is done, run the tesseract -v command to verify that the OCR is installed.
To test whether this environment is working, you may run OCR on any image and see if the textual data gets extracted and saved in a readable text file.
To do that, ensure you have an image with textual information. Use your command line to navigate to the image location and run the following tesseract command:
tesseract <image_name> <file_name_to_save_extracted_text>
In this case, you will provide the image name and the file name. When the command is executed, a .txt file will be created and saved in the same folder.
This confirms that the tesseract library is successfully installed. We may now proceed to implement the same using a Python script.
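The same installation check can be scripted. The sketch below is only an illustration and is not part of the tutorial's script: tesseract_version is a hypothetical helper that looks for the executable on the system path and returns the first line of its version banner, or None when it cannot be found.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of the version banner, or None if not installed."""
    # shutil.which searches the system path the same way the shell does
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the banner to stderr instead of stdout
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on the system path")
```

If this prints the "not found" message, the installation path has most likely not been added to the environment variables yet.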
Adding the project dependencies
We need to install a few dependent libraries to help us get started with the Python script.
Pytesseract
Python-tesseract (pytesseract) is a Python wrapper for the Tesseract OCR engine that scans and transcribes text in images. It only recognizes the text; saving it to a text document is left to our script.
To install pytesseract, run the following command:
pip install pytesseract
PyMuPDF
PyMuPDF is a Python library that is used to access documents such as PDFs and the images they contain.
In this application, PyMuPDF will open the PDF document, check each page for embedded images, and extract those images so that they can be scanned for text.
To install PyMuPDF, run the following command:
pip install PyMuPDF
Pillow
Pillow is an image processing library that opens, manipulates, and saves images in many file formats.
To install pillow, run the following command:
pip install pillow
Opencv-python
Opencv-python is used to read images and videos, manipulate media files with image transformations, draw shapes, and put text on those files.
We will use OpenCV to load and display each extracted image before pytesseract recognizes the text in it.
To install opencv-python, run the following command:
pip install opencv-python
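Before moving on, it can help to confirm that every package is importable. The following is a minimal sketch using only the standard library; missing_packages and REQUIRED_MODULES are hypothetical names introduced for illustration. Note that the import names differ from some of the pip package names.

```python
import importlib.util

# Maps each import name to the pip package that provides it
REQUIRED_MODULES = {
    "pytesseract": "pytesseract",
    "fitz": "PyMuPDF",
    "PIL": "pillow",
    "cv2": "opencv-python",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name printed here can be installed with the matching pip command from this section.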
Create a Python tesseract script
Create a project folder and add a new main.py file inside that folder.
First, we need to import the library dependencies that we installed. Add the following imports inside the main.py file:
import os  # standard library
import fitz  # from PyMuPDF
import pytesseract  # from pytesseract
import cv2  # from opencv-python
import io  # standard library
from PIL import Image, ImageFile  # from Pillow
from colorama import Fore  # third-party: pip install colorama
import platform  # standard library
Then, allow Pillow to load image files even if they are truncated:
ImageFile.LOAD_TRUNCATED_IMAGES = True
Once the application has access to a PDF file, its content will be extracted in the form of images. These images will then be processed to extract the text.
In this case, we need to create a few global variables that help to create and save these images to the project path. We also specify the path to save the extracted text into a .txt file.
Go ahead and add these global variables as shown:
# Global variables
strPDF, textScanned, inputTeEx, dirName = "", "", "", ["images", "output_txt"]
This will create an images directory where the images extracted from the PDF will be saved. An output_txt directory will be created to save the scanned text as a .txt file.
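As a side note, the two directories can also be created in a single step. The sketch below is an assumption-based alternative, not the approach the script uses: ensure_dirs is a hypothetical helper built on pathlib, and exist_ok=True makes it safe to call when the folders already exist.

```python
from pathlib import Path


def ensure_dirs(names, base="."):
    """Create each directory under base if needed and return the paths."""
    created = []
    for name in names:
        path = Path(base) / name
        # exist_ok=True avoids FileExistsError on repeated runs
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for directory in ensure_dirs(["images", "output_txt"]):
        print("Ready:", directory)
```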
Now, let's create the function that asks for the path to the installed tesseract executable and the path to the PDF file. We will do this in the gInUs() function as shown:
def gInUs():
    # Global var
    global strPDF
    global inputTeEx
    if platform.system() == "Windows":
        # Ask for the tesseract.exe path (Windows only)
        print(Fore.YELLOW +
              "[.] Add the tesseract.exe local path" + Fore.RESET)
        inputTeEx = input()
    # Ask for the PDF file path
    print(Fore.GREEN + "[!] Add the PDF file local path:" + Fore.RESET)
    inputUser = input()
    # Print an alert if the input is not valid; otherwise call extIm
    if inputUser == "" or len(inputUser.split("\\")) == 1:
        print(Fore.RED + "[X] Please enter a valid PATH to a file" + Fore.RESET)
    else:
        extIm(inputUser)
From the code above:
- "[.] Add the tesseract.exe local path" - it helps us access the tesseract library.
- "[!] Add the PDF file local path:" - it helps us access the local PDF file we want to use.
Once we enter this path, we first need to verify whether the file path is correct. If the path is incorrect, the application will display the Please enter a valid PATH to a file error message. If the path is correct, the application will extract the images and their text by executing the extIm() function.
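The check in gInUs() only tests whether the input contains a backslash, so it suits Windows-style paths and does not confirm that the file exists. A stricter, platform-independent check could look like the sketch below; is_valid_pdf_path is a hypothetical helper introduced here for illustration, not part of the original script.

```python
import os


def is_valid_pdf_path(path):
    """Return True if path points to an existing file with a .pdf extension."""
    if not path:
        return False
    _, ext = os.path.splitext(path)
    # isfile is False for directories and for paths that do not exist
    return os.path.isfile(path) and ext.lower() == ".pdf"


if __name__ == "__main__":
    print(is_valid_pdf_path("no_such_file.pdf"))
```

With such a helper, the condition in gInUs() could become a single call, for example `if not is_valid_pdf_path(inputUser):`.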
Extract images
Once we have the correct PDF file path, we need to open the file and extract its images so that their text can later be saved to the .txt file.
First, we need to open the PDF file and read its contents. To do that, we will use the fitz module as shown below:
# Extracting images
def extIm(fileStr):
    # open the file
    pdf_file = fitz.open(fileStr)
We create a path to save the images that we extract from the file:
    global dirName
    # Create the output folders if they don't exist
    for i in dirName:
        try:
            os.makedirs(i)
            print(Fore.GREEN + "[!] Directory ", i, " Created" + Fore.RESET)
        except FileExistsError:
            print(Fore.RED + "[X] Directory ", i,
                  " already exists" + Fore.RESET)
    content = os.listdir("images")
We need to check if there are any images already available in the folder. If so, we list them and print the file name of each image as shown:
    # List the images if any exist and print each name; if not, extract all images
    if len(content) >= 1:
        # Print every img in content
        for i in content:
            print(Fore.YELLOW + f"This is an image: {i}" + Fore.RESET)
    else:
        # Iterate over PDF pages
        for page_index in range(len(pdf_file)):
            # get the page itself
            page = pdf_file[page_index]
            # get_images() is the current name of the older getImageList()
            image_list = page.get_images()
If no images are available in the folder, we iterate over the PDF pages and extract the images they contain.
Let's print the number of images found on each page and display a message if a page has no images:
            # printing number of images found on this page
            if image_list:
                print(
                    Fore.GREEN + f"[+] Found a total of {len(image_list)} images in page {page_index}" + Fore.RESET)
            else:
                print(Fore.RED + "[!] No images found on page",
                      page_index, Fore.RESET)
In the loop, we name every image that is extracted from the PDF by appending the page number and the image's position on that page to the string image. For example, image2_1 is the first image on the second page:
            for (image_index, img) in enumerate(image_list, start=1):
                # get the XREF of the image
                xref = img[0]
                # extract the image bytes
                # (extract_image() is the current name of the older extractImage())
                base_image = pdf_file.extract_image(xref)
                image_bytes = base_image["image"]
                # get the image extension
                image_ext = base_image["ext"]
                # load it to PIL
                image = Image.open(io.BytesIO(image_bytes))
                # save it to local disk
                image.save(
                    f"images/image{page_index+1}_{image_index}.{image_ext}")
    reImg()
Here, we execute the function reImg() to read these images and extract their content. Let's do this in the next step.
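To make the naming scheme explicit, it can be isolated in a small function. This is only a sketch: image_filename is a hypothetical helper, and it assumes the same images folder and zero-based page index that the loop uses.

```python
def image_filename(page_index, image_index, ext, folder="images"):
    """Build the path for an extracted image.

    page_index is zero-based (as in the PDF loop), so 1 is added to get
    the human-readable page number; image_index already starts at 1.
    """
    return f"{folder}/image{page_index + 1}_{image_index}.{ext}"


if __name__ == "__main__":
    # First image on the second page
    print(image_filename(1, 1, "png"))  # images/image2_1.png
```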
Extract text information
Let's create a function named reImg() and declare the global variables it needs:
def reImg():
    # Global var
    global textScanned
    global dirName
    global inputTeEx
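At this point, we need to tell pytesseract where the tesseract.exe file is. To do that, we use the global variable inputTeEx, which holds the file path entered by the user:
    if inputTeEx:
        pytesseract.pytesseract.tesseract_cmd = inputTeEx
pytesseract uses the path stored in tesseract_cmd to launch the Tesseract executable. On platforms other than Windows, inputTeEx stays empty, so pytesseract falls back to the default tesseract command found on the system path.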
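If you would rather not depend on the user typing a correct path, the executable can be resolved with a fallback. The following is a minimal sketch under that assumption; resolve_tesseract_cmd is a hypothetical helper and is not part of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(user_path=""):
    """Return the user's path if it is an existing file, else search the system path.

    Returns None when no executable can be found either way.
    """
    if user_path and os.path.isfile(user_path):
        return user_path
    return shutil.which("tesseract")


if __name__ == "__main__":
    print(resolve_tesseract_cmd() or "tesseract was not found")
```

The returned value could then be assigned to pytesseract.pytesseract.tesseract_cmd whenever it is not None.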
We need to loop through each extracted image and read its content to extract the text as shown:
    # List the images
    content = os.listdir('images')
    for i in range(len(content)):
        # Reading each image in images
        image = cv2.imread(f'images/{content[i]}')
        # Scan text from image
        print(Fore.YELLOW + f"[.] Scan text from {content[i]}" + Fore.RESET)
        # lang='spa' needs the Spanish language data; use lang='eng' for English
        text = pytesseract.image_to_string(image, lang='spa')
        # Concatenate the scanned text into one string
        textScanned += text
        # print
        print(Fore.GREEN + "[!] Finished scan text" + Fore.RESET)
        # Showing img input
        cv2.imshow('Image', image)
        # Wait 1000 milliseconds (1 second)
        cv2.waitKey(1000)
    # Create and write file txtResult.txt
    print(Fore.CYAN + "[.] Writing txtResult.txt" + Fore.RESET)
    with open(f"{dirName[1]}/txtResult.txt", "w") as fileTxt:
        fileTxt.write(textScanned)
    print(Fore.GREEN + "[!] File written" + Fore.RESET)
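One detail worth knowing: os.listdir returns names in arbitrary order, and a plain alphabetical sort would place image10_1 before image2_1, so the text of a long PDF may be concatenated out of page order. The sketch below shows one possible sort key; page_order_key is a hypothetical helper that assumes the image<page>_<index>.<ext> naming used earlier.

```python
import re


def page_order_key(filename):
    """Sort key that orders image<page>_<index>.<ext> names numerically."""
    match = re.match(r"image(\d+)_(\d+)\.", filename)
    if match is None:
        # Unrecognized names sort after all page images, alphabetically
        return (1, 0, 0, filename)
    return (0, int(match.group(1)), int(match.group(2)), filename)


names = ["image10_1.png", "image2_1.png", "image1_2.jpeg", "image1_1.png"]
print(sorted(names, key=page_order_key))
# ['image1_1.png', 'image1_2.jpeg', 'image2_1.png', 'image10_1.png']
```

In reImg(), the list could be ordered this way with `content = sorted(os.listdir('images'), key=page_order_key)`.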
Finally, call the gInUs() function to execute the program:
# Call to main function
gInUs()
Test the app
To test the app, run python main.py.
First, provide the tesseract path and hit enter (this prompt appears on Windows only):
> [.] Add the tesseract.exe local path
Once you hit enter, you will be instructed to add the PDF path:
> [!] Add the PDF file local path:
On execution, the program creates an images folder for the extracted images and an output_txt folder, where the extracted text is saved in the txtResult.txt file.
Conclusion
In this guide, we created a Python script that extracts textual information from the images by scanning, transcribing, and saving it to a text file. You can get the code used in this guide on GitHub.
I hope you found this tutorial helpful.
Peer Review Contributions by: Srishilesh P S