INSIGHTS

10 min read

by Suryakanta Mohapatra

Published on 05/26/2021

Last updated on 03/21/2024

Protecting Personal Identifiable Information with LexNLP

It is no wonder that companies are taking stringent measures to comply with the EU's General Data Protection Regulation (GDPR), which protects the privacy of the Personally Identifiable Information (PII) of EU residents. Stricter regulations are on the way, such as the California Privacy Rights Act in 2023 and the SAFE DATA Act, expected by the end of 2021. Protecting PII is therefore becoming a matter of paramount importance to businesses. At the same time, it is burdensome for humans to check every public medium at their disposal for PII.

In this article we will see how to retrieve text from an image with Optical Character Recognition (OCR) and how to use a Natural Language Processing library called LexNLP to check whether the extracted text contains any PII. This is the output of a recent hackathon I participated in. As a high-level overview: first we will walk through OCR and its application, and then we will see how LexNLP helps detect the presence of PII. To tie everything together, we'll write a simple Python script that extracts text and checks it for PII. All of this is exposed through a Flask web application so that users can interact with it.

Optical Character Recognition

Optical Character Recognition, or OCR, is a technology that detects text in an image and extracts it as machine-encoded text that a computer can process. Detecting text automatically is not as trivial as it appears to humans. Behind the scenes is a series of image-processing steps and other complex algorithms that finally produce the text. To the computer, the processed image is only a matrix of white and black dots. Extraction involves multiple phases such as despeckling, binarisation, line removal, layout analysis, script recognition, segmentation, normalization, matrix matching, and post-processing. We will keep this jargon out of scope for simplicity. In this article we will stick to the PyTesseract library to retrieve text from an image. It is a wrapper for Google's Tesseract-OCR engine. Other libraries, such as PyOCR, can also be used.
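To make the "matrix of white and black dots" idea concrete, here is a minimal sketch of binarisation on a tiny grayscale image held as nested lists. The function name binarise and the threshold of 128 are assumptions chosen for illustration; Tesseract performs its own, more sophisticated thresholding internally.

```python
def binarise(gray_rows, threshold=128):
    """Map each 0-255 grayscale value to 0 (black) or 255 (white).

    `threshold` is an illustrative default, not a value taken from Tesseract.
    """
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray_rows
    ]


# A 2x2 "image": dark and light pixels on either side of the threshold.
sample = [[12, 200],
          [128, 127]]
print(binarise(sample))
```

Each pixel ends up either fully black or fully white, which is the form the later OCR phases work on.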

Natural Language Processing

Natural Language Processing is a vast field in itself, concerned primarily with the interaction between computers and human language in order to process and analyze large amounts of natural language data. Here we will use it through an open-source library called LexNLP to extract PII from text.

Technology stack we will use

For this project, we are going to use the following:

PyTesseract for Optical Character Recognition
Flask web framework for OCR server
Pillow library for image manipulation
LexNLP for extracting PII

Getting our hands dirty

Enough explanation; now let's build the real thing. In layman's terms, it is a web page that lets a user upload an image, snapshot, or scanned photo to the server and get back the text in the image along with any PII that may be present. First we will install all the prerequisites for this project. Then we will develop the Flask application needed for user interaction.

We will use pip to install software packages. To install pip, manually download get-pip.py or use curl to do so:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Then navigate to the folder where get-pip.py was downloaded and run the following command to install pip:

python get-pip.py

Now we are ready to install pipenv, which creates and manages a virtual environment for our project:

pip install pipenv

With pipenv in place, let's create a project directory and initialise the environment in it. The --three flag tells pipenv to use the system's python3:

mkdir lexnlp-extraction && cd lexnlp-extraction && pipenv install --three

Activate the virtual environment with:

pipenv shell

Now we can install all the packages we depend on with pipenv install. We depend on pytesseract for the OCR functions and Pillow for image manipulation, so let's install both now:

pipenv install pytesseract Pillow

The most important dependency is LexNLP, which provides the core functionality for finding PII in the supplied text. To install it, clone the LexNLP git repo to a local folder and run pipenv install inside it:

git clone https://github.com/LexPredict/lexpredict-lexnlp.git
cd lexpredict-lexnlp
pipenv install

The steps above install the LexNLP library, which provides a wide array of features, but we will focus on its PII extraction feature.

Last but not least, install the Flask framework:

pipenv install flask

With all the prerequisites installed, it is time to create the necessary scripts. Should you need to learn Flask before moving ahead, feel free to visit its quickstart guide. Before writing the scripts, let us see what the project layout looks like:

app.py : Starts the Flask server and contains the necessary routes
ocr_extraction.py : Handles the extraction of text from the image
lexnlp_extraction.py : Handles the extraction of PII
templates/ : Folder containing all the HTML files
static/uploads/ : Folder containing all uploaded images
index.html : Home page with which the application starts
upload.html : Lets the user upload the image file and shows the result

Let’s define the content of ocr_extraction.py:

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

def ocr_extraction(filename):
    """
    This function will handle the core OCR processing of images.
    """
    text = pytesseract.image_to_string(Image.open(filename))
    return text

The script above opens the image with the Image class of the Pillow library and then extracts the text with pytesseract's image_to_string() function.
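Because pytesseract is only a wrapper, the Tesseract executable itself has to be installed separately and be reachable on the PATH. The following is a small sketch of a check that could be run before starting the server; the helper name tesseract_available is hypothetical and not part of pytesseract.

```python
import shutil


def tesseract_available(command="tesseract"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(command) is not None


if not tesseract_available():
    print("Tesseract was not found on PATH; install it before running OCR.")
```

Running this once at startup gives a clearer message than the error raised on the first OCR call.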

lexnlp_extraction.py is another file, which defines a function that extracts the list of PII from the supplied text.

import lexnlp.extract.en.pii

def extract_pii(input_string):
    return list(lexnlp.extract.en.pii.get_pii(input_string))
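To give a feel for what a PII detector looks for, here is a deliberately simplified sketch that uses two regular expressions. It is not how LexNLP is implemented; the patterns, the labels, and the function name find_pii_candidates are assumptions made for illustration, and they cover only one US SSN format and one US phone format.

```python
import re

# Illustrative patterns only; a real detector such as LexNLP covers more formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"(?<!\d)\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}(?!\d)"),
}


def find_pii_candidates(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, match.group(0)) for match in pattern.finditer(text))
    return hits


print(find_pii_candidates("SSN 123-45-6789, phone (555) 123-4567"))
```

A pattern like this only matches when every character was recognised correctly, which is why OCR quality matters so much for the detection step.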

app.py is the file that starts the Flask application. Here is the code.

import os
from flask import Flask, render_template, request
from werkzeug.utils import secure_filename
from ocr_extraction import ocr_extraction
from lexnlp_extraction import extract_pii


# folder where the uploaded images are saved
UPLOAD_FOLDER = 'static/uploads/'

# allowed image file extensions
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg'}

app = Flask(__name__)

# validates the file extension
def allowed_file(filename):
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

# route and function to handle the home page
@app.route('/')
def home_page():
    return render_template('index.html')

# route and function to handle the upload page
@app.route('/upload', methods=['GET', 'POST'])
def upload_page():
    if request.method == 'POST':
        # check if there is a file in the request
        if 'file' not in request.files:
            return render_template('upload.html', msg='No file selected')
        file = request.files.get('file')
        # if no file is selected
        if file.filename == '':
            return render_template('upload.html', msg='No file selected')

        if file and allowed_file(file.filename):
            # sanitize the client-supplied name before using it in a path
            filename = secure_filename(file.filename)
            saved_path = os.path.join(UPLOAD_FOLDER, filename)
            file.save(saved_path)
            # OCR function extracts text from the saved image; the upload
            # stream has already been consumed by save(), so pass the path
            extracted_text = ocr_extraction(saved_path)
            # LexNLP extracts a list of PII of possibly different categories
            pii = ", ".join(map(str, extract_pii(extracted_text)))
            # sends the OCR-extracted and LexNLP-extracted texts
            return render_template('upload.html',
                                   msg='Successfully processed',
                                   extracted_text=extracted_text, pii_text=pii,
                                   img_src=UPLOAD_FOLDER + filename)
        else:
            return render_template('upload.html', msg='Please upload a file in a supported format')
    elif request.method == 'GET':
        return render_template('upload.html')

if __name__ == '__main__':
    app.run()
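The extension check in app.py can be tried on its own. The sketch below repeats allowed_file() outside Flask and runs it on a few made-up file names; the names are hypothetical examples, not files from the project.

```python
ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg"}


def allowed_file(filename):
    """Accept a name only if the text after its last dot is an allowed extension."""
    return "." in filename and \
        filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


# Hypothetical file names covering the interesting cases.
names = ["scan.PNG", "photo.jpeg", "report.pdf", "archive.png.exe", "noextension"]
checks = {name: allowed_file(name) for name in names}
print(checks)
```

Only the last extension counts and the comparison is case-insensitive. The check looks at the name alone, so it does not prove that the content is really an image.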

The upload_page() function is called when an image is uploaded from the HTML page, and the uploaded image is stored in the static/uploads/ folder. Similarly, the HTML files are stored in the templates/ folder. We have to create both folders manually and keep the HTML files (shown below) in the templates/ folder. The index.html file is served by the default (basic) route as the home page and has a hyperlink to the upload page. Below is the code.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Index</title>
  <style type="text/css">
    span {font-size: 1.6em;}
  </style>
</head>
<body>
<h1><center>Welcome to LexNLP Demo</center></h1>
<h3><center>This demo shows how to extract Personal Identifiable Information from an Image file</center></h3>
<h4><center>Please click the below link to upload the image file(png/jpg/jpeg)</center></h4>
<p><center><a href="http://127.0.0.1:5000/upload"><span>upload</span></a></center></p>
</body>
</html>

Finally, upload.html submits the image via the POST method and renders the response from app.py. Below is the code:

<html>
 <head><title>Upload Image</title></head>
 <body>
  <center>
   {% if msg %}
   <p class = "p3">{{ msg }}</p>
   {% endif %}
   <h1>Upload new File</h1>
   <form method=post enctype=multipart/form-data>
     <p><input type=file name=file> <input type=submit value=Upload></p>
   </form>
   <h1>Result:</h1>
   {% if img_src %}
     <img height = 400 width = 300 border=1 src="{{ img_src }}">
   {% endif %}
   {% if extracted_text %}
     <p class = "p2"> The extracted text from the image above is: <br>
       <b><i> {{ extracted_text }} </i></b>
     </p>
   {% else %}
     The extracted text will be displayed here
   {% endif %}
   {% if pii_text %}
     <p  class = "p1"> The extracted Personal Identifiable texts from the uploaded image  is: <br>
      <b><i><span style="color: red"> {{ pii_text }} </span></i></b>
    </p>
   {% endif %}
  </center>
 </body>
</html>

Now we are ready to run the app. From the project directory, enter the virtual environment and start the server by running:

pipenv shell
flask run

The server should start with the following message:

* Environment: production

WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.

* Debug mode: off

* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Open the URL displayed above (127.0.0.1:5000) in the browser and we should see the home page.

Clicking the upload link should show the upload page.

On the upload page, click the Choose File button, select the image from which we intend to fetch the text and PII, and then click the Upload button, which shows the extracted text and PII. Below are the results for different images: a digitally written black and white screenshot, a black and white scanned handwritten image, and a color scanned handwritten image.

Result for Screenshot Image of digitally written document

The above result is for a screenshot of a digitally written document, for which both the text and the PII (such as SSN and phone number) are fetched accurately.

Result for Black and White scanned Image of handwritten document

The above result is for a black and white scanned image of a handwritten document, for which the SSN was fetched poorly: 1 was misinterpreted as ‘(’, and hence no PII was detected.

Result for Color scanned Image of handwritten document

The above result is for a color scanned image of a handwritten document, which is quite similar to the previous one, the difference being "Live en" vs "Live tn".

Moreover, it would be worth the effort to contribute to LexNLP so that its PII scope includes information such as medical records and tax records. The accuracy of text extraction depends heavily on how good the contrast of the image is. This area definitely needs further analysis and testing. The primary purpose of this article is to share what I learned from a hackathon, which may help beginners trying to get into this field.
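One direction for that further analysis is a clean-up pass between OCR and PII detection. The sketch below is a hypothetical heuristic, not something the project implements: where a token has the 3-2-4 shape of an SSN, it swaps characters that OCR may confuse with digits, such as the ‘(’ read in place of 1 in the handwritten sample. The confusion table and the function name repair_digit_runs are assumptions.

```python
import re

# Hypothetical table of characters that OCR may produce in place of digits.
CONFUSIONS = str.maketrans({"(": "1", "l": "1", "I": "1",
                            "O": "0", "o": "0", "S": "5"})

# A token shaped like a 3-2-4 SSN once confusable characters are allowed.
SSN_LIKE = re.compile(
    r"(?<![\w(])[\d(lIOoS]{3}-[\d(lIOoS]{2}-[\d(lIOoS]{4}(?!\w)"
)


def repair_digit_runs(text):
    """Replace confusable characters with digits inside SSN-shaped tokens only."""
    return SSN_LIKE.sub(lambda m: m.group(0).translate(CONFUSIONS), text)


print(repair_digit_runs("SSN: (23-45-6789"))
```

Restricting the substitution to SSN-shaped tokens keeps ordinary words and parentheses untouched, but a heuristic like this can still guess wrong, so it would need testing on real scans before being relied on.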

Conclusion

Although we have achieved a lot in this project, we could also have included PDF files in the text extraction process, for which we could use the PyPDF2 library; the rest of the process works the same. Pytesseract and LexNLP are great open source libraries for OCR and PII detection, and they can be useful in many use cases for making sure PII privacy requirements are met. The source code of the project can be accessed on GitHub.

Reference and Credits

Robley Gori — PyTesseract — Simple Python Optical Character Recognition
Extracting Personally-Identifiable Information(PII)
GitHub Tesseract
https://pypi.org/project/pytesseract/
Word cloud from https://www.wordclouds.com/
Icons in the cover image from https://icons8.com/
