How To Extract Text From PDF In Python

This example shows you how to use the PyPDF2, textract and nltk python modules to extract text from a pdf format file.

1. Install PyPDF2, textract and nltk Python Modules.

  1. Open a terminal and run the commands below to install the above python libraries.
    pip install PyPDF2
    pip install textract
    pip install nltk
  2. When installing textract, you may encounter the error message below. It means swig is not installed on your OS; you can refer to How To Install Swig On MacOS, Linux And Windows to learn more.
    unable to execute 'swig': No such file or directory
    This happens because the textract installation needs swig. So install swig first (for example, brew install swig on macOS or sudo apt-get install swig on Ubuntu/Debian), then rerun pip install textract.
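After installing, you can quickly confirm the three modules are importable with a small stdlib-only check. This is just an illustration; the helper name check_modules is hypothetical, not part of any of these libraries.

```python
# Hypothetical helper: report whether each required module can be imported.
import importlib.util

def check_modules(names):
    # Map each module name to True/False depending on whether it is importable.
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, ok in check_modules(["PyPDF2", "textract", "nltk"]).items():
    print(name, "installed" if ok else "MISSING")
```

If any line reports MISSING, rerun the corresponding pip install command above.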
    

2. Python PDF Text Extract Example.

python extract pdf text example project files

  1. Open eclipse and create a PyDev project PythonExampleProject. You can refer to How To Run Python In Eclipse With PyDev.
  2. Create a python module com.dev2qa.example.file.PDFExtract.py.
  3. Copy and paste the python code below into the above file. There are two functions in this file: the first function extracts the pdf text, and the second function splits the text into keyword tokens, removing stop words and punctuation.
    '''
    Created on Aug 10, 2018
    @author: zhaosong
    This example tells you how to extract text content from a pdf file.
    '''
    
    import PyPDF2
    import textract
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    
    # This function will extract and return the pdf file text content.
    def extractPdfText(filePath=''):
    
        # Open the pdf file in read binary mode.
        fileObject = open(filePath, 'rb')
    
        # Create a pdf reader.
        pdfFileReader = PyPDF2.PdfFileReader(fileObject)
    
        # Get total pdf page number.
        totalPageNumber = pdfFileReader.numPages
    
        # Print pdf total page number.
        print('This pdf file contains ' + str(totalPageNumber) + ' pages in total.')
    
        currentPageNumber = 0
        text = ''
    
        # Loop in all the pdf pages.
        while(currentPageNumber < totalPageNumber ):
    
            # Get the specified pdf page object.
            pdfPage = pdfFileReader.getPage(currentPageNumber)
    
            # Get pdf page text.
            text = text + pdfPage.extractText()
    
            # Process next page.
            currentPageNumber += 1
    
        if(text == ''):
            # If no text can be extracted, the pdf is probably scanned, so use the ocr lib to extract it.
            # textract.process returns bytes, so decode it into a string.
            text = textract.process(filePath, method='tesseract', encoding='utf-8').decode('utf-8')

        return text
    
    # This function will remove all stop words and punctuations in the text and return a list of keywords.
    def extractKeywords(text):
        # Split the text words into tokens
        wordTokens = word_tokenize(text)
    
        # Remove the punctuation marks below from the token list.
        punctuations = ['(',')',';',':','[',']',',']
    
        # Get all stop words in english.
        stopWords = stopwords.words('english')
    
        # The list comprehension below returns only the tokens that are not in the stop words or punctuations.
        keywords = [word for word in wordTokens if word not in stopWords and word not in punctuations]
       
        return keywords
    
    if __name__ == '__main__': 
    
        pdfFilePath = '/Users/zhaosong/Documents/WorkSpace/e-book/Mastering-Node.js.pdf'
       
        pdfText = extractPdfText(pdfFilePath)
        print('There are ' + str(len(pdfText)) + ' characters in the pdf file.')
        #print(pdfText)
    
        keywords = extractKeywords(pdfText)
        print('There are ' + str(len(keywords)) + ' keywords in the pdf file.')
        #print(keywords)    
    
  4. Right click the source code and click the Run As -> Python Run menu item. Then you can see the output below in the eclipse console.

    PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
    This pdf file contains 347 pages in total.
    There are 481318 characters in the pdf file.
    There are 53212 keywords in the pdf file.
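The stop-word and punctuation filtering done by extractKeywords can be sketched without nltk. This is a minimal illustration only; the stop_words set below is a tiny assumed subset of nltk's real English stop-word list, and filter_keywords is a hypothetical helper.

```python
# Minimal sketch of the keyword filtering step, without nltk.
# The stop-word set here is a tiny assumed subset of nltk's English list.
punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = {'the', 'is', 'a', 'an', 'to', 'in', 'of', 'and', 'it'}

def filter_keywords(tokens):
    # Keep only tokens that are neither stop words nor punctuation marks.
    return [t for t in tokens if t.lower() not in stop_words and t not in punctuations]

tokens = ['The', 'pdf', 'file', 'is', 'in', 'the', 'folder', ',', 'open', 'it']
print(filter_keywords(tokens))  # ['pdf', 'file', 'folder', 'open']
```

The real example additionally relies on nltk's word_tokenize, which splits punctuation into separate tokens more reliably than a plain whitespace split.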
    
    

3. Extract PDF Text Example Execution Error Fix.

When you run the example, you may encounter some errors. The errors and how to fix them are listed below.


3.1 nltk punkt not found error.

This error occurs when importing nltk.tokenize.word_tokenize.

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('punkt')
  Searched in:
    - '/Users/zhaosong/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/lib/nltk_data'
    - ''
**********************************************************************

When you see the above error message, run the commands below in a python interactive console to download the nltk punkt package.

>>> import nltk
>>> nltk.download('punkt')
[nltk_data] Downloading package punkt to /Users/zhaosong/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True

3.2 nltk stopwords not found error.

This error occurs when importing nltk.corpus.stopwords.

  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('stopwords')
  Searched in:
    - '/Users/zhaosong/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.6/lib/nltk_data'
**********************************************************************

Run the commands below in a python interactive console to fix the error.

>>> import nltk
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/zhaosong/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
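Instead of downloading punkt and stopwords one by one in an interactive console, both resources can also be fetched from a terminal in a single command (assuming nltk is already installed):

```shell
# Download both nltk data packages non-interactively.
python -m nltk.downloader punkt stopwords
```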