How To Extract Text From Pdf In Python

This example shows how to use the Python modules PyPDF2, textract, and nltk to extract text from a pdf file.

1. Install Python Modules PyPDF2, textract, and nltk.

  1. Open a terminal and run the below commands to install the above python libraries.
    pip install PyPDF2
    pip install textract
    pip install nltk
  2. When installing textract, you may encounter the below error message. It means that swig is not installed on your os; you can refer to How To Install Swig On macOS, Linux, And Windows to learn more.
    unable to execute 'swig': No such file or directory
    This is because the textract installation needs swig to be installed, so install swig first and then run pip install textract again.
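
The install step above can be double-checked from Python before moving on. The below snippet is only a minimal sketch: the function names findMissingModules and isSwigInstalled are made up for this illustration, and it only reports whether the three modules can be located and whether a swig executable is on the PATH.

```python
import importlib.util
import shutil

# Module names taken from the install step above.
REQUIRED_MODULES = ['PyPDF2', 'textract', 'nltk']

# Hypothetical helper: return the names of the modules that can not be found.
def findMissingModules(moduleNames):
    # find_spec returns None when a top-level module can not be located.
    return [name for name in moduleNames if importlib.util.find_spec(name) is None]

# Hypothetical helper: check whether the swig executable is on the PATH.
def isSwigInstalled():
    # shutil.which returns the executable path, or None when it is not found.
    return shutil.which('swig') is not None

if __name__ == '__main__':
    print('Missing modules: ' + str(findMissingModules(REQUIRED_MODULES)))
    print('swig found on PATH: ' + str(isSwigInstalled()))
```

An empty list of missing modules means the three pip install commands succeeded in the python environment that runs the snippet.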
    

2. Python PDF Text Extract Example.

  1. Open eclipse and create a PyDev project PythonExampleProject. You can refer to How To Run Python In Eclipse With PyDev.
  2. Create a python module com.dev2qa.example.file.PDFExtract.py.
  3. Copy and paste the below python code into the above file. There are two functions in this file: the first function extracts the pdf text, and the second function splits the text into keyword tokens and removes stop words and punctuations.
    '''
    This example shows how to extract the text content of a pdf file.
    '''
    
    import PyPDF2
    import textract
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    
    # This function extracts and returns the text content of a pdf file.
    def extractPdfText(filePath=''):
    
        text = ''
    
        # Open the pdf file in read binary mode; the with block closes the file afterwards.
        with open(filePath, 'rb') as fileObject:
    
            # Create a pdf reader. These are the PyPDF2 3.x names; older releases
            # used PdfFileReader, numPages, getPage and extractText instead.
            pdfReader = PyPDF2.PdfReader(fileObject)
    
            # Get total pdf page number.
            totalPageNumber = len(pdfReader.pages)
    
            # Print pdf total page number.
            print('This pdf file contains totally ' + str(totalPageNumber) + ' pages.')
    
            # Loop in all the pdf pages and collect the text of each page.
            for pdfPage in pdfReader.pages:
                text = text + pdfPage.extract_text()
    
        if(text == ''):
            # If no text can be extracted, use the ocr lib to extract the scanned pdf file.
            # textract.process returns bytes, so decode them to get a string.
            text = textract.process(filePath, method='tesseract', encoding='utf-8').decode('utf-8')
    
        return text
    
    # This function removes all stop words and punctuations from the text and returns a list of keywords.
    def extractKeywords(text):
        # Split the text into word tokens.
        wordTokens = word_tokenize(text)
    
        # Remove the below punctuations from the list.
        punctuations = ['(',')',';',':','[',']',',']
    
        # Get all stop words in english.
        stopWords = stopwords.words('english')
    
        # The below list comprehension keeps only the tokens that are neither stop words nor punctuations.
        keywords = [word for word in wordTokens if word not in stopWords and word not in punctuations]
       
        return keywords
    
    if __name__ == '__main__':
    
        pdfFilePath = '/Users/zhaosong/Documents/WorkSpace/e-book/Mastering-Node.js.pdf'
    
        pdfText = extractPdfText(pdfFilePath)
        # len() of a string counts characters, not words.
        print('There are ' + str(len(pdfText)) + ' characters in the pdf file.')
        #print(pdfText)
    
        keywords = extractKeywords(pdfText)
        print('There are ' + str(len(keywords)) + ' keywords in the pdf file.')
        #print(keywords)
    
  4. Right-click the source code and click the Run As -> Python Run menu item. You will then see the below output in the eclipse console.
    PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
    This pdf file contains totally 347 pages.
    There are 481318 characters in the pdf file.
    There are 53212 keywords in the pdf file.
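
To see what the keyword filtering does without a pdf file or the NLTK data at hand, the below sketch repeats the same idea on one sentence. It is only an illustration under stated assumptions: SAMPLE_STOP_WORDS is a short hand-written stand-in for stopwords.words('english'), simpleTokenize is a rough regular expression stand-in for word_tokenize, and the ignoreCase switch is an optional variation that is not part of the example above.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOP_WORDS = ['the', 'is', 'a', 'of', 'and', 'in']

# The same punctuation list as in the example above.
PUNCTUATIONS = ['(', ')', ';', ':', '[', ']', ',']

# Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks.
def simpleTokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

# Keep only the tokens that are neither stop words nor listed punctuations.
def filterKeywords(text, ignoreCase=False):
    keywords = []
    for word in simpleTokenize(text):
        # Optionally compare in lower case so that 'The' matches the stop word 'the'.
        candidate = word.lower() if ignoreCase else word
        if candidate not in SAMPLE_STOP_WORDS and word not in PUNCTUATIONS:
            keywords.append(word)
    return keywords

if __name__ == '__main__':
    sample = 'The parser is fast, and the output (a list) is small.'
    print(filterKeywords(sample))
    print(filterKeywords(sample, ignoreCase=True))
```

With this sample sentence the case-sensitive run keeps 'The' because only the lower case 'the' is in the stop word list, and both runs keep the final '.' because the punctuation list does not contain it. The same two effects can be expected from the example above, so lower casing the tokens or extending the punctuation list may be worth considering.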
    
    

3. Extract PDF Text Example Execution Error Fix.

  1. When you run the example you may encounter some errors. The errors and how to fix them are listed below.

3.1 nltk punkt not found error.

  1. This error occurs when word_tokenize (imported from nltk.tokenize) is called.
  2. Below is the error message.
    LookupError:
    **********************************************************************
      Resource punkt not found.
      Please use the NLTK Downloader to obtain the resource:
      >>> import nltk
      >>> nltk.download('punkt')
      Searched in:
        - '/Users/zhaosong/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - '/Library/Frameworks/Python.framework/Versions/3.6/nltk_data'
        - '/Library/Frameworks/Python.framework/Versions/3.6/share/nltk_data'
        - '/Library/Frameworks/Python.framework/Versions/3.6/lib/nltk_data'
        - ''
    **********************************************************************
  3. When you see the above error message, run the below commands in a python interactive shell to download nltk punkt.
    >>> import nltk
    >>> nltk.download('punkt')
    [nltk_data] Downloading package punkt to /Users/zhaosong/nltk_data...
    [nltk_data]   Unzipping tokenizers/punkt.zip.
    True

3.2 nltk stopwords not found error.

  1. This error occurs when stopwords.words (imported from nltk.corpus) is called.
  2. Below is the error message.
      Please use the NLTK Downloader to obtain the resource:
      >>> import nltk
      >>> nltk.download('stopwords')
      Searched in:
        - '/Users/zhaosong/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - '/Library/Frameworks/Python.framework/Versions/3.6/nltk_data'
        - '/Library/Frameworks/Python.framework/Versions/3.6/share/nltk_data'
        - '/Library/Frameworks/Python.framework/Versions/3.6/lib/nltk_data'
    **********************************************************************
  3. Run the below commands in a python interactive shell to fix the error.
    >>> import nltk
    >>> nltk.download('stopwords')
    [nltk_data] Downloading package stopwords to
    [nltk_data]     /Users/zhaosong/nltk_data...
    [nltk_data]   Unzipping corpora/stopwords.zip.
    True
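
Both fixes above can also be done from code, so that the example downloads a resource only when it is missing. The below function is a minimal sketch, not part of the original example: the name ensureNltkResources and its find and download parameters are assumptions made for this illustration. It relies on nltk.data.find raising LookupError for a missing resource, which is the error shown in 3.1 and 3.2. Depending on the installed nltk version, the error message may name a different resource than punkt; in that case use the name from the message.

```python
# Resource paths as used by nltk.data.find, mapped to their download names.
REQUIRED_RESOURCES = {
    'tokenizers/punkt': 'punkt',
    'corpora/stopwords': 'stopwords',
}

# Hypothetical helper: download each missing resource and return the downloaded names.
def ensureNltkResources(resources=REQUIRED_RESOURCES, find=None, download=None):
    # Import nltk only when needed, so the lookup and download functions can be swapped out.
    if find is None or download is None:
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download

    downloaded = []
    for resourcePath, packageName in resources.items():
        try:
            # find raises LookupError when the resource is not installed.
            find(resourcePath)
        except LookupError:
            download(packageName)
            downloaded.append(packageName)
    return downloaded
```

Calling ensureNltkResources() with no arguments at the start of the main block, before extractKeywords runs, would use the real nltk functions.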
