[Python] PDFファイルから文字抽出

「大量にPDFファイルがあり、そこから文字を抽出したい。」
そんなお悩みにPython（プログラム言語）でお答えします！

まずは、PDFの種類を確認し、それぞれに対応コードを例示します。

* 今回、構造化データは英語文書のみを対象としていますのでご注意ください。
* 構造化データを日本語対応にしたい場合は「pdfminer.six」モジュールの利用をお勧めします。

対応したいファイル
・パスワード付きのファイル
・その他読み込みが難しいファイル

Twitter:R@へDMいただけると助かります！

想定読者

・Pythonを使ったことがある人
・プログラム言語を書いても良いなと思う人

プログラムは意外と簡単なので、Anaconda Navigatorを使えるようにしてみてください。
下記のコードは、お手持ちのパソコンで全て実行可能です。スペックは不問です！

PDFファイルの種類

構造化データ

「文字がどこに配置されているか」が明確に決まっているPDFファイル。

非構造化データ

「文字がどこに配置されているか」が明確に決まっていないPDFファイル。画像を埋め込んだものが主。

1. 構造化データ

構造化データは、処理する方法が決まっています。

コマンドラインでPyPDF2のインストール

pip install PyPDF2

Jupyter NotebookまたはPython上でPython関数の定義

import PyPDF2
# 構造化データ読み込み
def open_pdf_text(file_name):
    text = ''
    with open(file_name, "rb") as f:
        reader = PyPDF2.PdfFileReader(f)
        for page_no in range(reader.numPages):
            page = reader.getPage(page_no)
            if page.extractText():
                text += page.extractText()
    return text

下記コードの実行

file_set = "PDFs/構造化データ.pdf"
open_pdf_text(file_set)

構造化データの場合、open_pdf_text関数からfile内の文字データが出力されます。
そうでない場合、空の出力（または改行のみ）が行われます。そのため、2番の非構造化データ抽出を行いましょう。

2. 非構造化データ

非構造化データの場合、画像として読み込んだ方が手っ取り早いです。
以降はPDFを画像化し、OCR（光学文字認識）で読み込む方法です。

TesseractとPyOCRのインストール

OCRをあなたのパソコン上で使うために、インストールを行います。
詳細はPythonでOCRを実行する方法のうち、TesseractのインストールとPyOCRのインストールをご覧ください。

PDFを画像化し、画像をOCRで読み込むPython関数の定義

PDFを画像化した際に保存しておくフォルダを「Image」とします。なければ作ってください。

import pathlib, pdf2image, glob, os, pyocr, PIL

# poppler/binを環境変数PATHに追加する
poppler_dir = os.path.join(pathlib.Path().resolve(), "src/poppler/bin")
os.environ["PATH"] += os.pathsep + str(poppler_dir)

# 画像読み込み
def open_pdf_image(file_name):
    ## pdfから画像化
    img_dir = pathlib.Path('Images')
    # 存在する画像ファイルを削除
    for p in glob.glob(os.path.join(img_dir, "*.png")):
        if os.path.isfile(p):
            os.remove(p)
    # 画像ファイル抽出
    images = pdf2image.convert_from_path(file_name, grayscale=True, size=1800)
    for index, image in enumerate(images):
        image.save(img_dir/pathlib.Path('{}.png'.format(index + 1)), 'png')
        
    ## 画像を加工したい場合はこちらに記述する

    ## ocr読み込み
    tools = pyocr.get_available_tools()
    tool = tools[0]
    builder = pyocr.builders.TextBuilder(tesseract_layout=6) # 1～6まで存在する
    # テキスト抽出
    text = ''
    for p in glob.glob(os.path.join(img_dir, "*.png")):
        if os.path.isfile(p):
            img = PIL.Image.open(p)
            text += tool.image_to_string(img, builder=builder)
    return text

エラーが出る場合は、上記と同じようにpipでインストールしてください。

実行は上記と同じように行います。

file_set = "PDFs/非構造化データ.pdf"
open_pdf_image(file_set)

画像を加工する方法

特定部分以外の文字が必要がない場合は、画像のトリミングを行います。
たとえば、下記のようなPDFの場合、赤枠内の文字しか使いません。

下記は例ですが、どのファイルにも応用が可能です。

## 画像ファイル加工
img_dir = pathlib.Path('Images')
for p in glob.glob(os.path.join(img_dir, "*.png")):
    if os.path.isfile(p):
        ## 画像1枚の加工
        img = PIL.Image.open(p)
        # 切り抜き版を(left, upper, right, lower)で指定
        img_crop = img.crop((160, 180, 1200, 1640))
        img_crop.save(p)

上記の画像ファイルは下記のように加工されます。

加工することで、無駄な情報がかなり減ります。
例として、加工前と加工後の文字を比較します。

画像加工後にも存在, 画像加工前のみ存在

Datum Blatt Anmelde-Nr:\nDate cf Form 1507 Sheet 1 ppelcation 15 836 530.4\nDate Feuille °° .\nThe examination is being carried out on the following application documents\nDescription, Pages\n1-16 filed with entry into the regional phase before the EPO\nClaims, Numbers\n1-6 filed with entry into the regional phase before the EPO\nDrawings, Sheets\nq1 filed with entry into the regional phase before the EPO\nD1 JP 2001 073094 A (SUMITOMO METAL IND) 21 March 2001\nD2 JP 2000 080450 A (SUMITOMO METAL IND) 21 March 2000\nD3 JP 2001 192788 A (SUMITOMO METAL IND) 17 July 2001\n\n1 Clarity and Conciseness (Art. 84 EPC)\n\n14 In order to meet the requirements following Article 84 EPC, the elemental\ncomposition of the non-oriented electrical steel sheet must be 100%\ndisclosed. All mandatory and optional elements as well as specified impurity\nlevels including their numerical ranges must be indicated in the main claim.\nOmission of elements and their ranges or partial disclosures allows the\npossibility that other elements in unspecified quantities may be included in the\nnon-oriented electrical steel sheet which may in turn have unforeseen effect\nupon the non-oriented electrical steel sheet. It is therefore, essential that the\nelemental ranges add up to 100% and an element to be given as a balance of\nthe composition.\n\nTaking that into account, the alloying elements disclosed in dependent\nclaims 2-5 should be included together with their corresponding ranges in the\nindependent product claim as optional elements, for a 100% disclosure of the\ncomposition.\n\n1.2 It appears from the description that the following features are also essential to\nthe definition of the invention:\n\nEPO Form 1703 01.91TRI

加工方法が決まった場合、PDFを画像化し、画像をOCRで読み込むPython関数の定義内の「上記の画像を加工したい場合はこちらに記述する」に記述すると画像加工処理を組み込むことができます。

最後に

Pythonにより、かなり簡単にPDFから文字抽出を行うことができたと思います。
ある程度文書構成が決まっていた場合には、文字の自動抽出を行うことができるようになるでしょう。

このような技術があれば、医療・医薬品業界や司法関連業務の半自動化が進むといいですね。
ヒトは解釈など機械にできないことに時間を使うべきであって、単純作業に時間を取られるべきではないです。

参考

PythonでPDFからテキストを読み取る方法について
 PythonでOCRを実行する方法
 Python, Pillowで画像の一部をトリミング（切り出し/切り抜き）