How to extract text from PDFs in folders with Python and save them in a DataFrame?


Question

I have many folders, each containing a couple of PDF files (other file types such as .xlsx or .doc are there as well). My goal is to extract the text of the PDFs in each folder and build a DataFrame in which each record is a "Folder Name" and each column holds the text content of one PDF file in that folder, as a string.

I managed to extract the text from a single PDF file with the tika package (code below), but I cannot work out how to loop over the other PDFs in the folder, or over the other folders, to construct a structured DataFrame.

# import parser object from tika
from tika import parser   
  
# opening pdf file 
parsed_pdf = parser.from_file("ducument_1.pdf") 
  
# saving content of pdf
# parsed_pdf['content'] returns the extracted text as a string;
# the file's metadata is available under parsed_pdf['metadata']
data = parsed_pdf['content']  
  
# Printing of content  
print(data) 
  
# <class 'str'> 
print(type(data))

The desired output should look like this:

Folder_Name  pdf1          pdf2
17534        text of pdf1  text of pdf2
63546        text of pdf1  text of pdf1
26374        text of pdf1  -

Recommended answer

If you want to find all the PDFs in a directory and its subdirectories, you can use os.walk and glob; see "Recursive sub folder search and return files in a list python". I've gone for a slightly longer form so that it is easier for beginners to follow what is happening.

Then, for each file, call Apache Tika and save the result to the next row of the Pandas DataFrame.

#!/usr/bin/python3

import os, glob
from tika import parser 
from pandas import DataFrame

# What file extension to find, and where to look from
ext = "*.pdf"
PATH = "."

# Find all the files with that extension
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    files += glob.glob(os.path.join(dirpath, ext))

# Create a Pandas Dataframe to hold the filenames and the text
df = DataFrame(columns=("filename","text"))

# Process each file in turn, parsing with Tika and storing in the dataframe
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]

# For debugging, print what we found
print(df)
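
The script above produces one row per PDF file, whereas the question asks for one row per folder with one column per PDF. The following is a minimal sketch of one way to reshape the results, not part of the original answer. The function names group_texts_by_folder and rows_to_dataframe are hypothetical, and the sketch assumes that the "folder name" is the immediate parent directory of each file and that the PDFs in a folder are numbered pdf1, pdf2, ... in sorted path order.

```python
import os
from collections import defaultdict


def group_texts_by_folder(records):
    """Turn (path, text) pairs into one row dict per folder.

    Assumption: the folder name is the immediate parent directory of
    each file, and PDFs are numbered in sorted path order.
    """
    by_folder = defaultdict(list)
    # Sort by path only, so the numbering is stable and a text of None
    # (nothing extracted) is never compared.
    for path, text in sorted(records, key=lambda rec: rec[0]):
        folder = os.path.basename(os.path.dirname(path))
        by_folder[folder].append(text)

    rows = []
    for folder, texts in by_folder.items():
        row = {"Folder_Name": folder}
        for number, text in enumerate(texts, start=1):
            row[f"pdf{number}"] = text
        rows.append(row)
    return rows


def rows_to_dataframe(rows):
    """Build a wide DataFrame; folders with fewer PDFs get missing values."""
    # Imported here so the grouping step can be used without pandas.
    import pandas as pd
    return pd.DataFrame(rows)


if __name__ == "__main__":
    # Made-up sample data standing in for the (filename, text) pairs
    # collected by the loop over Tika results.
    sample = [
        ("./17534/report_b.pdf", "text of pdf2"),
        ("./17534/report_a.pdf", "text of pdf1"),
        ("./26374/only.pdf", "text of pdf1"),
    ]
    print(rows_to_dataframe(group_texts_by_folder(sample)))
```

To use this with the script above, the pairs could be taken from the existing DataFrame, for example with zip(df["filename"], df["text"]). Building the wide table once from a list of dicts also avoids growing a DataFrame one row at a time.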
