Looping through PDF files with tabulizer in Python

Views: 36 | Posted: 2022/3/30 21:07:10 | Tags: python, pdf, extraction, tabula


Problem description

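I am having trouble getting a piece of code to work. I want to loop over the PDF files in a folder, extract whatever the tabula package identifies as tables, load them into dataframes, and write all the tables from a given PDF into a single CSV file.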

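I have looked at this post (and several others), but I still cannot get it to work. The script appears to loop over the files and extract some tables, but it does not seem to process every file, and I cannot get it to write all of the dataframes to the CSV file. Only the last dataframe ends up in the CSV.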

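This is what I have so far. Any help would be much appreciated, especially on how to loop over the files correctly and how to write all the tables from one PDF into one CSV file. I am stuck...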

pdf_folder = 'C:\\PDF extract\\pdf\\'
csv_folder = 'C:\\PDF extract\\csv\\'

paths = [pdf_folder + fn for fn in os.listdir(pdf_folder) if fn.endswith('.pdf')]
for path in paths:
    listdf = tabula.read_pdf(path, encoding = 'latin1', pages = 'all', nospreadsheet = True,multiple_tables=True)
    path = path.replace('pdf', 'csv')
    for df in listdf: (df.to_csv(path, index = False))

Recommended answer

As @Scott Hunter mentioned, you never use csv_folder.
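One way to put csv_folder to use is to build each output path from it instead of string-replacing 'pdf' inside the input path. The following is a minimal sketch, not part of the original answer: the helper name csv_path_for and the file name report.pdf are hypothetical, and PureWindowsPath is used only so that the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath


def csv_path_for(pdf_path, csv_folder):
    """Return the path in csv_folder that matches a PDF's file name (hypothetical helper)."""
    # Keep only the file name and swap its extension; the folder part is discarded.
    csv_name = PureWindowsPath(pdf_path).with_suffix('.csv').name
    # Join the new file name onto the CSV output folder.
    return str(PureWindowsPath(csv_folder) / csv_name)


print(csv_path_for('C:\\PDF extract\\pdf\\report.pdf', 'C:\\PDF extract\\csv\\'))
```

Unlike path.replace('pdf', 'csv'), this only touches the extension, so a file name that itself contains 'pdf' keeps its name.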

Also, I think you are overwriting the .csv file that you create:

for df in listdf: (df.to_csv(path, index = False))

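The path variable stays the same on every iteration of this for loop, so each table is written over the previous one and only the last table is left in the file.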

Edit: you should probably try something like this:

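import os
import pandas as pd
import tabula

pdf_folder = 'C:\\PDF extract\\pdf\\'
paths = [pdf_folder + fn for fn in os.listdir(pdf_folder) if fn.endswith('.pdf')]

for path in paths:
    # With multiple_tables=True, read_pdf returns a list of dataframes, one per detected table.
    listdf = tabula.read_pdf(path, encoding = 'latin1', pages = 'all', nospreadsheet = True,multiple_tables=True)
    # replace() swaps every 'pdf' in the path, so both the folder name and the extension become 'csv'.
    path = path.replace('pdf', 'csv')
    # Combine all tables from this PDF and write them to the CSV file in one call.
    df_concat = pd.concat(listdf)
    df_concat.to_csv(path, index = False)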
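If the tables in one PDF have different columns, pd.concat aligns them on column names and fills the gaps with NaN. A possible alternative, sketched here as an assumption rather than as part of the original answer, is to open the CSV file once and write each table to the same file handle in turn. The helper name write_tables is hypothetical, and tables is assumed to be the list of dataframes returned by tabula.read_pdf.

```python
def write_tables(tables, csv_path):
    """Write every table to one CSV file, one after another (hypothetical helper).

    `tables` is assumed to be a list of pandas DataFrames, for example the
    result of tabula.read_pdf(..., multiple_tables=True).
    """
    # Open the file once, so later tables are added after earlier ones
    # instead of overwriting them.
    with open(csv_path, 'w', newline='', encoding='utf-8') as fh:
        for i, df in enumerate(tables):
            if i > 0:
                fh.write('\n')  # blank line between consecutive tables
            df.to_csv(fh, index=False)
```

Inside the loop over PDF files this would be called as write_tables(listdf, path) in place of the pd.concat and to_csv lines. Each table keeps its own header row, which suits reading by eye more than re-importing the file as a single dataframe.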
