Python tabula-py error (pandas error?)

Views: 913 | Posted: 2020/5/25 5:20:29 | Tags: python pandas pdf tabula

Problem description

After some reading online I have decided to use tabula-py to extract tables from pdf files. We use Anaconda and I just installed tabula-py 1.1.1.

I wanted to start out with a simple script and see what it would do with a single page pdf file with some text and two tables ("table_p16.pdf").

Code:

from tabula import read_pdf
df = read_pdf("table_p16.pdf")

Error:

Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=c:\Windows\Sun\Java\Deployment\sam.security
Traceback (most recent call last):
  File "H:/Personlich/SVN/blademat_tb/blademat_toolbox/utility/read_pdf.py", line 41, in <module>
    df = read_pdf("table_p16.pdf")
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\tabula\wrapper.py", line 117, in read_pdf
    return pd.read_csv(io.BytesIO(output), **pandas_options)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 9, saw 9
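
The traceback shows the exception being raised inside pd.read_csv, which tabula-py calls on the output it gets back from Java. As a rough illustration, the same message can be reproduced with pandas alone, without tabula or Java. The CSV text below is made up for the example (it is not the real extraction output); it has 8 fields on lines 1 to 8 and 9 fields on line 9:

```python
import io

import pandas as pd

# Made-up CSV standing in for the extracted table text (an assumption,
# not the actual output for table_p16.pdf).
header = ",".join(f"c{i}" for i in range(8))            # line 1: 8 fields
body = [",".join(["x"] * 8) for _ in range(7)]          # lines 2-8: 8 fields
ragged = ",".join(["x"] * 9)                            # line 9: 9 fields
csv_bytes = ("\n".join([header, *body, ragged]) + "\n").encode("utf-8")


def try_parse(data):
    """Return the parsed DataFrame, or the ParserError if parsing fails."""
    try:
        return pd.read_csv(io.BytesIO(data))
    except pd.errors.ParserError as exc:
        return exc


result = try_parse(csv_bytes)
print(type(result).__name__, result)
```

If this small case fails in the same way, that would suggest the error comes from rows with differing numbers of cells in the extracted text rather than from the Java setup itself.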

Things I have tried:

Since the error seems to point to pandas, I tried reading a single-page PDF with one table. The same error occurs.
Set the user variable PATH to Java. This did not change anything. I can't set the system variable PATH to Java, since it is currently used for our SVN program.
Different code lines, with the same error:

df = read_pdf(r"table_p9.pdf")
df = read_pdf(r"table_p9.pdf")
df = read_pdf("table_p9.pdf", output_format='json')

I hope someone can chip in and help me figure out where the problem lies. It could be a Java issue, but I am not that familiar with the required Java interaction. Your help is much appreciated.

Edit

I tried different tables and some seem to be working. It has been difficult to identify what type of tables work. Some with 'merged' columns and others with 'merged' rows seem to work. But clearly not all. Also, I have not been able to read multiple tables (2 or 3) using the argument multiple_tables=True.
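
For reference, a minimal sketch of the multiple_tables=True call mentioned above. The helper name read_all_tables and its reader parameter are hypothetical and exist only so the function can be exercised without Java; pages and multiple_tables are read_pdf keyword arguments. Whether this copes with the PDFs in question is not established here:

```python
def read_all_tables(pdf_path, pages="all", reader=None):
    """Ask tabula-py for every table on the given pages, as a list.

    `reader` is a hypothetical hook for substituting a stand-in function;
    when omitted, tabula's read_pdf is imported (needs tabula-py and Java).
    """
    if reader is None:
        from tabula import read_pdf as reader
    # With multiple_tables=True, tabula-py returns a list of tables
    # instead of trying to parse everything into a single DataFrame.
    tables = reader(pdf_path, pages=pages, multiple_tables=True)
    return list(tables)


# Example call (commented out because it needs Java and the PDF file):
# tables = read_all_tables("table_p16.pdf")
# for i, table in enumerate(tables):
#     print(i, table.shape)
```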

Is there any source on what kinds of tables Tabula can handle? This makes me wonder whether Tabula is the right program to use. After all the reading I did, I was under the impression that Tabula would be good at this. The tables it seems to struggle with are not complex.

Is there a clear and simple source on how to maximize the use of Tabula? Or otherwise tips on how to deal with tables that Tabula struggles with?

Regards, Gabriel
