Python tabula-py error (pandas error?)

Question

After some reading online I have decided to use tabula-py to extract tables from pdf files. We use Anaconda and I just installed tabula-py 1.1.1.

I wanted to start out with a simple script and see what it would do with a single page pdf file with some text and two tables ("table_p16.pdf").

Code:

from tabula import read_pdf
df = read_pdf("table_p16.pdf")

Error:

Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=c:\Windows\Sun\Java\Deployment\sam.security
Traceback (most recent call last):
  File "H:/Personlich/SVN/blademat_tb/blademat_toolbox/utility/read_pdf.py", line 41, in <module>
    df = read_pdf("table_p16.pdf")
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\tabula\wrapper.py", line 117, in read_pdf
    return pd.read_csv(io.BytesIO(output), **pandas_options)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 9, saw 9

Things I have tried:

  • Since the error seems to point to a problem with pandas, I tried to read a single-page pdf with one table. The same error occurs.
  • Set the user variable PATH to Java. This did not change anything. I can't set the system variable PATH to Java, since it is currently used for our SVN program.
  • Different code lines, with the same error:

df = read_pdf(r"table_p9.pdf")
df = read_pdf("table_p9.pdf", output_format='json')

I hope someone can chip in and help me figure out where the problem lies. It could be a Java issue, but I am not that familiar with the required Java interaction. Your help is much appreciated.

Edit

I tried different tables and some seem to be working. It has been difficult to identify what type of tables work. Some with 'merged' columns and others with 'merged' rows seem to work. But clearly not all. Also, I have not been able to read multiple tables (2 or 3) using the argument multiple_tables=True.

Is there any source on what kinds of tables Tabula can handle? This makes me wonder whether Tabula is the right program to use. After all the reading I did, I was under the impression that Tabula would be good at this. The tables it seems to struggle with are not complex.

Is there a clear and simple source on how to maximize the use of Tabula? Or otherwise tips on how to deal with tables that Tabula struggles with?

Regards, Gabriel

Recommended answer

Here are rough guidelines for tabula (or tabula-py) options.

1) Merged cells in a table with ruling lines
You can use the lattice=True option. In lattice mode, tabula uses the table's ruling lines to find cell boundaries. Note that you may need some post-processing, such as fillna, for merged cells. In my experience, some merged columns are extracted left-justified.
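The post-processing for merged cells can be sketched in plain Python. This is only an illustration under stated assumptions: `forward_fill_column` is a hypothetical helper, the sample rows are made up, and a merged cell is assumed to show up as an empty string or None in every row after the first. On a pandas DataFrame, `DataFrame.ffill()` does the same job.

```python
def forward_fill_column(rows, col, blanks=("", None)):
    """Copy the last non-blank value in column `col` down into blank cells."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # copy so the caller's rows are left untouched
        if row[col] in blanks:
            row[col] = last
        else:
            last = row[col]
        filled.append(row)
    return filled


# Made-up rows: "Group A" and "Group B" were merged cells spanning two rows each.
rows = [
    ["Group A", "x", 1],
    ["", "y", 2],
    ["Group B", "z", 3],
    ["", "w", 4],
]
for row in forward_fill_column(rows, 0):
    print(row)
```

Blank detection is the fragile part: if the extraction yields NaN instead of an empty string, the `blanks` check needs adjusting.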

AFAIK, it's pretty hard for tabula to extract merged cells from a table without ruling lines.

The general tuning points for tabula are lattice, stream, and guess.
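One way to use these tuning points is to choose the extraction mode from what you already know about the PDF. The sketch below is an assumption about how you might organize that choice, not part of tabula-py: `choose_mode_options` and `extract` are hypothetical helpers, while `lattice`, `stream`, `guess`, and `pages` are keyword arguments of `read_pdf`.

```python
def choose_mode_options(has_ruling_lines, guess=True, pages=1):
    """Build keyword arguments for tabula.read_pdf (hypothetical helper)."""
    options = {"pages": pages, "guess": guess}
    # lattice: cell boundaries come from drawn ruling lines.
    # stream: cell boundaries come from whitespace between columns.
    options["lattice" if has_ruling_lines else "stream"] = True
    return options


def extract(pdf_path, **kwargs):
    # Imported lazily: needs tabula-py installed and a Java runtime available.
    from tabula import read_pdf
    return read_pdf(pdf_path, **choose_mode_options(**kwargs))


# Example call (not executed here):
# df = extract("table_p16.pdf", has_ruling_lines=True)
print(choose_mode_options(True))
```

If one mode fails on a given table, trying the other, or turning `guess` off, is usually the first thing to vary.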

2) Multiple tables within one or more pages
This is a tabula-py specific option: you have to use multiple_tables=True.

By default, tabula-py extracts tables via CSV. This approach benefits from pandas.read_csv features such as column-name inference, but read_csv assumes a single table with a consistent number of columns in the PDF. When rows have different numbers of columns, pandas.read_csv raises a ParserError.
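To see why the question's error reads "Expected 8 fields in line 9, saw 9", it can help to look for rows that are wider than the first row of the intermediate CSV. The following is a rough diagnostic using only the standard library; `rows_wider_than_header` is a hypothetical helper and the sample text is made up to mimic two tables of different widths stacked in one CSV.

```python
import csv
import io


def rows_wider_than_header(csv_text):
    """Return (line_number, field_count) for rows with more fields than row 1."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])
    return [
        (line_no, len(row))
        for line_no, row in enumerate(rows, start=1)
        if len(row) > expected
    ]


# Made-up CSV text: a 2-column table followed by a 3-column table.
sample = "a,b\n1,2\nx,y,z\n3,4,5\n"
print(rows_wider_than_header(sample))  # [(3, 3), (4, 3)]
```

The line numbers assume no quoted field contains a newline. Rows with fewer fields than the first row are not reported, since read_csv pads those with missing values rather than failing.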

On the other hand, with the multiple_tables option, tabula-py creates DataFrames via JSON, which can represent multiple tables.
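A minimal sketch of the multiple-table path, assuming tabula-py and a Java runtime are installed. `read_tables` and `describe` are hypothetical helper names; with `multiple_tables=True`, `read_pdf` returns a list of DataFrames rather than a single one, so the caller has to iterate.

```python
def read_tables(pdf_path, pages="all"):
    # Imported lazily: needs tabula-py installed and a Java runtime available.
    from tabula import read_pdf
    # One DataFrame per detected table, collected in a list.
    return read_pdf(pdf_path, pages=pages, multiple_tables=True)


def describe(tables):
    """Return one summary line per table, using each object's .shape."""
    return [
        f"table {i}: {t.shape[0]} rows x {t.shape[1]} columns"
        for i, t in enumerate(tables)
    ]


# Example call (not executed here):
# for line in describe(read_tables("table_p16.pdf")):
#     print(line)
```

Printing the shapes first is a quick way to check whether the tables were split as expected before cleaning each one.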

One more option: from tabula-py 1.3.0, you can use Tabula app templates with tabula-py. By taking the area data from a template, you can extract tables more reliably with accurate area information.
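To illustrate how a saved selection maps onto extraction options, here is a sketch that converts template-style JSON into `area` and `pages` values. The key names (`page`, `x1`, `y1`, `x2`, `y2`) are an assumption about the template layout, and `template_to_options` is a hypothetical helper. tabula-py's built-in template support should be preferred in practice; this only shows the coordinate mapping.

```python
import json


def template_to_options(template_json):
    """Convert Tabula-app style selections into read_pdf-style keyword arguments."""
    options = []
    for sel in json.loads(template_json):
        options.append({
            "pages": sel["page"],
            # tabula expects area as (top, left, bottom, right).
            "area": [sel["y1"], sel["x1"], sel["y2"], sel["x2"]],
            # An explicit area makes automatic table detection unnecessary.
            "guess": False,
        })
    return options


# Made-up template with a single selection on page 1.
sample = '[{"page": 1, "x1": 50.0, "y1": 100.0, "x2": 550.0, "y2": 300.0}]'
print(template_to_options(sample))
```

Each resulting dict could then be passed to `read_pdf` as keyword arguments, one call per selection.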
