Convert data from PDFform to CSV


Problem description


I am trying to convert the data entered in multiple fillable PDF forms into one csv file.
The code consists of a few steps:

  1. Open new .csv file (header row)
  2. Open multiple pdf-forms with "for...in" loop
  3. Convert data entered in form-fields to csv

However, when running the command I receive the error:

fc-int01-generateAppearances: None
Traceback (most recent call last):
    File "C:\Python27\Scripts\test3.py", line 31, in <module>
        writer.writerow(value)
    _csv.Error: sequence expected

If I just print the value (form data) in Python, it works, but writing the data to the csv file does not. There may also be a problem with going from rows to columns with the values. I hope I am clear.

Here is my code:

import glob
import os
import sys
import csv
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

#input file path for specific file
#filename = "C:\Python27\Scripts\MH_1.pdf"
#fp = open(filename, 'rb')

#open new csv file
out_file=open('C:\Users\Wonen\Downloads\Test\output.csv', 'w+')
writer = csv.writer(out_file)
#header row
writer.writerow(('Name coordinator', 'Date', 'Address', 'District',
                 'City', 'Complaintnr'))

#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print '{0}: {1}'.format(name, value)
        writer.writerow(value)


The output for one pdf (including all output), using print(repr(value)):

None
'Crip Gang'
None
None
None
/Ja
None
/1
/1
None
None
/Ja
/Ja
None
None
None
'wfwf'
'sd'
'dfwf'
'ffasf'
'tsdbd'
'dfadfasdf'
None
'df'
None
'asdff'
None
'wff'
None
'ffs'
None
None
None
None
None
None
None
None
None
None
None
'1'
'2'
'7'
/0
'Ja'
'Two unlimited'
'Captain Jack'
None
'www.kijkbijmij.nl'
'Onderverhuur'
/Ja

etc. "None" stands for an empty text box, and "1" and "0" stand for "yes" and "no" outputs.

Solution

Try changing the last part of your code as shown:

    .
    .
    .
#enter folder path to open multiple files
path = 'C:\Users\Wonen\Downloads\Test'
for filename in glob.glob(os.path.join(path, '*.pdf')):
    fp = open(filename, 'rb')
    #read pdf's
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    #doc.initialize()    # <<if password is required
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    row = []
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        row.append(value)
    writer.writerow(row)

out_file.close()

It's not clear this will work, but it may provide you with the information you need to solve your problem.
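
For background, the original error comes from handing writer.writerow a single field value, while it expects a sequence with one item per column. Here is a minimal sketch of that behaviour, written for Python 3 (the question uses Python 2.7, where the message reads "sequence expected"). The write_rows helper and the sample values are made up for illustration, and an in-memory buffer stands in for the real output file:

```python
import csv
import io


def write_rows(rows):
    """Write each row with csv.writer and return the resulting CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


# A list of values becomes one CSV line, one column per value.
# None is written as an empty cell.
good = write_rows([["Crip Gang", None, "1"]])

# A bare None is not a sequence, so the csv module raises csv.Error.
try:
    write_rows([None])
    error_raised = False
except csv.Error:
    error_raised = True

# A bare string is accepted, but it is split into one column per character.
split = write_rows(["sd"])
```

So collecting the values into a list first, as in the loop above, avoids the error, and it also avoids the quieter problem of a string value being spread across columns.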

One confusing thing is the first (header) row of the csv:

writer.writerow(('Name coordinator', 'Date', 'Address','District','City', 'Complaintnr'))

It defines how many values each written row is expected to contain. This means that each row should be a list consisting of data for those 6 items, in that order.

You need to figure out how to translate what's in each group of fields into a row list of 6 data items. That is what the code in my answer tries to do, I think, but I can't test it.
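
One possible way to get from the form fields to a 6-item row is to collect the values by field name and then pick them out in header order. The sketch below is for Python 3 and rests on assumptions: the field names in the PDF are not known from the question, so it pretends they match the header names; to_text and build_row are hypothetical helpers; and it assumes literal values such as /Ja expose their text through a name attribute (a SimpleNamespace stands in for such a value here, so the example runs without pdfminer):

```python
from types import SimpleNamespace

HEADER = ["Name coordinator", "Date", "Address",
          "District", "City", "Complaintnr"]


def to_text(value):
    """Turn one form value into plain text suitable for a CSV cell."""
    if value is None:                      # empty text box
        return ""
    name = getattr(value, "name", None)    # assumption: /Ja-style literals have .name
    if name is not None:
        value = name
    if isinstance(value, bytes):           # raw PDF strings may arrive as bytes
        return value.decode("utf-8", errors="replace")
    return str(value)


def build_row(form, columns=HEADER):
    """Pick the values for the header columns, in header order."""
    return [to_text(form.get(column)) for column in columns]


# Stand-in for the name -> value pairs collected from one PDF.
sample = {
    "Name coordinator": b"Crip Gang",
    "Date": None,
    "City": SimpleNamespace(name="Ja"),
    "Unused field": "x",                   # not in the header, so it is dropped
}
row = build_row(sample)
```

In the real loop you would fill the dict with form[name] = value for each field and then call writer.writerow(build_row(form)). If the PDF field names differ from the header names, a second dict mapping header names to field names would be needed.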
