How to convert text extracted from a PDF to JSON or XML format in Python?
Question
I am using PyPDF2 to extract data from a PDF file and convert it to text.
The content of the PDF looks like this:
Name : John
Address: 123street , USA
Phone No: 123456
Gender: Male
Name : Jim
Address: 456street , USA
Phone No: 456899
Gender: Male
In Python I am using this code:
import PyPDF2
pdf_file = open('C:\\Users\\Desktop\\Sampletest.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
page_content
This is the output I get from page_content:
'Name : John \n \nAddress: 123street , USA \n \nPhone No: 123456\n \nGender: Male \n \n \nName : Jim \n \nAddress: 456street , USA \n \nPhone No: 456899\n \nGender: Male \n \n \n'
How do I format it as JSON or XML so that I can use the extracted data in a SQL Server database?
I also tried this approach:
import json
data = json.dumps(page_content)
formatj = json.loads(data)
print (formatj)
Output:
Name : John
Address: 123street , USA
Phone No: 123456
Gender: Male
Name : Jim
Address: 456street , USA
Phone No: 456899
Gender: Male
This is the same output that I have in my Word file, but I don't think it is in JSON format.
Recommended answer
Not so pretty, but I think this would get the job done. You get a dictionary, which json.dumps then serializes as nicely indented JSON.
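import json

def get_data(page_content):
    # Collect "key: value" lines into a dictionary.
    _dict = {}
    page_content_list = page_content.splitlines()
    for line in page_content_list:
        if ':' not in line:
            continue  # skip blank lines
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(':', 1)
        # A repeated key (e.g. a second "Name") overwrites the earlier value.
        _dict[key.strip()] = value.strip()
    return _dict

page_data = get_data(page_content)
json_data = json.dumps(page_data, indent=4)
print(json_data)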
Or, instead of those last 3 lines, just do this:
print(json.dumps(get_data(page_content), indent=4))
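One caveat: the sample text holds two records with the same keys, so a single dictionary keeps only the last one (Jim). Here is a minimal sketch that collects one dictionary per record instead, assuming every record starts with a Name line. The names parse_records and first_key are made up for this illustration.

```python
import json


def parse_records(page_content, first_key="Name"):
    """Split 'key: value' lines into one dict per record.

    Assumption: every record begins with a `first_key` line.
    """
    records = []
    current = None
    for line in page_content.splitlines():
        if ":" not in line:
            continue  # skip blank or whitespace-only lines
        # Split on the first colon only, then trim both parts.
        key, value = (part.strip() for part in line.split(":", 1))
        if key == first_key or current is None:
            # Start a new record whenever the first key shows up again.
            current = {}
            records.append(current)
        current[key] = value
    return records


# The extracted text shown in the question.
sample = (
    "Name : John \n \nAddress: 123street , USA \n \nPhone No: 123456\n \n"
    "Gender: Male \n \n \nName : Jim \n \nAddress: 456street , USA \n \n"
    "Phone No: 456899\n \nGender: Male \n \n \n"
)
records = parse_records(sample)
print(json.dumps(records, indent=4))
```

This prints a JSON array with one object per person, which should map more directly onto rows in a database table than a single dictionary.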
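The question also asks about XML. As a rough sketch using the standard library's xml.etree.ElementTree, and assuming the data is already available as a list of dictionaries like the sample below, something along these lines could work. The people and person tag names and the records_to_xml helper are hypothetical choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Sample records copied from the question; in practice they would come
# from whatever parser produced the dictionaries.
records = [
    {"Name": "John", "Address": "123street , USA",
     "Phone No": "123456", "Gender": "Male"},
    {"Name": "Jim", "Address": "456street , USA",
     "Phone No": "456899", "Gender": "Male"},
]


def records_to_xml(records, root_tag="people", record_tag="person"):
    """Build an XML string with one child element per record."""
    root = ET.Element(root_tag)
    for record in records:
        node = ET.SubElement(root, record_tag)
        for key, value in record.items():
            # XML tag names cannot contain spaces: "Phone No" -> "Phone_No".
            child = ET.SubElement(node, key.replace(" ", "_"))
            child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = records_to_xml(records)
print(xml_text)
```

Replacing spaces with underscores only covers the keys in this sample; other keys may need further cleanup before they are valid XML tag names.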