How to extract text from a PDF file in Python?

Views: 135  Posted: 2020/7/4 21:22:43  Tags: python, pypdf

Question

How can I extract text from a PDF file in Python?

I tried the following:

import sys
import pyPdf

def convertPdf2String(path):
      content = ""
      pdf = pyPdf.PdfFileReader(file(path, "rb"))
      for i in range(0, pdf.getNumPages()):
          content += pdf.getPage(i).extractText() + " \n"
          content = " ".join(content.replace(u"\xa0", u" ").strip().split())
      return content

f = open('a.txt','w+')

f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
f.close()
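
The two text-handling steps in the attempt above (collapsing whitespace, then writing ASCII with XML character references) can be checked on their own, without a PDF. The following is a minimal Python 3 sketch; the helper names normalize_whitespace and to_ascii_xml are illustrative and are not part of pyPdf.

```python
def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to single spaces."""
    return " ".join(text.replace("\xa0", " ").split())


def to_ascii_xml(text):
    """Encode to ASCII bytes; each non-ASCII character becomes a &#NNN; reference."""
    return text.encode("ascii", "xmlcharrefreplace")


if __name__ == "__main__":
    sample = "caf\xe9\xa0 menu \n  page 2"
    cleaned = normalize_whitespace(sample)
    print(cleaned)
    print(to_ascii_xml(cleaned))
```

Both helpers only tidy and re-encode whatever characters they are given, so they cannot repair text that the extraction step has already returned incorrectly.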

But the result looks like this instead of readable text:

728;ˇˆ˜ ˚ˇˇ!""˘ˇˆ˙ˆ˝˛˛˛˛ˆ˜ˆ ˆ ˆ˘ˆ˛˙ˆ"ˆ˘"ˆˆˆ˜#$˙ˆ˚ˆ %&ˆ ˘˛ˆ˜'˙˙%˝˛ˆˇ˙ ˜ˆˆ˜'ˆ ˇˆ#$%&('%$&))$$+%#,-.+&&˝())˝)˝+,,-./012)(˝)*˝+,-3˙ˆ/0245)6#57+82,55)6#57+,+2,+ /!#!!&˘˘1"%˘20˛˛3ˆ07%4!˘"6 ˛ˆ ˝ˆ ˆ˘&/&4"9ˆ %6ˇ%4%4&5˘2)˘˘˛%:6(

Recommended answer

If you are running Linux or macOS, you can call the ps2ascii command from your code:

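import subprocess

input_path = "someFile.pdf"
output_path = "out.txt"
# Pass the arguments as a list so file names containing spaces need no shell quoting.
subprocess.run(["ps2ascii", input_path, output_path], check=True)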
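
If the text is needed as a Python string rather than in a file, ps2ascii can be run without an output argument so that it writes to standard output. This is only a sketch: it assumes ps2ascii (distributed with Ghostscript) is on the PATH, and build_ps2ascii_command and pdf_to_text are hypothetical helper names.

```python
import shutil
import subprocess


def build_ps2ascii_command(pdf_path, txt_path=None):
    """Return the argument list; with no txt_path, ps2ascii prints to stdout."""
    command = ["ps2ascii", pdf_path]
    if txt_path is not None:
        command.append(txt_path)
    return command


def pdf_to_text(pdf_path):
    """Run ps2ascii on pdf_path and return its standard output as a string."""
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found; it is usually installed with Ghostscript")
    result = subprocess.run(
        build_ps2ascii_command(pdf_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Example call (requires a real PDF and an installed ps2ascii):
# text = pdf_to_text("someFile.pdf")
```

Checking for the executable first gives a clearer error on systems where Ghostscript is not installed.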
