Extracting text from HTML file using Python


Question


I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
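
To make the two requirements above concrete (skip script content, decode entities such as &#39;), here is a minimal sketch using only the standard library's `html.parser`. It is not part of the original question; the names `TextExtractor` and `extract_text` are hypothetical, and putting each text chunk on its own line is a simplification that will not match a browser's copy-and-paste layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &#39; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        # Simplification: one chunk per line, regardless of inline markup.
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><script>var x = 1;</script></head>"
        "<body><p>It&#39;s a test</p></body></html>"
    )
    print(extract_text(sample))
```

The parser is tolerant of malformed markup, which addresses the concern about regular expressions, but it makes no attempt to reproduce block layout or whitespace the way a browser does.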


Solution

html2text is a Python program that does a pretty good job at this.
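
Since the answer gives no usage example, here is a minimal sketch of calling html2text as a library. It assumes the third-party `html2text` package is installed (for example via `pip install html2text`); the helper name `html_to_text` is hypothetical. The options shown reduce some of the Markdown syntax, but the output is still Markdown-flavoured (emphasis markers and headings remain), as the question notes.

```python
def html_to_text(html):
    """Convert an HTML string to Markdown-flavoured text (hypothetical helper)."""
    # Imported here so the dependency on the third-party package is explicit.
    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = True    # drop link URLs, keep the link text
    converter.ignore_images = True   # drop image markup
    converter.body_width = 0         # do not hard-wrap long lines
    return converter.handle(html)


if __name__ == "__main__":
    sample = "<p>It&#39;s <b>here</b></p><script>var x = 1;</script>"
    print(html_to_text(sample))
```

Getting strictly plain text from this would need a further pass to strip the remaining Markdown markers, which this sketch does not attempt.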
