Extracting text from HTML file using Python

Question

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.
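To make the two requirements concrete (skip JavaScript, decode entities), here is a minimal sketch built on the standard library's html.parser module. It is an illustration only, not part of the original post: the names TextExtractor, extract_text, and the BLOCK_TAGS set are hypothetical, and the list of block-level tags is an assumption that a real page may need extended.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}
    # Assumed set of tags that should end a line; extend as needed.
    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &#39; or &amp; before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br" and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Text inside <script> or <style> is dropped.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in raw.splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><script>var hidden = 1;</script></head>"
    "<body><p>It&#39;s a <b>test</b>.</p><p>Second paragraph.</p></body></html>"
)
print(extract_text(sample))
```

This only approximates what a browser copy-and-paste produces: it knows nothing about CSS, hidden elements, or tables, so treat it as a starting point rather than a complete solution.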

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.

Recommended answer

html2text is a Python program that does a pretty good job at this.
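For readers who want to try it, the following is a minimal sketch of calling html2text from Python. It assumes the third-party html2text package from PyPI is installed (pip install html2text) and uses its HTML2Text class with a few of its options to reduce the Markdown markup in the output. The wrapper name html_to_text is hypothetical, and the result may still contain some Markdown syntax (for example list or heading markers).

```python
def html_to_text(html):
    # Imported lazily because html2text is a third-party package,
    # not part of the standard library.
    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = True      # drop link markup, keep link text
    converter.ignore_images = True     # drop image markup
    converter.ignore_emphasis = True   # no * or _ around bold/italic text
    converter.body_width = 0           # do not hard-wrap long lines
    return converter.handle(html).strip()


if __name__ == "__main__":
    print(html_to_text("<p>It&#39;s a <b>test</b>.</p>"))
```

Whether the remaining Markdown markers are acceptable depends on the input; heavily structured pages may still need a post-processing step.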
