Extracting text from HTML file using Python

Views: 48 Published: 2021/12/1 13:06:47 Tags: python html text html-content-extraction

Problem description

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.
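To make the two requirements above concrete (skip JavaScript, decode entities), here is a minimal sketch that uses only the standard library's `html.parser`. The names `TextExtractor` and `html_to_text` and the `BLOCK_TAGS` set are assumptions made for illustration, not part of any library. It will not reproduce a browser's copy-and-paste layout exactly; it only illustrates the behavior being asked for.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}
    # Hypothetical, non-exhaustive set of tags treated as line breaks.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &#39; and &amp; before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        raw_lines = "".join(self._chunks).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


if __name__ == "__main__":
    sample = (
        "<html><head><script>var x = 1;</script></head>"
        "<body><p>It&#39;s a test</p><p>A &amp; B</p></body></html>"
    )
    print(html_to_text(sample))
```

Because `html.parser` is tolerant of malformed markup, this approach avoids the fragility of regular expressions, though severely broken documents may still yield oddly placed line breaks.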

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
