Extracting text from HTML file using Python

Views: 849 Published: 2018/6/13 9:31:35 Tags: python html text html-content-extraction


Question

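I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.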

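I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.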
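The entity behaviour described above can be checked in isolation. This is a minimal sketch using `html.unescape` from the Python 3 standard library; the sample string and variable names are made up for illustration and are not from the original question.

```python
import html

# Illustrative input: numeric (&#39;) and named (&quot;, &amp;) character references.
source = "It&#39;s a &quot;test&quot; &amp; more"

# html.unescape converts character references to the characters they stand for.
decoded = html.unescape(source)
print(decoded)
```

This only decodes entities in a string; it does not remove tags or script content, so it covers one part of the problem.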

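Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.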

Related questions:

Filter out HTML tags and resolve entities in Python

Convert XML/HTML entities into Unicode string in Python

Solution

html2text is a Python program that does a pretty good job at this.
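If plain text rather than Markdown is wanted and a third-party dependency is not an option, a rough alternative can be sketched with the standard library's `html.parser`. This is not how html2text works internally; the class name `TextExtractor`, the helper `html_to_text`, and the sample markup are all hypothetical names chosen for this sketch. It assumes that skipping `<script>` and `<style>` contents and decoding entities is enough for the page at hand.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as &#39;
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = """<html><head><script>var hidden = "not text";</script></head>
<body><h1>Hello</h1><p>It&#39;s plain text.</p></body></html>"""

print(html_to_text(sample))
```

The sketch puts every text node on its own line, so inline markup such as `<b>` inside a sentence will split that sentence across lines. It also does not try to reproduce browser layout for tables or lists, which is where a dedicated tool like html2text is likely to do better.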

