Extracting readable text from HTML using Python?

Problem description

I know about utilities like html2text and BeautifulSoup, but the problem is that they also extract the JavaScript and add it to the text, making it hard to separate the two.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternatively,

from stripogram import html2text
extract = html2text(webPage)

Both of these extract all the JavaScript on the page as well, which is undesired.

I just want to extract the readable text that you could copy from your browser.

Solution

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you, getting the root's immediate children which are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get strings that are immediate children of the root). You need to do this recursively; e.g., as a generator:

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s):
                yield x
        else:                       # it's a string!
            yield s

I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
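
For readers on the newer bs4 package, here is a minimal sketch of the same recursive idea. It is not the original answer's code, and it rests on a few assumptions: bs4 is installed, the built-in "html.parser" backend is good enough for your pages, and style tags and HTML comments are also unwanted (the question only mentions scripts). The names SKIPPED_TAGS, get_strings and readable_text are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumption: besides <script>, <style> content is not "readable" either.
SKIPPED_TAGS = {"script", "style"}


def get_strings(root):
    """Yield text nodes under root in document order, skipping SKIPPED_TAGS."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name in SKIPPED_TAGS:
                continue  # do not descend into script/style
            yield from get_strings(child)
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            yield child  # a plain text node (HTML comments are dropped)


def readable_text(html):
    """Join the non-empty text pieces with single spaces (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    pieces = (s.strip() for s in get_strings(soup))
    return " ".join(p for p in pieces if p)


if __name__ == "__main__":
    page = "<p>Hello <b>world</b></p><script>alert('x')</script>"
    print(readable_text(page))
```

Joining with single spaces is only one choice; if you need to keep paragraph breaks you would have to decide per tag where a newline belongs.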
