Parsing unclosed `<br>` tags with BeautifulSoup


Question

BeautifulSoup has logic for closing consecutive <br> tags that doesn't do quite what I want it to do. For example,

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup('one<br>two<br>three<br>four')

The HTML would render as

one
two
three
four

I'd like to parse it into a list of strings, ['one','two','three','four']. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the <br> elements.

>>> bs('br')
[<br>two<br>three<br>four</br></br></br>,
 <br>three<br>four</br></br>,
 <br>four</br>]

Is there a simple way to get the result I want?

Solution

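import bs4 as bs
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = bs.BeautifulSoup('one<br>two<br>three<br>four', 'html.parser')
# text=True matches every text node (newer bs4 releases spell this string=True).
print(soup.find_all(text=True))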

yields (under Python 2; Python 3 prints the same list without the u prefixes)

[u'one', u'two', u'three', u'four']
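
If the real markup has spaces or newlines around the breaks, find_all(text=True) returns those padded strings unchanged. One possible way around that is bs4's `stripped_strings` generator. A minimal sketch, assuming the `html.parser` backend and a made-up sample string (not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the same four words, with stray whitespace added.
html = ' one <br>\n two<br> three <br>four\n'
soup = BeautifulSoup(html, 'html.parser')

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that contain only whitespace.
parts = list(soup.stripped_strings)
print(parts)
```

This walks the same text nodes as find_all(text=True), so it should not depend on whether the parser nests the <br> elements.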


Or, using lxml:

import lxml.html as LH
doc = LH.fromstring('one<br>two<br>three<br>four')
print(list(doc.itertext()))

yields

['one', 'two', 'three', 'four']
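
Both answers above collect every text node in the document. As a dependency-free alternative (not part of the original answer), the standard library's `html.parser.HTMLParser` can split the text explicitly at each <br>. A minimal sketch; `BrSplitter` and `split_on_br` are hypothetical names chosen for illustration:

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect text into segments, starting a new segment at every <br>."""

    def __init__(self):
        super().__init__()
        self.segments = ['']

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so the start-tag event alone marks the split.
        if tag == 'br':
            self.segments.append('')

    def handle_data(self, data):
        # Text may arrive in several chunks; append to the current segment.
        self.segments[-1] += data


def split_on_br(html):
    parser = BrSplitter()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    # Trim each segment and drop the empty ones.
    return [s.strip() for s in parser.segments if s.strip()]


print(split_on_br('one<br>two<br>three<br>four'))
```

Because the parser only reports events and never builds a tree, the tag-closing question does not arise; the trade-off is that text inside any other element is split the same way, with no notion of nesting.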
