Parsing unclosed `<br>` tags with BeautifulSoup
Problem description
BeautifulSoup has logic for closing consecutive `<br>` tags that doesn't do quite what I want. For example,
>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup('one<br>two<br>three<br>four')
The HTML would render as
one
two
three
four
I'd like to parse it into a list of strings, ['one','two','three','four']. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the `<br>` elements.
>>> bs('br')
[<br>two<br>three<br>four</br></br></br>,
<br>three<br>four</br></br>,
<br>four</br>]
Is there a simple way to get the result I want?
Solution
import bs4 as bs
soup = bs.BeautifulSoup('one<br>two<br>three<br>four')
print(soup.find_all(text=True))
yields
[u'one', u'two', u'three', u'four']
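On a recent bs4 release under Python 3, the same idea might look like the sketch below. It passes the parser name explicitly (the built-in `html.parser`, which should avoid the "no parser was explicitly specified" warning) and uses `stripped_strings` in place of `find_all(text=True)`, so whitespace-only text nodes are skipped. The helper name `br_separated_strings` is made up for illustration and is not part of the original answer.

```python
from bs4 import BeautifulSoup


def br_separated_strings(html):
    """Return the text nodes of an HTML fragment as a list of strings.

    Hypothetical helper for illustration; assumes a recent bs4 release.
    """
    # Name the parser explicitly so the result does not depend on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields every text node with surrounding
    # whitespace removed and skips nodes that are only whitespace.
    return list(soup.stripped_strings)


parts = br_separated_strings("one<br>two<br>three<br>four")
print(parts)
```

Note that this walks all text nodes, so text inside nested inline tags (for example a `<b>` element between two `<br>` tags) comes back as separate list items.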
Or, using lxml:
import lxml.html as LH
doc = LH.fromstring('one<br>two<br>three<br>four')
print(list(doc.itertext()))
yields
['one', 'two', 'three', 'four']
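If neither library is available, a similar result can be approximated with the standard library's `html.parser` module. This is a swapped-in technique, not something the original answer proposes: it is a minimal sketch that collects text nodes and ignores tags entirely, and the names `TextCollector` and `text_chunks` are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text nodes, ignoring all tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_chunks(html):
    collector = TextCollector()
    collector.feed(html)
    # close() flushes any text still buffered at the end of the input.
    collector.close()
    return collector.chunks


chunks = text_chunks("one<br>two<br>three<br>four")
print(chunks)
```

Because the parser never builds a tree, the question of how `<br>` gets closed does not arise; the trade-off is that there is no document structure to query afterwards.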