Parsing HTML with BeautifulSoup
Problem description
(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)
I'm trying to extract the text of the review at the bottom. I've tried this:
y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text
The problem is that there is unwanted text in the unexpanded div tags that becomes tedious to remove from the content of the review. For the life of me, I just can't figure this out. Could someone please help me?
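To make the problem concrete, here is a minimal sketch of why `.text` on the outer div is awkward. It assumes a trimmed-down stand-in for the page markup and the built-in `html.parser` backend; both are illustrative choices, not part of the original post. `.text` concatenates the strings of every descendant, so the vote count and the "This review is from" header come along with the review body.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page markup (an assumption for illustration).
sample = (
    '<div style="margin-left:0.5em;">'
    '<div style="margin-bottom:0.5em;">9 of 35 people found the following review helpful</div>'
    '<div class="tiny"><b>This review is from: </b></div>'
    'That is true.'
    '</div>'
)

soup = BeautifulSoup(sample, "html.parser")
outer = soup.find_all("div", style="margin-left:0.5em;")[0]

# .text joins the strings of all descendants, wanted or not.
all_text = outer.text
print(all_text)
```

The review body is only the last piece of that string, which is why selecting the outer div and cleaning up afterwards gets tedious.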
Edit: The HTML is:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
The div tag above the text is as follows:
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
To get the text in the tail of div.tiny:
review = soup.find("div", "tiny").find_next_sibling(string=True)
Full example:
#!/usr/bin/env python
from bs4 import BeautifulSoup
html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""
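# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")
# The review is the first text node that follows div.tiny at the same level.
review = soup.find("div", "tiny").find_next_sibling(string=True)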
print(review)
Output
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
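If the markup varies from page to page, a slightly more defensive version may help. The sketch below is one possible approach, not the answer's own code: `extract_review` is a hypothetical helper name, and the choice to skip whitespace-only strings and return `None` when `div.tiny` is missing is an assumption about what a caller would want.

```python
from bs4 import BeautifulSoup, NavigableString


def extract_review(markup):
    """Return the stripped text following div.tiny, or None if not found."""
    soup = BeautifulSoup(markup, "html.parser")
    header = soup.find("div", class_="tiny")
    if header is None:
        return None
    # Walk the siblings after div.tiny and take the first non-blank string.
    for sibling in header.next_siblings:
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


print(extract_review('<div class="tiny"><b>This review is from: </b></div>\n That is true. '))
```

Stripping the result also removes the leading newline that the raw sibling string carries when the review starts on its own line.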
Here's an equivalent lxml code that produces the same output:
import lxml.html
doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
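The `.tail` attribute comes from the ElementTree data model that lxml follows: each element stores the text that comes right after its closing tag. As a rough illustration, the sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, with a small well-formed snippet that is an assumption for the example (the stdlib parser requires well-formed XML, unlike `lxml.html`).

```python
import xml.etree.ElementTree as ET

# Small well-formed stand-in for the review markup (an assumption for illustration).
root = ET.fromstring(
    '<div><div class="tiny"><b>This review is from: </b></div>That is true.</div>'
)

header = root.find(".//div[@class='tiny']")

# .tail holds the text between the end of div.tiny and the next tag.
print(header.tail)
```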