Parsing HTML with BeautifulSoup

Views: 240 Published: 2016/8/5 18:56:26 python beautifulsoup

Problem description

(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)

I'm trying to extract the text of the review at the bottom. I've tried this:

y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text

The problem is that there is unwanted text in the unexpanded div tags that becomes tedious to remove from the content of the review. For the life of me, I just can't figure this out. Could someone please help me?

Edit: The HTML is:

<div style="margin-left:0.5em;">
    <div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
    <div style="margin-bottom:0.5em;">
    <div style="margin-bottom:0.5em;">
    <div class="tiny" style="margin-bottom:0.5em;">
        <b>
    </div>
    That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.

The div tag above the text is as follows:

<div class="tiny" style="margin-bottom:0.5em;">
    <b>
        <span class="h3color tiny">This review is from: </span>
        <a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
    </b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.

Solution

To get the text in the tail of div.tiny:

review = soup.find("div", "tiny").findNextSibling(text=True)

Full example:

#!/usr/bin/env python
from bs4 import BeautifulSoup

html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
   9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
    <span class="h3color tiny">This review is from: </span>
    <a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
     A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly; bs4 warns when none is given
review = soup.find("div", "tiny").findNextSibling(text=True)
print(review)

Output


That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
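To make the difference between the two approaches concrete, here is a minimal sketch on a trimmed, closed copy of the markup (the shortened link text and the added closing tag are assumptions for illustration). It uses `find_next_sibling(string=True)`, the current spelling of the call above in recent bs4 releases, and adds `.strip()` to drop the leading newline visible in the output.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (link text shortened, outer div closed).
html = """<div style="margin-left:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b><span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">A Dance with Dragons</a></b>
</div>
That is true. I tried it myself this morning.
</div>"""

soup = BeautifulSoup(html, "html.parser")

# The question's approach: the outer div's text includes every descendant's
# text, so the "This review is from:" header comes along with the review.
outer_text = soup.find("div", style="margin-left:0.5em;").get_text()

# The review is a bare text node directly after div.tiny, so ask for the
# next sibling that is a string; strip() removes the surrounding newlines.
tiny = soup.find("div", class_="tiny")
review = tiny.find_next_sibling(string=True).strip()

print(review)
```

Here `outer_text` still contains the header line, while `review` holds only the sentence that follows the closing `</div>` of `div.tiny`.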

Here's equivalent lxml code that produces the same output:

import lxml.html

doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
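The `.tail` attribute comes from the ElementTree data model that lxml follows: text appearing after an element's closing tag is stored on that element, not on its parent. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` instead of lxml, on a hypothetical well-formed fragment (ElementTree, unlike `lxml.html`, does not tolerate unclosed tags).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; ElementTree cannot parse sloppy HTML.
fragment = (
    '<div>'
    '<div class="tiny"><b>This review is from: A Dance with Dragons</b></div>'
    'That is true. I tried it myself this morning.'
    '</div>'
)

root = ET.fromstring(fragment)
tiny = root.find(".//div[@class='tiny']")

# .text is the text before the element's first child (none here);
# .tail is the text between the element's closing tag and whatever follows.
print(repr(tiny.text))
print(tiny.tail)
```

Because the review sits after `</div>` and not inside it, it shows up as `tiny.tail`, which is the same node BeautifulSoup exposes as the next string sibling.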
