Parsing HTML with BeautifulSoup

Problem description

(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)

I'm trying to extract the text of the review at the bottom. I've tried this:

y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text

The problem is that there is unwanted text in the unexpanded div tags that becomes tedious to remove from the content of the review. For the life of me, I just can't figure this out. Could someone please help me?

Edit: The HTML is:

<div style="margin-left:0.5em;">
    <div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
    <div style="margin-bottom:0.5em;">
    <div style="margin-bottom:0.5em;">
    <div class="tiny" style="margin-bottom:0.5em;">
        <b>
    </div>
    That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few. 

The div tag above the text is as follows:

<div class="tiny" style="margin-bottom:0.5em;">
    <b>
        <span class="h3color tiny">This review is from: </span>
        <a href="https://rads.stackoverflow.com/amzn/click/com/B005C7QVUE" rel="nofollow noreferrer">A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
    </b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few. 

Solution

To get the text in the tail of div.tiny:

review = soup.find("div", "tiny").findNextSibling(text=True)

Full example:

#!/usr/bin/env python
from bs4 import BeautifulSoup

html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
   9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
    <span class="h3color tiny">This review is from: </span>
    <a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
     A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""

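# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")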
review = soup.find("div", "tiny").findNextSibling(text=True)
print(review)

Output


That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
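
To see why the sibling lookup helps, here is a minimal sketch that contrasts it with calling `.text` on the outer div, as the question did. The shortened `snippet` string is a hypothetical stand-in for the real page, and `find_next_sibling(string=True)` is the current bs4 spelling of `findNextSibling(text=True)`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the markup from the question.
snippet = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">9 of 35 people found the following review helpful</div>
<div class="tiny"><b>This review is from: A Dance with Dragons</b></div>
That is true. I tried it myself this morning.
</div>"""

soup = BeautifulSoup(snippet, "html.parser")

# .text on the outer div concatenates the text of every descendant,
# so the "helpful" line and the "This review is from" line come along.
outer_text = soup.find("div", style="margin-left:0.5em;").text

# The string node that directly follows div.tiny holds only the review body.
review = soup.find("div", "tiny").find_next_sibling(string=True).strip()

print("people found" in outer_text)  # the unwanted text is present here
print(review)
```

The second lookup never touches the nested divs, which is why nothing has to be stripped out afterwards.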

Here's equivalent lxml code that produces the same output:

import lxml.html

# `html` is the string defined in the full example above
doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
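
The `.tail` attribute used here comes from the ElementTree API, which the standard library also provides. As a rough sketch, the same idea works with `xml.etree.ElementTree`, assuming the markup is well-formed (ElementTree rejects unclosed tags, so the hypothetical `snippet` below closes every div, unlike the HTML in the question).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment for illustration only.
snippet = """<div>
<div class="tiny"><b>This review is from: A Dance with Dragons</b></div>
That is true. I tried it myself this morning.
</div>"""

root = ET.fromstring(snippet)

# .tail is the text that follows an element's closing tag,
# up to the next sibling element or the parent's closing tag.
tiny = root.find(".//div[@class='tiny']")
review = tiny.tail.strip()

print(review)
```

For real-world pages that are not well-formed, lxml.html or BeautifulSoup remain the safer choice because they tolerate broken markup.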
