Python Beautiful Soup .contents Property

Question

What does BeautifulSoup's .contents do? I am working through crummy.com's tutorial and I don't really understand what .contents does. I have searched the forums and have not found an answer. Consider the code below:

from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[0].contents[0].contents[0].name

I would expect the last line of the code to print 'body', but instead I get:

  File "pe_ratio.py", line 29, in <module>
    print soup.contents[0].contents[0].contents[0].contents[0].name
  File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'

Is .contents only concerned with html, head and title? If so, why is that?

Thanks in advance for your help.

Recommended answer

It just gives you what's inside the tag, as a list of that tag's direct children. Let me demonstrate with an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

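from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html_doc, "html.parser")
head = soup.head

# .contents is the list of the tag's direct children.
print(head.contents)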

The code above prints a list, [<title>The Dormouse's story</title>], because that is what is inside the head tag. Indexing it with [0] gives you the first item in the list, which here is the title tag.

You get the error because soup.contents[0].contents[0].contents[0].contents[0] returns a NavigableString, which is a piece of text rather than a tag, and as the traceback shows it has no name attribute. In your code it returns Page title: the first contents[0] gives you the html tag, the second gives you the head tag, the third gives you the title tag, and the fourth gives you the text inside title. So when you ask for name on it, there is no tag name to return.
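
To make that chain easier to follow, here is a minimal sketch that walks it one step at a time and records what each contents[0] lands on. It assumes the bs4 package with the built-in "html.parser" rather than the older BeautifulSoup 3 import used in the question, and the names node and steps are only illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Same markup as in the question, joined into a single string.
doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.'
       '</html>')

# Assumption: bs4 with the standard-library parser.
soup = BeautifulSoup(doc, "html.parser")

# Follow contents[0] four times, as the failing line in the question does.
node = soup
steps = []
for _ in range(4):
    node = node.contents[0]
    # Tags are labelled by their tag name; anything else by its class name.
    label = node.name if isinstance(node, Tag) else type(node).__name__
    steps.append(label)

print(steps)      # the last entry is a NavigableString, not a tag
print(str(node))  # the text inside <title>
```

Checking isinstance(node, Tag) before reading name is one way to guard against landing on text when walking contents by index.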

If you want the body printed, you can do the following:

soup = BeautifulSoup(''.join(doc))
# soup.body goes straight to the body tag, wherever it sits in the tree.
print(soup.body)

If you want to reach body using only contents, use the following:

soup = BeautifulSoup(''.join(doc))
# html is contents[0] of the soup; body is contents[1] of html.
print(soup.contents[0].contents[1].name)

You will not get it using [0] as the index, because body is the second child of html, after head.
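
If it is unclear which index a tag sits at, one option is to list the children of html first. This is a sketch under the same assumptions as before (bs4 with "html.parser"), and child_names is just an illustrative variable name.

```python
from bs4 import BeautifulSoup, Tag

# Same markup as in the question, joined into a single string.
doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.'
       '</html>')

# Assumption: bs4 with the standard-library parser.
soup = BeautifulSoup(doc, "html.parser")
html = soup.contents[0]

# Collect the tag names of html's direct children, skipping any bare text.
child_names = [child.name for child in html.contents if isinstance(child, Tag)]

print(child_names)            # head comes first, then body
print(html.contents[1].name)  # index 1 is therefore body
```

Positional indexing like this is fragile when the markup contains whitespace between tags, because that whitespace also shows up in contents as text; soup.body avoids the problem.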
