Python web scraping involving HTML tags with attributes


Question

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:

<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr>
<td class="author">####I want whatever is located here ###</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>

I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?
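
For reference, a minimal self-contained lxml sketch of that kind of extraction might look like the lines below (illustrative only, not necessarily the approach to take: it assumes the page parses directly with lxml.html, the URL is the placeholder from the code further down, and the id/class values are simply taken from the skeleton above):

    import urllib2
    from lxml import html
    from lxml.cssselect import CSSSelector

    address = 'http://www.example.com/'
    tree = html.fromstring(urllib2.urlopen(address).read())

    # id and class attributes become #container / #contents / .author in CSS
    # selector terms, so the two divs and the td can be addressed in one selector
    select_authors = CSSSelector('div#container div#contents td.author')
    for td in select_authors(tree):
        print td.text_content().strip()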

At the moment, my code looks like what is below:

    import re
    import urllib2,sys
    import lxml
    from lxml import etree
    from lxml.html.soupparser import fromstring
    from lxml.etree import tostring
    from lxml.cssselect import CSSSelector
    from BeautifulSoup import BeautifulSoup, NavigableString

    # fetch the page, tidy it with BeautifulSoup, then hand the result to lxml
    address='http://www.example.com/'
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    html=soup.prettify()
    # swap two named entities for numeric character references before re-parsing
    html=html.replace('&nbsp', '&#160')
    html=html.replace('&iacute','&#237')
    root=fromstring(html)

I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.

EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in the page that I want to scrape.

Solution

It's not clear to me from your question why you need to worry about the div tags -- what about doing just:

soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string

On the HTML you give, running this emits exactly:

####I want whatever is located here ###

which appears to be what you want. Maybe you can specify more exactly what it is you need that this super-simple snippet doesn't do -- multiple td tags, all of class author, that you need to consider (all of them? just some? which ones?), the possibility that such a tag is missing entirely (what do you want to do in that case?), and the like. It's hard to infer exactly what your specs are from just this simple example and overabundant code;-).

Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:

thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string

...i.e., not much harder at all!-)
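
As an aside on the edge cases mentioned above: a page might have no such cells at all, and .string comes back as None whenever a td holds nested tags. A minimal sketch of guarding for both (still BeautifulSoup 3, as in the question's imports; the fallback join is just one possible choice):

thetds = soup.findAll('td', attrs={'class': 'author'})
if not thetds:
    print 'no author cells found'  # decide what a missing tag should mean for your scraper
for thetd in thetds:
    # .string is None when the td contains nested tags; fall back to joining its text nodes
    text = thetd.string or ''.join(thetd.findAll(text=True))
    print text.strip()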
