Python web scraping involving HTML tags with attributes

Question

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:

    <html>
      <body>
        <div id="container">
          <div id="contents">
            <table>
              <tbody>
                <tr>
                  <td class="author">####I want whatever is located here ###</td>
                </tr>
              </tbody>
            </table>
          </div>
        </div>
      </body>
    </html>

I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?

At the moment, my code looks like what is below:

    import re
    import urllib2,sys
    import lxml
    from lxml import etree
    from lxml.html.soupparser import fromstring
    from lxml.etree import tostring
    from lxml.cssselect import CSSSelector
    from BeautifulSoup import BeautifulSoup, NavigableString

    address='http://www.example.com/'
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    html=soup.prettify()
    html=html.replace('&nbsp', '&#160')
    html=html.replace('&iacute','&#237')
    root=fromstring(html)

I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.

Edit: I suppose that I didn't make this quite clear, but I have multiple tags in the page that I want to scrape.

Recommended answer

It's not clear to me from your question why you need to worry about the div tags -- what about doing just:

    soup = BeautifulSoup(html)
    thetd = soup.find('td', attrs={'class': 'author'})
    print thetd.string

On the HTML you give, running this emits exactly:

####I want whatever is located here ###

which appears to be what you want. Maybe you can specify better exactly what it is you need and this super-simple snippet doesn't do -- multiple td tags all of class author of which you need to consider (all? just some? which ones?), possibly missing any such tag (what do you want to do in that case), and the like. It's hard to infer what exactly are your specs, just from this simple example and overabundant code;-).

Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:

    thetds = soup.findAll('td', attrs={'class': 'author'})
    for thetd in thetds:
        print thetd.string

...i.e., not much harder at all!-)
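
If the enclosing `div` tags do matter (for example, because other tables on the page also use the `author` class), one option is a CSS selector that names them. The sketch below uses `bs4` on Python 3, where `find_all` is the current spelling of `findAll`. The sample page, the author names, and the extra table outside the `contents` div are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two author cells inside the divs, one outside them.
SAMPLE_HTML = """
<html><body>
<div id="container"><div id="contents">
<table><tbody>
<tr><td class="author">First Author</td><td class="title">Some title</td></tr>
<tr><td class="author">Second Author</td></tr>
</tbody></table>
</div></div>
<table><tr><td class="author">Outside the contents div</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every td with class "author", wherever it sits in the page.
all_authors = [td.get_text(strip=True)
               for td in soup.find_all("td", class_="author")]

# Only the author cells nested under div#container > ... > div#contents.
scoped_authors = [td.get_text(strip=True)
                  for td in soup.select("div#container div#contents td.author")]

print(all_authors)
print(scoped_authors)
```

When every `author` cell on the page is wanted, the plain `find_all` call is enough and the selector adds nothing.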
