Screen Scraping in Python

Question

I'm new to the whole concept of screen scraping in Python, although I've done a bit of screen scraping in R. I'm trying to scrape the Yelp website. I'm trying to scrape the names of each insurance agency which the yelp search returns. With most scraping tasks, I'm able to perform the following task, but always have a hard time going forward with parsing the xml.

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.yelp.com/search?find_desc=insurance+agency&ns=1&find_loc=Austin').read())

print soup

So when scraping a site, what are the steps that one should follow? Is there a set of necessary actions that one needs to take each time they attempt to scrape a site?

I'm running Python 2.6 on Ubuntu 10.10

I realize that this may be a poor SO question as outlined in the faq, but I'm hoping someone can provide some general guidelines and things to consider when scraping a site.

Recommended answer

I'd recommend reading up on XPath and trying this Scrapy tutorial: http://doc.scrapy.org/intro/tutorial.html. It is fairly easy to write a spider like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    # parse() must be a method of the spider class; it is called once per downloaded page.
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select every list item, then pull fields out relative to each one.
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            # Python 2 print statement, matching the Python 2.6 setup in the question.
            print title, link, desc
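
To get a feel for the XPath expressions in the spider without installing Scrapy, here is a minimal sketch using only the standard library. It is not part of the original answer and makes several assumptions: it targets Python 3 rather than the Python 2.6 from the question, the markup in SAMPLE_HTML is made up and well-formed (real pages usually are not, so a forgiving HTML parser would be needed), and the helper name extract_items is hypothetical. ElementTree supports only a subset of XPath, so the link text and href are read from element attributes instead of through text() and @href.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <body>
    <ul>
      <li><a href="http://example.com/a">Book A</a> - An introductory text</li>
      <li><a href="http://example.com/b">Book B</a> - A reference manual</li>
    </ul>
  </body>
</html>
"""


def extract_items(markup):
    """Return a list of (title, link, description) tuples, one per <li>."""
    root = ET.fromstring(markup)
    items = []
    # './/ul/li' plays the role of '//ul/li' in the spider above.
    for li in root.findall(".//ul/li"):
        link = li.find("a")
        if link is None:
            continue  # skip list items that contain no link
        title = (link.text or "").strip()   # roughly 'a/text()'
        href = link.get("href")             # roughly 'a/@href'
        desc = (link.tail or "").strip()    # text following the link inside the <li>
        items.append((title, href, desc))
    return items


if __name__ == "__main__":
    for title, href, desc in extract_items(SAMPLE_HTML):
        print(title, href, desc)
```

The overall shape is the same as in the spider: select the repeating container first, then extract each field relative to it, which keeps titles, links, and descriptions aligned per item.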
