How to batch-scrape Qiushibaike jokes with a Python crawler

Views: 99  Published: 2017/9/6 2:35:11

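Problem description

Question

I've just started learning Python and don't know the scrapy framework yet. I only want a simple crawler that grabs the jokes from the first 10 pages (the first N pages). Is there some simple code that can do this without scrapy? I tried adding a for loop around page, but it still only fetched a single page, and I can't figure out what to do.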

import urllib
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
user_agent = 'Mozilla/5.0 ( Windows NT 6.1)'
headers = { 'User-Agent' : user_agent }

try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<div.*?class="content">.*?<span>(.*?)</span>.*?</div>.*?',re.S)
    items = re.findall(pattern,content)
    for item in items:
        print item

except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason

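Solution

I ran your code and found that it fetched the first 2 pages, after which every request came back with an error code. I think the cause is that you do nothing to cope with anti-scraping measures: your results all came back within one second, and 10 consecutive requests in one second is clearly not something a human could do.

Many sites can tell when you are hitting them with code. Some of them really dislike this and take anti-scraping measures, possibly banning your IP outright so that you can't reach the site at all. If they didn't, too many requests in a short time could bring the site down.

My suggestion is to wait 1 second after each page. I modified your code a little: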

# Python 2 code: urllib2 and the print statement do not exist in Python 3.
import urllib2
import re
import time

user_agent = 'Mozilla/5.0 ( Windows NT 6.1)'
headers = {'User-Agent': user_agent}
# Capture the contents of the first <span> after each <div class="content">.
pattern = re.compile('<div.*?class="content">.*?<span>(.*?)</span>.*?</div>.*?', re.S)

for page in range(1, 11):
    print('at page %s' % page)
    # Build the URL inside the loop so it changes with every page.
    url = 'http://www.qiushibaike.com/8hr/page/' + str(page)

    try:
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')
        items = re.findall(pattern, content)
        for item in items:
            print item

    except urllib2.URLError as e:
        # HTTPError has .code; a plain URLError only has .reason.
        if hasattr(e, "code"):
            print e.code
        if hasattr(e, "reason"):
            print e.reason

    # Wait one second before requesting the next page.
    time.sleep(1)
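
For readers on Python 3, where urllib2 no longer exists, a rough port of the same loop might look like the sketch below. It is not the original poster's code: the names BASE_URL, fetch_html and crawl are made up for illustration, and the fetch and sleep parameters are only there so the loop can be tried without touching the network. The URL, User-Agent and regular expression are copied from the post, and the site's markup may well have changed since 2017.

```python
import re
import time
import urllib.error
import urllib.request

BASE_URL = 'http://www.qiushibaike.com/8hr/page/'  # URL pattern from the post
HEADERS = {'User-Agent': 'Mozilla/5.0 ( Windows NT 6.1)'}
# Regular expression copied from the post, compiled once.
PATTERN = re.compile('<div.*?class="content">.*?<span>(.*?)</span>.*?</div>.*?', re.S)


def fetch_html(url):
    """Download one page and decode it as UTF-8."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode('utf-8')


def crawl(last_page, fetch=fetch_html, delay=1.0, sleep=time.sleep):
    """Return {page: [jokes]} for pages 1..last_page, pausing after each request."""
    results = {}
    for page in range(1, last_page + 1):
        url = BASE_URL + str(page)  # rebuilt on every iteration
        try:
            results[page] = PATTERN.findall(fetch(url))
        except urllib.error.URLError as e:
            # HTTPError carries .code; a plain URLError only has .reason.
            print('page', page, 'failed:', getattr(e, 'code', e.reason))
            results[page] = []
        sleep(delay)
    return results
```

Calling crawl(10) would request pages 1 to 10 with a one-second pause after each; passing a stand-in for fetch lets you check the loop offline.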

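The urllib2 code above produces results on my end. I'd also like to recommend a third-party library called requests. Since you already know urllib, it won't be hard to pick up, and it is friendlier to use. It pairs nicely with the BeautifulSoup library (used for parsing and processing HTML text). You can search online to learn more about both.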
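
To give a feel for that combination, here is a minimal Python 3 sketch using requests and BeautifulSoup. It is only an illustration under assumptions, not tested against the live site: it assumes both packages are installed (pip install requests beautifulsoup4) and that each joke still sits in a span inside a div with class "content", which is what the regular expression in the question targets. The function names parse_jokes, fetch_jokes and crawl are invented for this example.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://www.qiushibaike.com/8hr/page/'  # URL pattern from the post
HEADERS = {'User-Agent': 'Mozilla/5.0 ( Windows NT 6.1)'}


def parse_jokes(html):
    """Return the text of the first <span> in every <div class="content">."""
    soup = BeautifulSoup(html, 'html.parser')
    jokes = []
    # Assumption: same markup that the regular expression in the question targets.
    for div in soup.find_all('div', class_='content'):
        span = div.find('span')
        if span is not None:
            jokes.append(span.get_text(strip=True))
    return jokes


def fetch_jokes(page, session):
    """Fetch one listing page and parse it; raises on HTTP error codes."""
    response = session.get(BASE_URL + str(page), headers=HEADERS, timeout=10)
    response.raise_for_status()
    return parse_jokes(response.text)


def crawl(last_page, delay=1.0):
    """Yield (page, jokes) for pages 1..last_page, waiting between requests."""
    with requests.Session() as session:
        for page in range(1, last_page + 1):
            try:
                yield page, fetch_jokes(page, session)
            except requests.RequestException as e:
                print('page', page, 'failed:', e)
                yield page, []
            time.sleep(delay)
```

Iterating over crawl(10) would fetch the first ten pages with the same one-second pause; parse_jokes can be tried on any saved HTML string without network access.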

One more thing: whenever you write crawlers in the future, be sure to account for anti-scraping measures!
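
As one possible way to act on that advice, beyond the fixed one-second pause, a crawler can back off and retry when a request fails. The Python 3 sketch below is my own illustration, not something from the answer: fetch_with_backoff and all of its parameters are hypothetical names, and fetch stands for any function that takes a URL and returns the page text, raising urllib.error.URLError on failure.

```python
import random
import time
import urllib.error


def fetch_with_backoff(fetch, url, max_tries=4, base_delay=1.0,
                       sleep=time.sleep, jitter=random.random):
    """Call fetch(url); on URLError wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except urllib.error.URLError:
            if attempt == max_tries - 1:
                raise  # out of retries: let the caller see the error
            # Waits grow 1s, 2s, 4s, ... with up to 1s of random jitter added.
            sleep(base_delay * 2 ** attempt + jitter())
```

The random jitter keeps the requests from arriving at perfectly regular intervals. Whether this is enough depends entirely on the site, and a crawler that keeps getting error codes should simply slow down or stop.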
