Web crawler - pyspider: why does debugging fetch content, but clicking Run writes no data to the database?

Views: 246 Published: 2017/9/6 7:05:42 Tags: web crawler, python3.x, pyspider


Problem description

Question

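1. On Win10 I installed the anaconda3 environment and then pyspider 0.3.8 (without manually patching the bug where crawl_config has no effect). I wrote a project that scrapes news from a web page. When debugging, it usually fetches the news, but after clicking Run no data is written to the database. This is puzzling. The source is attached below; corrections are welcome.

2. The source code: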

from pyspider.libs.base_handler import *
import time
import re

class Handler(BaseHandler):
    crawl_config = {
        'headers':{
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
        }
    }

    @every(minutes=30)
    def on_start(self):
        # Fetch the index page with the JS fetcher and click "more" to load extra items.
        # Note: setTimeout(x.click(), 5000) calls click() immediately and passes its
        # return value to setTimeout, so the three clicks fire at once, not 5 s apart.
        self.crawl('http://opinion.hexun.com/', callback=self.index_page,
                   fetch_type='js',js_script='''
                   function() {
                     setTimeout(document.getElementById('inflistBoxMore').getElementsByTagName('a')[0].click(),5000);
                     setTimeout(document.getElementById('inflistBoxMore').getElementsByTagName('a')[0].click(),5000);
                     setTimeout(document.getElementById('inflistBoxMore').getElementsByTagName('a')[0].click(),5000);
                   }
                    ''')

    def index_page(self, response):
        now_day = time.strftime('%Y%m%d', time.localtime())
        for each in response.doc('.tit a,.newtit').items():
            # Article URLs embed the date as YYYY-MM-DD (the original pattern hard-coded 2016).
            match = re.search(r'(\d{4}-\d{2}-\d{2})', str(each.attr.href))
            news_day = match.group(1).replace('-', '') if match else ''
            # Only articles dated today are queued for detail_page.
            if news_day == now_day:
                self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        response.doc('.TRS_Editor > style').remove()
        # Fall back to the <title> text when the article heading is empty.
        # .text() returns a plain string instead of a PyQuery object, so the result can be stored.
        title = response.doc('.articleName h1').text() or response.doc('title').text()
        return {
            "1title": title,
            "3context": response.doc('.art_contextBox').text(),
            "2date": time.strftime('%Y-%m-%d %H:%M', time.localtime()),
        }
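
The date filter in index_page decides whether any detail page is queued at all, so it is worth testing on its own. Below is a minimal sketch of that filter as standalone functions. The function names and the sample URL are hypothetical, chosen only for illustration; the pattern assumes article links embed the date as YYYY-MM-DD.

```python
import re
import time

# Assumed URL shape: the article date appears as YYYY-MM-DD somewhere in the link.
DATE_IN_URL = re.compile(r'(\d{4})-(\d{2})-(\d{2})')


def news_day_from_href(href):
    """Return the date embedded in href as 'YYYYMMDD', or '' if there is none."""
    match = DATE_IN_URL.search(href or '')
    return ''.join(match.groups()) if match else ''


def is_todays_news(href, today=None):
    """True when the date in href equals today (local time, 'YYYYMMDD')."""
    if today is None:
        today = time.strftime('%Y%m%d', time.localtime())
    return news_day_from_href(href) == today
```

If no link on the index page passes this check, detail_page is never scheduled and no result can be written, regardless of how the project is run.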

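Some additional details. Could the author please take another look? Thanks.
1. These are the active tasks:
SUCCESS opinion_hexun_com > data:,on_start 9 seconds ago 0.0+0.00ms +1
SUCCESS opinion_hexun_com > data:,on_finished 9 seconds ago
SUCCESS opinion_hexun_com > data:,_on_cronjob 27 minutes ago
SUCCESS opinion_hexun_com > data:,_on_get_info 36 minutes ago
2. "Was detail page executed?" How do I check this?
3. "Did detail page succeed? Open the log and check whether result in track.process has content, then check whether the result section has content. If it does, the problem is in the result display page. Are you using mongodb? If so, the default range has a problem; try upgrading to the github master version." I am a beginner, sorry: where do I find this log? For the record, I am not using mongodb; I am using the default sqlite3.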

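Solution

1. Look at active tasks: did the tasks actually run?
2. Was detail page executed?
3. Did detail page succeed? Open the log and check whether result in track.process has content, then check whether the result section has content. If it does, the problem is in the result display page. Are you using mongodb? If so, the default range has a problem; try upgrading to the github master version.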
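
Since the asker uses the default sqlite3 backend, one way to check whether anything was stored is to count the rows in the result database directly. This is a minimal sketch using only the standard library. The helper name is hypothetical, and the file location `data/result.db` mentioned afterwards is an assumption about a default standalone setup, not something stated in the thread.

```python
import sqlite3


def count_rows_per_table(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                'SELECT COUNT(*) FROM ' + quoted).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling `count_rows_per_table('data/result.db')` (path assumed) after clicking Run shows whether a table for the project exists and whether it has any rows. An empty or missing table points to detail page not running or not returning a result, rather than to a display problem.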

If your script has already been RUN, the links are deduplicated. Use
http://docs.pyspider.org/en/l...
to avoid this.
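
The link above is truncated, so the exact option the answer points to is not known. pyspider's `self.crawl` does document the parameters `age`, `itag`, and `force_update`, all of which affect whether an already-crawled URL is fetched again. The sketch below is a simplified model of that decision, written from my understanding of the scheduler and not copied from pyspider's code. The function name, the dictionary layout, and the default of -1 (never re-crawl) are assumptions for illustration.

```python
import time

# Assumption: with no age configured, a finished task is never re-crawled.
DEFAULT_AGE = -1


def should_recrawl(old_task, new_schedule, now=None):
    """Simplified model: decide whether an already-crawled URL runs again.

    old_task:     {'lastcrawltime': float, 'schedule': {...}} from the last run
    new_schedule: options of the new request, e.g. {'age': 600}
    """
    now = time.time() if now is None else now
    old_schedule = old_task.get('schedule', {})
    last_crawl = old_task.get('lastcrawltime') or 0
    age = new_schedule.get('age', DEFAULT_AGE)

    # A changed itag marks the page as modified, so it is crawled again.
    if new_schedule.get('itag') and new_schedule['itag'] != old_schedule.get('itag'):
        return True
    # A non-negative age allows a re-crawl once the last result has expired.
    if age >= 0 and last_crawl + age < now:
        return True
    # force_update bypasses the checks above.
    return bool(new_schedule.get('force_update'))
```

Under this model, the index page in the question is requested without an `age`, so after the first Run the same URL is treated as done and `index_page` is not called again. Passing something like `age=10 * 60` in the `self.crawl` call for the index URL, or decorating `index_page` with `@config(age=...)`, would let each `@every(minutes=30)` trigger fetch it again.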
