Django-dynamic-scraper unable to scrape the data


Problem description

I am new to using dynamic scraper, and I have used the following sample for learning: open_news. I have everything set up, but it keeps showing me the same error: dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.

2015-11-20 18:45:11+0000 [article_spider] ERROR: Spider error processing <GET https://en.wikinews.org/wiki/Main_page>
Traceback (most recent call last):
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 645, in _tick
    taskObj._oneWorkUnit()
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 491, in _oneWorkUnit
    result = next(self._iterator)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
    for x in result:
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/spiders/django_spider.py", line 378, in parse
    rpt = self.scraper.get_rpt_for_scraped_obj_attr(url_elem.scraped_obj_attr)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/models.py", line 98, in get_rpt_for_scraped_obj_attr
    return self.requestpagetype_set.get(scraped_obj_attr=soa)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/query.py", line 334, in get
    self.model._meta.object_name
dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.

Recommended answer

This is caused by missing "REQUEST PAGE TYPES". Each entry under "SCRAPER ELEMS" must have its own "REQUEST PAGE TYPE".

To solve this problem, follow the steps below:

  1. Log in to the admin page (usually http://localhost:8000/admin/)
  2. Go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article)
  3. Click "Add another Request page type" under "REQUEST PAGE TYPES"
  4. Create four "Request page types" in total, one each for "(base (Article))", "(title (Article))", "(description (Article))", and "(url (Article))"

"REQUEST PAGE TYPE" settings

All "Content type" are "HTML".

All "Request type" are "Request".

All "Method" are "GET".

For "Page type", assign them in sequence as follows:

(base (Article)) | Main Page
(title (Article)) | Detail Page 1
(description (Article)) | Detail Page 2
(url (Article)) | Detail Page 3

After the steps above, the "DoesNotExist: RequestPageType" error should be fixed.

However, a new error comes up: "ERROR: Mandatory elem title missing!"

To solve this, I suggest changing all "REQUEST PAGE TYPE" values in "SCRAPER ELEMS" to "Main Page", including "title (Article)".

Then change the XPaths as follows:

(base (Article)) | //td[@class="l_box"]
(title (Article)) | span[@class="l_title"]/a/@title
(description (Article)) | p/span[@class="l_summary"]/text()
(url (Article)) | span[@class="l_title"]/a/@href
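
Before re-running the spider, you can sanity-check these XPaths interactively with scrapy shell. Note that the three detail expressions are relative to the base element, which is how DDS applies them; the l_box/l_title/l_summary classes come from the Wikinews markup the example targets and may have changed since:

    scrapy shell "https://en.wikinews.org/wiki/Main_Page"
    >>> boxes = response.xpath('//td[@class="l_box"]')               # base (Article)
    >>> boxes.xpath('span[@class="l_title"]/a/@title').extract()     # titles
    >>> boxes.xpath('p/span[@class="l_summary"]/text()').extract()   # descriptions
    >>> boxes.xpath('span[@class="l_title"]/a/@href').extract()      # urls

If these come back as empty lists, the page markup no longer matches and the XPaths need adjusting.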

Finally, run scrapy crawl article_spider -a id=1 -a do_action=yes at the command prompt. You should now be able to crawl the "Article" items, and you can check them at Home › Open_News › Articles in the admin.
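
You can also confirm from the Django shell that items were saved (a quick check, assuming the Article model fields from the open_news example app):

    python manage.py shell
    >>> from open_news.models import Article
    >>> Article.objects.count()
    >>> for a in Article.objects.all()[:5]:
    ...     print(a.title, a.url)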

Enjoy~
