Scrapy tbody tag returns an empty result but has text inside


Question


I am trying to crawl and scrape a website. The data I want (the event names) is inside a tbody tag. When I inspect the page in the Google Chrome console, the tbody tag contains text, but when I try to scrape it, Scrapy returns an empty result (I also tested this in the scrapy shell). I checked for AJAX requests, since they can change the HTML the script receives, but the page does not seem to use any.

Do you have any idea why the result is empty, even though the tbody tag has text in the page source?

Here is my code:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join

from justrunlah.items import JustrunlahItem

nom_robot = 'ListeCAP'
domaine = ['www.justrunlah.com']
base_url = [
    "https://www.justrunlah.com/running-events-calendar-malaysia",
    "https://www.justrunlah.com/running-events-calendar-australia",
]

class ListeCourse_level1(scrapy.Spider):
    name = nom_robot
    allowed_domains = domaine
    start_urls = base_url

    def parse(self, response):
        for unElement in response.xpath('//*[@id="td-outer-wrap"]/div[3]/div/div/div[1]/div/div[2]/div[3]/table/tbody/tr'):
            loader = ItemLoader(JustrunlahItem(), selector=unElement)
            loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
            # define processors
            loader.default_input_processor = MapCompose(str)
            loader.default_output_processor = Join()
            yield loader.load_item()
        if response.xpath('//a[@class="smallpagination"]'):
            next_page = response.meta.get('page_number', 1) + 1
            next_page_url = '{}?page={}'.format(response.url, next_page)
            yield scrapy.Request(next_page_url, callback=self.parse, meta={'page_number': next_page})

The terminal output:

['https://www.justrunlah.com/running-events-calendar-malaysia/', 'https://www.justrunlah.com/running-events-calendar-australia/']
-----------------------------
2018-03-08 12:34:56 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: justrunlah)
2018-03-08 12:34:56 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'justrunlah', 'NEWSPIDER_MODULE': 'justrunlah.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['justrunlah.spiders']}
2018-03-08 12:34:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-03-08 12:34:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-08 12:34:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
NOM TABLE EN SORTIE :
import_brut_['www.justrunlah.com']
2018-03-08 12:34:57 [scrapy.middleware] INFO: Enabled item pipelines:
['justrunlah.pipelines.JustrunlahPipeline']
2018-03-08 12:34:57 [scrapy.core.engine] INFO: Spider opened
2018-03-08 12:34:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-08 12:34:57 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-03-08 12:34:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/robots.txt> (referer: None)
2018-03-08 12:34:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-malaysia/> (referer: None)
2018-03-08 12:34:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.justrunlah.com/running-events-calendar-australia/> (referer: None)
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
--------------------------------------------------
                SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-03-08 12:34:58 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-08 12:34:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 849,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 76317,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 8, 11, 34, 58, 593309),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 3, 8, 11, 34, 57, 419191)}
2018-03-08 12:34:58 [scrapy.core.engine] INFO: Spider closed (finished)

And the scrapy shell test (the selector also came back empty there).

Solution

I assume you are trying to select all the event names. If so, you can use this as your XPath: //*[@class="cal2table"]/tbody/tr/td[2]/div/div[1]/div/a/text()

So I believe the issue you are having is due to your XPath definitions. Without any further information on what you are trying to select, this is the best answer I can give.

A tip: you can use the following command in the Chrome/Firefox console to test your XPath:
$x('//*[@class="cal2table"]/tbody/tr/td[2]/div/div[1]/div/a/text()')
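The same XPath can also be checked offline in Python. Below is a minimal sketch using lxml (a stand-in for Scrapy's selector machinery); the sample table and the event name "Sample Run 2018" are invented to mirror the structure the answer describes:

```python
# Offline XPath check with lxml. The HTML below is a made-up
# fragment shaped like the cal2table markup discussed above.
from lxml import html

sample = """
<html><body>
<table class="cal2table">
  <tbody>
    <tr>
      <td>1 Apr</td>
      <td><div><div><div><a href="#">Sample Run 2018</a></div></div></div></td>
    </tr>
  </tbody>
</table>
</body></html>
"""

doc = html.fromstring(sample)
names = doc.xpath('//*[@class="cal2table"]/tbody/tr/td[2]/div/div[1]/div/a/text()')
print(names)  # -> ['Sample Run 2018']
```

If this prints an empty list against the real page source (e.g. fetched with scrapy shell), the XPath does not match the HTML the server actually sends, regardless of what the browser console shows.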

To use this the way you are currently loading the items, try the following snippet instead. I haven't tested it, so you may need to make small adjustments.

for unElement in response.xpath('//*[@class="cal2table"]//tr'):
    loader = ItemLoader(JustrunlahItem(), selector=unElement)
    loader.add_xpath('eve_nom_evenement', './/td[2]/div/div[1]/div/a/text()')
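For reference, the same row-by-row pattern (absolute XPath for the rows, relative XPath inside each row) can be exercised without Scrapy, using plain lxml. This is an illustrative sketch only; the two table rows and their event names are invented:

```python
# Row-by-row extraction mirroring the ItemLoader loop above,
# but with plain lxml and dicts instead of Scrapy items.
from lxml import html

sample = """
<html><body>
<table class="cal2table">
  <tbody>
    <tr><td>1 Apr</td>
        <td><div><div><div><a href="#">Sample Run 2018</a></div></div></div></td></tr>
    <tr><td>8 Apr</td>
        <td><div><div><div><a href="#">Demo Marathon</a></div></div></div></td></tr>
  </tbody>
</table>
</body></html>
"""

doc = html.fromstring(sample)
items = []
for row in doc.xpath('//*[@class="cal2table"]//tr'):
    # Relative XPath, evaluated against the current <tr>
    name = row.xpath('.//td[2]/div/div[1]/div/a/text()')
    if name:
        items.append({'eve_nom_evenement': name[0]})
print(items)
# -> [{'eve_nom_evenement': 'Sample Run 2018'}, {'eve_nom_evenement': 'Demo Marathon'}]
```

The key design point is the leading dot in `.//td[2]/...`: without it the expression would search from the document root on every iteration and return the same (first) match for every row.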
