Scrapy - Extract items from table

Problem description

Trying to get my head around Scrapy but hitting a few dead ends.

I have 2 tables on a page and would like to extract the data from each one, then move along to the next page.

The tables look like this (the first is called Y1, the second Y2) and the structures are the same:

<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
                                <h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">                    

                <table class="table table-striped table-hover table-curved">
                    <thead>
                        <tr>
                            <th class="tCol1" style="padding: 10px;">First Col Head</th>
                            <th class="tCol2" style="padding: 10px;">Second Col Head</th>
                            <th class="tCol3" style="padding: 10px;">Third Col Head</th>
                        </tr>
                    </thead>
                    <tbody>

                        <tr>
                            <td>Info 1</td>
                            <td>Monday 5 September, 2016</td>
                            <td>Friday 21 October, 2016</td>
                        </tr>
                        <tr class="vevent">
                            <td class="summary"><b>Info 2</b></td>
                            <td class="dtstart" timestamp="1477094400"><b></b></td>
                            <td class="dtend" timestamp="1477785600">
                            <b>Sunday 30 October, 2016</b></td>
                        </tr>
                        <tr>
                            <td>Info 3</td>
                            <td>Monday 31 October, 2016</td>
                            <td>Tuesday 20 December, 2016</td>
                        </tr>


                    <tr class="vevent">
                        <td class="summary"><b>Info 4</b></td>                      
                        <td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
                        <td class="dtend" timestamp="1483315200">
                        <b>Monday 2 January, 2017</b></td>
                    </tr>



                </tbody>
            </table>

As you can see, the structure is a little inconsistent but as long as I can get each td and output to csv then I'll be a happy guy.

I tried using XPath but this only confused me more.

My last attempt:

import scrapy

class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item

No errors here but it just fires back lots of information about the crawl but no actual results.

Update:

  import scrapy

       class SchoolSpider(scrapy.Spider):
name = "school"

allowed_domains = ["termdates.co.uk"]
start_urls =    (
                'https://termdates.co.uk/school-holidays-16-19-abingdon/',
                )

  def parse_products(self, response):
  products = sel.xpath('//*[@id="Year1"]/table//tr')
 for p in products[1:]:
  item = dict()
  item['hol'] = p.xpath('td[1]/text()').extract_first()
  item['first'] = p.xpath('td[1]/text()').extract_first()
  item['last'] = p.xpath('td[1]/text()').extract_first()
  yield item

This gives me: IndentationError: unexpected indent

If I run the amended script below (thanks to @Granitosaurus) to output to CSV (-o schoolDates.csv), I get an empty file:

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item

This is the log:

2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider opened
2017-03-23 12:04:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-23 12:04:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on ...
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/robots.txt> (referer: None)
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
2017-03-23 12:04:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy-1.3.3-py2.7.egg\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-23 12:04:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 467, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 11311, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 845000), 'log_count/DEBUG': 3, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'spider_exceptions/NotImplementedError': 1, 'start_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 356000)}
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider closed (finished)
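The traceback is the real clue here: Scrapy routes responses for start_urls to a callback named parse by default, and this spider only defines parse_products, so the response falls through to the base Spider.parse, which simply raises NotImplementedError (hence zero items and an empty CSV). A minimal sketch of the fix, assuming the rest of the spider stays as above: rename the callback and use the response argument rather than the undefined sel.

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    # "parse" is the callback Scrapy invokes by default for start_urls,
    # so no extra wiring is needed
    def parse(self, response):
        # select rows via the response object, not an undefined "sel"
        for p in response.xpath('//*[@id="Year1"]/table//tr')[1:]:
            yield {
                'hol': p.xpath('td[1]/text()').extract_first(),
                'first': p.xpath('td[2]/text()').extract_first(),
                'last': p.xpath('td[3]/text()').extract_first(),
            }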

Update 2: (skips rows) This pushes results to the csv file but skips every other row.

The shell shows: {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t', 'first': None}

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item
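The skipped rows correspond to the <tr class="vevent"> rows in the sample HTML: there the cell text sits inside a child <b> element, so td[n]/text() matches only the whitespace directly inside the <td> (the u'\r\n\t\t...' value the shell shows). Selecting descendant text with td[n]//text() reaches into the <b>. A self-contained sketch of the difference, using scrapy's Selector on a stripped-down row modelled on the question's HTML (the snippet itself is made up for illustration):

from scrapy.selector import Selector

# a stripped-down "vevent" row modelled on the sample HTML above
row_html = ('<table><tr class="vevent">'
            '<td class="summary"><b>Info 2</b></td>'
            '<td class="dtend">\r\n\t\t<b>Sunday 30 October, 2016</b></td>'
            '</tr></table>')

row = Selector(text=row_html).xpath('//tr')[0]

# text() only sees text nodes that are *direct* children of the <td>,
# so a cell whose text lives inside <b> yields nothing or bare whitespace
print(row.xpath('td[1]/text()').extract_first())   # None
print(row.xpath('td[2]/text()').extract_first())   # whitespace only

# //text() also walks into child elements such as <b>
print(row.xpath('td[1]//text()').extract_first())  # u'Info 2'

# where whitespace precedes the <b>, even //text() returns the whitespace
# node first, which is why the final solution joins all text nodes and strips
print(''.join(row.xpath('td[2]//text()').extract()).strip())
# u'Sunday 30 October, 2016'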

Solution: Thanks to @vold. This crawls all pages in start_urls and deals with the inconsistent table layout:

# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            # join + strip because the date can be split across text nodes
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
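For completeness: the Schooldates1Item imported above must declare a Field for every key the spider assigns, including the url added here. The items.py file is not shown in the thread, so this is a hypothetical reconstruction:

import scrapy

class Schooldates1Item(scrapy.Item):
    # hypothetical reconstruction of SchoolDates_1/items.py --
    # one Field per key assigned in the spider
    hol = scrapy.Field()
    first = scrapy.Field()
    last = scrapy.Field()
    url = scrapy.Field()

With that in place, scrapy crawl school -o schoolDates.csv should produce one CSV row per table row across both start URLs.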

Recommended answer

You need to slightly correct your code. Since you already select all rows within the table, you don't need to point to the table again: an absolute path such as //*[@id="Y1"]/table/tbody/tr[1]/td[1] starts over from the document root on every iteration, so it would keep returning the same first-row cells. You can shorten the XPath to something like td[1]//text(), which is evaluated relative to each selected row.

def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item
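When an XPath misbehaves, it is worth testing it interactively before wiring it into the spider: running scrapy shell https://termdates.co.uk/school-holidays-16-19-abingdon/ opens a session where expressions like response.xpath('//*[@id="Year1"]/table//tr') can be tried line by line against the live page.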

Edited my answer since @stutray provided the link to the site.
