Scraping from a web page and reformatting to a calendar file


Question

I'm trying to scrape this site: http://stats.swehockey.se/ScheduleAndResults/Schedule/3940

Thanks to alecxe, I've gotten as far as retrieving the dates and teams.

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["stats.swehockey.se"]  # domain names only, not full URLs
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            yield item

So, my next step is to filter out anything that isn't a home game of "AIK" or "Djurgårdens IF". After that I'll need to reformat the result into an .ics file that I can add to Google Calendar.

Edit: I've solved a few things but still have a lot to do. My code now looks like this:

# -*- coding: UTF-8 -*-
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["stats.swehockey.se"]  # domain names only, not full URLs
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            for string in item['teams']:

                teams = string.split('-') #split it

                home_team = teams[0]#.split(' ') #only the first name, e.g. just 'Djurgårdens' out of 'Djurgårdens IF'
                away_team = teams[1]
                #home_team[0] = home_team[0].replace(" ", "") #remove whitespace
                #home_team = home_team[0]

                if "AIK" in home_team:
                    for string in item['date']:
                            year = string[0:4]
                            month = string[5:7]
                            day = string[8:10]
                            hour = string[11:13]
                            minute = string[14:16]

                            print year, month, day, hour, minute, home_team, away_team  
                elif u"Djurgårdens" in home_team:
                    for string in item['date']:
                        year = string[0:4]
                        month = string[5:7]
                        day = string[8:10]
                        hour = string[11:13]
                        minute = string[14:16]

                        print year, month, day, hour, minute, home_team, away_team     

That code prints out the games of "AIK", "Djurgårdens IF" and "Skellefteå AIK". So my problem here is obviously how to filter out the "Skellefteå AIK" games, and whether there is an easy way to make this program better. Thoughts on this?

Best regards!

Recommended answer

I'm guessing that home games are the ones where the team you're looking for comes first (before the dash).

You can do this in XPath or in Python. If you want to do it in XPath, select only the rows that contain the home team name.

//table[@class="tblContent"]/tr[
    contains(substring-before(.//td[3]/text(), "-"), "AIK")
  or
    contains(substring-before(.//td[3]/text(), "-"), "Djurgårdens IF")
]

You can safely remove all whitespace (including newlines); I just added it for readability.

In Python you should be able to do much the same, perhaps even more concisely using regular expressions.
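As a rough sketch of the Python route, the regular expression below is anchored at the start of the fixture text, so the team name has to be the entire home side. A plain substring test such as `"AIK" in home_team` (or `contains(...)` in XPath) also matches "Skellefteå AIK", which is why that team shows up in the question's output. The names `HOME_TEAMS`, `HOME_GAME_RE` and `is_home_game` are made up for illustration, the "Home - Away" layout of the cell text is assumed from the question's code, and the snippet is written for Python 3 rather than the Python 2 used in the question.

```python
import re

# Hypothetical names; the cell text is assumed to look like "Home - Away".
HOME_TEAMS = ("AIK", "Djurgårdens IF")

# Anchored at the start: optional whitespace, one of the wanted names,
# optional whitespace, then the dash. "Skellefteå AIK - ..." cannot match
# because the text does not start with "AIK".
HOME_GAME_RE = re.compile(
    r"^\s*(?:%s)\s*-" % "|".join(re.escape(name) for name in HOME_TEAMS)
)


def is_home_game(fixture):
    """Return True if the text before the dash is exactly a wanted team."""
    return HOME_GAME_RE.match(fixture) is not None
```

Inside the spider's loop this could be used as `if is_home_game(string):` in place of the two `in home_team` branches, which also removes the duplicated date-slicing code.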

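The question also asks about turning the filtered games into an .ics file, which the answer above does not cover. Here is a minimal sketch using only the standard library. It assumes the scraped date text has the form "YYYY-MM-DD HH:MM" (inferred from the slice offsets in the question's code). The function name `to_ics`, the `PRODID` and `UID` values, and the 2.5 hour event length are all invented for illustration. The start times are written without a time zone ("floating" time), so a calendar application will interpret them in its own local zone.

```python
from datetime import datetime, timedelta, timezone


def to_ics(games, duration=timedelta(hours=2, minutes=30), stamp=None):
    """Build iCalendar text from (date_text, home, away) tuples.

    date_text is assumed to look like "YYYY-MM-DD HH:MM". Team names are
    written as-is; text containing commas, semicolons or backslashes would
    need escaping, and very long lines would need folding.
    """
    if stamp is None:
        stamp = datetime.now(timezone.utc)
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//hockey schedule//EN",  # placeholder identifier
    ]
    for index, (date_text, home, away) in enumerate(games):
        start = datetime.strptime(date_text.strip(), "%Y-%m-%d %H:%M")
        end = start + duration  # assumed length; the page gives no end time
        lines += [
            "BEGIN:VEVENT",
            "UID:%s-%d@example.invalid" % (start.strftime("%Y%m%dT%H%M"), index),
            "DTSTAMP:" + stamp.strftime("%Y%m%dT%H%M%SZ"),
            "DTSTART:" + start.strftime("%Y%m%dT%H%M%S"),
            "DTEND:" + end.strftime("%Y%m%dT%H%M%S"),
            "SUMMARY:%s - %s" % (home, away),
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    # iCalendar content lines are separated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```

The returned string can be written to a file with `open("games.ics", "w", encoding="utf-8", newline="")` and then imported into Google Calendar; `newline=""` keeps the CRLF line endings from being translated.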