How to scrape contents from multiple tables in a webpage


Problem description

I want to scrape contents from multiple tables in a webpage. The HTML looks like this:

<div class="fixtures-table full-table-medium" id="fixtures-data">             
    <h2 class="table-header"> Date 1    </h2>
    <table class="table-stats">
        <tbody>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>team 1</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>team 2</a>                
                        </span>
                    </p>
                </td>
            </tr>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='match-details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>team 3</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>team 4</a>                
                        </span>
                    </p>
                </td>
            </tr>
        </tbody>
    </table>

    <h2 class="table-header"> Date 2    </h2>
    <table class="table-stats">
        <tbody>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='match-details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>team X</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>team Y</a>                
                        </span>
                    </p>
                </td>
            </tr>
            <tr class='preview' id='match-row-EFBO755307'>
                <td class='match-details'>
                    <p>
                        <span class='team-home teams'>
                            <a href='random_team'>Team A</a>                
                        </span>                 
                        <span class='team-away teams'>
                            <a href='random_team'>Team B</a>                
                        </span>
                    </p>
                </td>
            </tr>
        </tbody>
    </table>
</div>

There are more matches under each date (9, 2, or 1, depending on how many were played that day), and there are 63 tables in total, one per day.

I want to extract, for each date, matches between teams and also which team is home and which team is away.

I was using the scrapy shell and tried the following commands:

 title = sel.xpath("//td[@class = 'match-details']")[0] 
 l_home = title.xpath("//span[@class = 'team-home teams']/a/text()").extract()

This printed a list of all the home teams, and the following printed a list of all the away teams:

 l_Away = title.xpath("//span[@class = 'team-away teams']/a/text()").extract()

This gave me a list of all the dates:

sel.xpath("/html/body/div[3]/div/div/div/div[4]/div[2]/div/h2/text()").extract()

What I want is, for each date, the matches played on that day (and which team is home and which is away).

Should my items.py look like this?

date = Field()
home_team = Field()
away_team2 = Field()

Please help me write the parse function and the Item class.

Thanks in advance.

Solution

Here's some example logic from the scrapy shell:

>>> for table in response.xpath('//table[@class="table-stats"]'):
...     date = table.xpath('./preceding-sibling::h2[1]/text()').extract()[0]
...     print(date)
...     for match in table.xpath('.//tr[@class="preview" and @id]'):
...         home_team = match.xpath('.//span[@class="team-home teams"]/a/text()').extract()[0]
...         away_team = match.xpath('.//span[@class="team-away teams"]/a/text()').extract()[0]
...         print(home_team, away_team)
... 
 Date 1    
team 1 team 2
team 3 team 4
 Date 2    
team X team Y
Team A Team B
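
The attempt in the question returned every team on the page because an XPath that starts with `//` searches from the document root, even when it is called on a sub-selector such as `title`. The loop above works because `.//` keeps each query inside the current table or row. The following is a minimal, standard-library sketch of that difference. It is not Scrapy code: it assumes a trimmed, well-formed copy of the sample markup, and the names `SAMPLE`, `HOME`, `AWAY` and `matches_by_date` are made up for illustration. ElementTree has no `preceding-sibling` axis, so the date is tracked by walking the container's children in order instead.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup (an assumption: real
# pages are rarely valid XML, so Scrapy's selectors remain the right tool).
SAMPLE = """
<div id="fixtures-data">
  <h2 class="table-header"> Date 1 </h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td><p>
      <span class="team-home teams"><a href="random_team">team 1</a></span>
      <span class="team-away teams"><a href="random_team">team 2</a></span>
    </p></td></tr>
  </tbody></table>
  <h2 class="table-header"> Date 2 </h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td><p>
      <span class="team-home teams"><a href="random_team">team X</a></span>
      <span class="team-away teams"><a href="random_team">team Y</a></span>
    </p></td></tr>
  </tbody></table>
</div>
"""

HOME = ".//span[@class='team-home teams']/a"
AWAY = ".//span[@class='team-away teams']/a"

root = ET.fromstring(SAMPLE)
first_table = root.findall(".//table[@class='table-stats']")[0]

# Searching from the document root ignores which table we are "in".
all_home = [a.text for a in root.findall(HOME)]

# Searching relative to one table only sees that table's rows.
first_home = [a.text for a in first_table.findall(HOME)]


def matches_by_date(container):
    """Pair each table's rows with the most recent <h2> seen before it."""
    results = []
    date = None
    for child in container:
        if child.tag == "h2":
            date = child.text.strip()
        elif child.tag == "table":
            for row in child.findall(".//tr[@class='preview']"):
                results.append((date, row.find(HOME).text, row.find(AWAY).text))
    return results


print(all_home)
print(first_home)
print(matches_by_date(root))
```

In the scrapy shell the same distinction applies: `table.xpath('//span...')` would still scan the whole page, while `table.xpath('.//span...')` stays inside that table.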

In the parse() method you would need to instantiate an Item instance in the inner loop and yield it:

def parse(self, response):
    for table in response.xpath('//table[@class="table-stats"]'):
        date = table.xpath('./preceding-sibling::h2[1]/text()').extract()[0]
        for match in table.xpath('.//tr[@class="preview" and @id]'):
            home_team = match.xpath('.//span[@class="team-home teams"]/a/text()').extract()[0]
            away_team = match.xpath('.//span[@class="team-away teams"]/a/text()').extract()[0]

            item = MyItem()
            item['date'] = date
            item['home_team'] = home_team
            item['away_team'] = away_team
            yield item

where MyItem would be:

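from scrapy.item import Item, Field

# One item per match: the date heading plus the two team names.
class MyItem(Item):
    date = Field()
    home_team = Field()
    away_team = Field()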
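
The shell output shows the date coming back with surrounding whitespace (` Date 1    `), and the spider yields one flat item per match, whereas the question asks for the matches of each day. Here is a hedged, standard-library sketch of one way to tidy and group such records after extraction. `MatchRecord` and `group_by_date` are hypothetical names, not part of the original answer. As far as I know, recent Scrapy releases (2.2 and later) also accept dataclass objects as items, but treat that as an assumption and check it against the version you run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MatchRecord:
    """Hypothetical stand-in for MyItem, with the same three fields."""
    date: str
    home_team: str
    away_team: str

    def __post_init__(self):
        # Extracted text often carries padding from the page's indentation.
        self.date = self.date.strip()
        self.home_team = self.home_team.strip()
        self.away_team = self.away_team.strip()


def group_by_date(records):
    """Return {date: [(home_team, away_team), ...]}, keeping input order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.date].append((record.home_team, record.away_team))
    return dict(grouped)


records = [
    MatchRecord(" Date 1    ", "team 1", "team 2"),
    MatchRecord(" Date 1    ", "team 3", "team 4"),
    MatchRecord(" Date 2    ", "team X", "team Y"),
]
fixtures = group_by_date(records)
print(fixtures)
```

Keeping the yielded items flat (one per match) and grouping afterwards tends to be simpler than nesting matches inside a per-date item, since each record can be exported or filtered on its own.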
