How to scrape content with Scrapy when elements share the same class name

Views: 48 | Published: 2021/4/26 20:38:24 | Tags: python css xpath web-scraping scrapy


Question

I am using Scrapy to scrape data from this website, but I'm having an issue when scraping content from divs that share the same class name.

<div class="list">
   <a id="followed_by" name="followed_by"></a>
  <h4 class="li_group">Followed by</h4>
  <div class="soda odd"><a href="http://www.imdb.com/title/tt0094450">Dirty Dancing</a></div>
  <div class="soda even"><a href="http://www.imdb.com/title/tt0338096">Dirty Dancing: Havana Nights</a></div>
   <a id="version_of" name="version_of"></a>
  <h4 class="li_group">Version of</h4>
  <div class="soda odd"><a href="http://www.imdb.com/title/tt5262792">Dirty Dancing</a></div>
   <a id="remade_as" name="remade_as"></a>
  <h4 class="li_group">Remade as</h4>
  <div class="soda odd"><a href="http://www.imdb.com/title/tt0461062">Holiday</a></div>
</div>

I tried to use XPath, but I'm having a hard time when scraping from multiple pages. For example, when I try to scrape from this page, the XPath I used for the first page doesn't work.
Here is the code I tried:

class ImdbSpider(scrapy.Spider):
    name = "IMDB"
    allowed_domains = ["http://www.imdb.com"]
    start_urls = [l.strip() for l in open('1988.txt').readlines()]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        item = ImdbcoItem
        for sel in response.xpath('body'):
            item['Followed_by'] = sel.xpath('//*[@id="connections_content"]/div[2]/div[1]/a/text()').extract()
            item['version_of'] = sel.xpath('//*[@id="connections_content"]/div[2]/div[3]/a/text()').extract()
            item['Remade_as'] = sel.xpath('//*[@id="connections_content"]/div[2]/div[4]/a/text()').extract()
        return item

I want my output to look like this:
Followed by: Dirty Dancing, Dirty Dancing: Havana Nights
Version of: Dirty Dancing
Remade as: Holiday
Any help would be appreciated!

Recommended answer

Give this a try. I hope it solves the issue:

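def parse(self, response):
    item = ImdbcoItem()  # item class from the question; note the parentheses to create an instance
    for sel in response.css("div.list"):
        # Each group is an anchor with an id, then its h4 heading, then the first div.odd entry
        item['Followed_by'] = sel.css("a#followed_by+h4.li_group+div.odd a::text").extract()
        item['version_of'] = sel.css("a#version_of+h4.li_group+div.odd a::text").extract()
        item['Remade_as'] = sel.css("a#remade_as+h4.li_group+div.odd a::text").extract()
    return item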

If "Followed by" fails to give you all the results, then try this:

item['Followed_by'] = sel.css("a#followed_by+h4.li_group+div.odd a::text , a#followed_by+h4.li_group+div.odd+div.even a::text").extract()
