Scrapy: constructing non-duplicative list of absolute paths from relative paths

Problem description

Question: how do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths found in img src attributes?

Background: I am trying to use Scrapy to crawl a site, pull any links found in img src attributes, convert the relative paths to absolute paths, and then produce the absolute paths as a CSV or a Python list. I plan on combining the above with actually downloading files using Scrapy while concurrently crawling for links, but I'll cross that bridge when I get to it. For reference, here are some other details about the hypothetical target site (a short sketch of the path resolution follows the list):

  • The relative paths look like img src="/images/file1.jpg", where images is a directory (www.example.com/products/images) that cannot be directly crawled for file paths.
  • The relative paths for these images do not follow any logical naming convention (e.g., file1.jpg, file2.jpg, file3.jpg).
  • The image types differ across files, with PNG and JPG being the most common.
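
For reference, the conversion I'm after can be previewed outside Scrapy with the standard library's urljoin, which is essentially what response.urljoin does with the page URL. A minimal sketch, assuming the hypothetical page and path above:

from urllib.parse import urljoin

# Hypothetical page URL and root-relative image path from the example site.
page_url = 'https://www.example.com/products/some-product.html'
relative_src = '/images/file1.jpg'

# A leading slash makes the path root-relative, so it resolves against the
# domain root rather than against the /products/ directory of the page.
print(urljoin(page_url, relative_src))
# -> https://www.example.com/images/file1.jpg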

Problems experienced: Even after thoroughly reading the Scrapy documentation and going through a ton of fairly dated Stack Overflow questions [e.g., this question], I can't seem to get the precise output I want. I can pull the relative paths and reconstruct them, but the output is off. Here are the issues I've noticed with my current code:

  • In the CSV output, there are both populated rows and blank rows. My best guess is that each row represents the results of scraping a particular page for relative paths, which would mean a blank row is a negative result.

  • Each non-blank row in the CSV contains a list of URLs delimited by commas, whereas I would simply like an individual, non-duplicative value in each row. The population of a row with a comma-delimited list seems to support my suspicions about what is going on under the hood (illustrated below).
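
For illustration, with hypothetical file names, the current output packs one page's URLs into a single quoted, comma-joined cell (Scrapy's CSV exporter joins a list-valued field with commas), whereas I want one URL per row:

Current output:

url
"https://www.example.com/images/file1.jpg,https://www.example.com/images/file2.jpg"

Desired output:

url
https://www.example.com/images/file1.jpg
https://www.example.com/images/file2.jpg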

Current code: I execute the following code in the command line using 'scrapy crawl relpathfinder -o output.csv -t csv'.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def url_join(self, response):
        item = MyItem()
        item['url'] = []
        relative_url = response.xpath('//img/@src').extract()
        for link in relative_url:
            item['url'].append(response.urljoin(link))
        # All of this page's image URLs end up in a single list field on one item.
        yield item

Thanks!

Solution

What about:

def url_join(self, response):
    relative_url = response.xpath('//img/@src').extract()
    for link in relative_url:
        # Create and yield a fresh item per URL, so each CSV row holds one value.
        item = MyItem()
        item['url'] = response.urljoin(link)
        yield item
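
This yields one item per URL, which fixes the comma-joined rows, but it does not by itself deduplicate: an image linked from several pages is emitted once per page. A minimal sketch of one way to get non-duplicative output, tracking already-seen URLs on the spider (the seen attribute is an addition, not part of the original answer):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()  # absolute URLs already emitted across the whole crawl

    def url_join(self, response):
        for link in response.xpath('//img/@src').extract():
            absolute = response.urljoin(link)
            if absolute in self.seen:
                continue  # skip duplicates so every CSV row is unique
            self.seen.add(absolute)
            item = MyItem()
            item['url'] = absolute
            yield item

The same effect can be had with a small item pipeline that drops items whose url has already been seen, which keeps the spider itself stateless; the Scrapy docs include a duplicates-filter pipeline along those lines.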
