Scrapy: constructing non-duplicative list of absolute paths from relative paths

Problem description

Question: how do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths found in img src attributes?

Background: I am trying to use Scrapy to crawl a site, pull any links found in img src attributes, convert the relative paths to absolute paths, and then produce the absolute paths as a CSV or a Python list. I plan on combining the above with actually downloading files using Scrapy while concurrently crawling for links, but I'll cross that bridge when I get to it. For reference, here are some other details about the hypothetical target site (a short sketch of the path resolution follows the list):

  • The relative paths look like img src="/images/file1.jpg", where images is a directory (www.example.com/products/images) that cannot be directly crawled for file paths.
  • The relative paths for these images do not follow any logical naming convention (e.g., file1.jpg, file2.jpg, file3.jpg).
  • The image types differ across files, with PNG and JPG being the most common.
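
For reference, the conversion I'm after can be previewed outside Scrapy with the standard library's urljoin, which is essentially what response.urljoin does with the page URL. A minimal sketch, assuming the hypothetical page and path above:

from urllib.parse import urljoin

# Hypothetical page URL and root-relative image path from the example site.
page_url = 'https://www.example.com/products/some-product.html'
relative_src = '/images/file1.jpg'

# A leading slash makes the path root-relative, so it resolves against the
# domain root rather than against the /products/ directory of the page.
print(urljoin(page_url, relative_src))
# -> https://www.example.com/images/file1.jpg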

Problems experienced: Even after thoroughly reading the Scrapy documentation and going through a ton of fairly dated Stack Overflow questions [e.g., this question], I can't seem to get the precise output I want. I can pull the relative paths and reconstruct them, but the output is off. Here are the issues I've noticed with my current code:

  • In the CSV output, there are both populated rows and blank rows. My best guess is that each row represents the results of scraping a particular page for relative paths, which would mean a blank row is a negative result.

  • Each non-blank row in the CSV contains a list of URLs delimited by commas, whereas I would simply like an individual, non-duplicative value in each row. The population of a row with a comma-delimited list seems to support my suspicions about what is going on under the hood (illustrated below).
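
For illustration, with hypothetical file names, the current output packs one page's URLs into a single quoted, comma-joined cell (Scrapy's CSV exporter joins a list-valued field with commas), whereas I want one URL per row:

Current output:

url
"https://www.example.com/images/file1.jpg,https://www.example.com/images/file2.jpg"

Desired output:

url
https://www.example.com/images/file1.jpg
https://www.example.com/images/file2.jpg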

Current code: I execute the following code in the command line using 'scrapy crawl relpathfinder -o output.csv -t csv'.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def url_join(self, response):
        item = MyItem()
        item['url'] = []
        relative_url = response.xpath('//img/@src').extract()
        for link in relative_url:
            item['url'].append(response.urljoin(link))
        # All of this page's image URLs end up in a single list field on one item.
        yield item

Thanks!

Solution

What about:

def url_join(self, response):
    relative_url = response.xpath('//img/@src').extract()
    for link in relative_url:
        # Create and yield a fresh item per URL, so each CSV row holds one value.
        item = MyItem()
        item['url'] = response.urljoin(link)
        yield item
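
This yields one item per URL, which fixes the comma-joined rows, but it does not by itself deduplicate: an image linked from several pages is emitted once per page. A minimal sketch of one way to get non-duplicative output, tracking already-seen URLs on the spider (the seen attribute is an addition, not part of the original answer):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()  # absolute URLs already emitted across the whole crawl

    def url_join(self, response):
        for link in response.xpath('//img/@src').extract():
            absolute = response.urljoin(link)
            if absolute in self.seen:
                continue  # skip duplicates so every CSV row is unique
            self.seen.add(absolute)
            item = MyItem()
            item['url'] = absolute
            yield item

The same effect can be had with a small item pipeline that drops items whose url has already been seen, which keeps the spider itself stateless; the Scrapy docs include a duplicates-filter pipeline along those lines.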
