Scrapy: constructing non-duplicative list of absolute paths from relative paths
Question
Question: how do I use Scrapy to create a non-duplicative list of absolute paths from relative paths under the img src tag?
Background: I am trying to use Scrapy to crawl a site, pull any links under the img src tag, convert relative paths to absolute paths, and then produce the absolute paths in CSV or the list data type. I plan on combining the above function with actually downloading files using Scrapy and concurrently crawling for links, but I'll cross that bridge when I get to it. For reference, here are some other details about the hypothetical target site:
- The relative paths look like img src="/images/file1.jpg", where images is a directory (www.example.com/products/images) that cannot be directly crawled for file paths.
- The relative paths for these images do not follow any logical naming convention (e.g., file1.jpg, file2.jpg, file3.jpg).
- The image types differ across files, with PNG and JPG being the most common.
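For context on the conversion step, Scrapy's response.urljoin is a thin wrapper around Python's standard urllib.parse.urljoin, which resolves relative paths like the ones above against the page URL. A minimal sketch (the base URL here is a hypothetical page on the example site):

```python
from urllib.parse import urljoin

# Hypothetical page the image tags were found on.
base = "https://www.example.com/products/page1.html"

# A root-relative path ("/images/...") resolves against the site root.
print(urljoin(base, "/images/file1.jpg"))
# -> https://www.example.com/images/file1.jpg

# A plain relative path resolves against the current directory.
print(urljoin(base, "images/file2.png"))
# -> https://www.example.com/products/images/file2.png
```

This is why the same img src value can yield different absolute URLs depending on which page it was scraped from.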
Problems experienced: Even after thoroughly reading the Scrapy documentation and going through a ton of fairly dated Stack Overflow questions [e.g., this question], I can't seem to get the precise output I want. I can pull the relative paths and reconstruct them, but the output is off. Here are the issues I've noticed with my current code:
- In the CSV output, there are both populated rows and blank rows. My best guess is that each row represents the results of scraping a particular page for relative paths, which would mean a blank row is a negative result.
- Each non-blank row in the CSV contains a list of URLs delimited by commas, whereas I would simply like an individual, non-duplicative value in a row. The population of a row with a comma-delimited list seems to support my suspicions about what is going on under the hood.
Current code: I execute the following code from the command line using 'scrapy crawl relpathfinder -o output.csv -t csv'.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # scrapy.contrib.linkextractors is deprecated
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def url_join(self, response):
        item = MyItem()
        item['url'] = []
        relative_url = response.xpath('//img/@src').extract()
        for link in relative_url:
            item['url'].append(response.urljoin(link))
        yield item
Thank you!
Answer
What about:
def url_join(self, response):
    item = MyItem()
    item['url'] = []
    relative_url = response.xpath('//img/@src').extract()
    for link in relative_url:
        item['url'] = response.urljoin(link)
        yield item