Stripping of the additional items in xpath

Question

I'm trying to scrape items from this website.

The items are brand, model, and price. Because the page structure is complex, the spider uses two XPath selectors.

The brand and model come from one XPath, and the price comes from a different one. I'm joining them with the ( | ) operator, as @har07 suggested. I tested each XPath individually, and each one extracts the needed items correctly. However, after joining the two XPaths, the price item started picking up additional content, such as commas, and in the CSV output the prices no longer line up with the brand/model items.

This is what the spider's parse method looks like:

def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')

    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('text()').extract()

        items.append(item)

    return items

And this is what the CSV looks like after scraping:

Any suggestions on how to fix this?

Thanks.

Recommended answer

Basically, the issue is caused by your titles XPath. It goes down too deep into the tree, to the point where you need to join two XPaths in order to scrape both the brand/model field and the price field.
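To see why the joined XPath produces mismatched rows, here is a rough sketch that imitates the effect outside Scrapy. It uses the standard library's xml.etree.ElementTree, which has no ( | ) operator, so the union is approximated by collecting both kinds of nodes in document order. The sample markup and the names SAMPLE, union_like_nodes, and scrape_with_union are invented for illustration; the real page almost certainly differs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the real page.
SAMPLE = """<table border="0">
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Acme X-100 Pro</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>199</span></td>
  </tr>
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Globex Z-5</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>249</span></td>
  </tr>
</table>"""


def union_like_nodes(root):
    """Imitate an XPath union: name cells and price spans, in document order."""
    price_spans = {
        id(span)
        for span in root.findall(".//td[@class='cl-price-cont']//span[4]")
    }
    nodes = []
    for el in root.iter():  # iter() walks the tree in document order
        is_name_cell = el.tag == "td" and el.get("class") == "compact"
        if is_name_cell or id(el) in price_spans:
            nodes.append(el)
    return nodes


def scrape_with_union(root):
    """Run the same relative lookups on every node, as the original loop does."""
    rows = []
    for node in union_like_nodes(root):
        # Only a name cell contains the product link.
        name = node.findtext("div[@class='cl-prod-name']/a")
        # Only a price span carries the price as its own text.
        own_text = (node.text or "").strip() or None
        rows.append({"name": name, "price": own_text})
    return rows


if __name__ == "__main__":
    for row in scrape_with_union(ET.fromstring(SAMPLE)):
        print(row)
```

Each node in the joined set is either a name cell or a price span, never both, so every loop iteration fills in only half of an item. Two products yield four half-empty rows, which is the kind of mismatch described in the question.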

Modifying the titles XPath into a single XPath that selects the repeating element containing both brand/model and price (and then changing the brand, model, and price XPaths accordingly) means that you no longer get mismatches where the brand and model are in one item and the price is in the next item.

def parse(self, response):
    sel = Selector(response)
    # Select one node per product row, so each item gets its own name and price.
    titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
    items = []
    for t in titles:
        item = AltaItem()
        # Brand is the first word of the link text; model is everything after the first whitespace.
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        # Price is looked up relative to the same row.
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        items.append(item)
    return items
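The row-per-item idea can be tried without Scrapy or the live site. The sketch below is only an approximation under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, the markup in SAMPLE is made up, and scrape_rows, BRAND_RE, and MODEL_RE are hypothetical names. Unlike Scrapy's .re(), which returns a list, it stores plain strings.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the header row has no valign attribute.
SAMPLE = """<table class="table products cl">
  <tr><th>Name</th><th>Price</th></tr>
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Acme X-100 Pro</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>199</span></td>
  </tr>
  <tr valign="middle">
    <td class="compact"><div class="cl-prod-name"><a>Globex Z-5</a></div></td>
    <td class="cl-price-cont"><span>a</span><span>b</span><span>c</span><span>249</span></td>
  </tr>
</table>"""

# Same patterns as in the spider, written as raw strings.
BRAND_RE = re.compile(r"^([\w\-]+)")  # first word of the product name
MODEL_RE = re.compile(r"\s+(.*)$")    # everything after the first whitespace


def scrape_rows(root):
    """Build one item per product row, with all lookups relative to that row."""
    items = []
    for row in root.findall(".//tr[@valign='middle']"):
        name = row.findtext(
            "td[@class='compact']/div[@class='cl-prod-name']/a", default=""
        )
        brand = BRAND_RE.search(name)
        model = MODEL_RE.search(name)
        price = row.findtext("td[@class='cl-price-cont']//span[4]")
        items.append({
            "brand": brand.group(1) if brand else None,
            "model": model.group(1) if model else None,
            "price": price,
        })
    return items


if __name__ == "__main__":
    for item in scrape_rows(ET.fromstring(SAMPLE)):
        print(item)
```

Because the loop variable is the whole row, the name and the price are always read from the same product, so they cannot drift out of step with each other.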
