Excluding elements to be scraped

Problem description

I am trying to exclude certain elements from a list.

On the page http://www.persimmonhomes.com/rooley-park-10126 there are elements I want to scrape, which are (div class="housetype js-filter-housetype"), and elements I don't want to scrape, which are (div class="housetype js-filter-housetype" style="display: none;").

The HTML looks something like:

<div class="housetype js-filter-housetype"> 
<div class="housetype js-filter-housetype"> 
<div class="housetype js-filter-housetype"> 
<div class="housetype js-filter-housetype">
<div class="housetype js-filter-housetype"> 
<div class="housetype js-filter-housetype" style="display: none;">
<div class="housetype js-filter-housetype" style="display: none;">

I am trying to write code that excludes the divs with class="housetype js-filter-housetype" and style="display: none;".

My current code to do this is:

start_urls = [
    "http://www.persimmonhomes.com/rooley-park-10126",
]

def parse(self, response):
    for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
        item = PersimmonItem()
        item['housetypeheading'] = sel.xpath('//*[@class="houses-list js-scrollable js-filterable js-houselist"]//*[not(@style="display: none;")]/h2[@class="housetype__heading"]').extract()
        yield item

So far, this does not work. It scrapes all the elements, whether or not they have style="display: none;". I have also tried [not(contains(@style, "display: none;"))], but so far no luck.

May I ask for any ideas?

Solution

If you want to ignore every div that has a style attribute at all:

"//div[@class='housetype js-filter-housetype' and not(@style)]"

Or, to exclude only that particular style, combine the two conditions with and:

"//div[@class='housetype js-filter-housetype' and not(contains(@style,'display: none;'))]"
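
As a minimal sketch of how that second expression behaves, the snippet below runs it against a small hand-written sample with lxml (the library Scrapy's selectors are built on, assumed to be installed). The sample markup, the heading texts, and the helper name visible_headings are made up for illustration and are not taken from the real page. The point it shows is that the predicate sits on the housetype div itself, and the path only then steps down to the heading.

```python
from lxml import html

# Hypothetical sample modelled on the snippet in the question;
# the heading texts are invented for illustration.
SAMPLE = """
<div class="houses-list js-scrollable js-filterable js-houselist">
  <div class="housetype js-filter-housetype">
    <h2 class="housetype__heading">Visible A</h2>
  </div>
  <div class="housetype js-filter-housetype">
    <h2 class="housetype__heading">Visible B</h2>
  </div>
  <div class="housetype js-filter-housetype" style="display: none;">
    <h2 class="housetype__heading">Hidden C</h2>
  </div>
</div>
"""

# Filter on the housetype div first, then select its heading text.
VISIBLE_HEADINGS = (
    "//div[@class='housetype js-filter-housetype'"
    " and not(contains(@style, 'display: none;'))]"
    "//h2[@class='housetype__heading']/text()"
)


def visible_headings(markup):
    """Return heading texts of housetype divs that are not hidden inline."""
    tree = html.fromstring(markup)
    return [text.strip() for text in tree.xpath(VISIBLE_HEADINGS)]


print(visible_headings(SAMPLE))  # ['Visible A', 'Visible B']
```

In the spider from the question, the same expression could presumably be passed to response.xpath(...) in place of the original selector. That assumes the live page's markup matches the sample above, which has not been checked here.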
