Excluding elements to be scraped
Problem description
I am trying to exclude certain elements from a list.
On the page http://www.persimmonhomes.com/rooley-park-10126 there are elements I want to scrape (div class="housetype js-filter-housetype") and elements I do not want to scrape (div class="housetype js-filter-housetype" style="display: none;").
The HTML looks something like this:
<div class="housetype js-filter-housetype">
<div class="housetype js-filter-housetype">
<div class="housetype js-filter-housetype">
<div class="housetype js-filter-housetype">
<div class="housetype js-filter-housetype">
<div class="housetype js-filter-housetype" style="display: none;">
<div class="housetype js-filter-housetype" style="display: none;">
I am trying to write code to exclude the div class="housetype js-filter-housetype" style="display: none;".
My current code to do this is:
start_urls = [
    "http://www.persimmonhomes.com/rooley-park-10126",
]

def parse(self, response):
    for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
        item = PersimmonItem()
        item['housetypeheading'] = sel.xpath('//*[@class="houses-list js-scrollable js-filterable js-houselist"]//*[not(@style="display: none;")]/h2[@class="housetype__heading"]').extract()
        yield item
So far, this does not work. It scrapes all the elements whether or not they have style="display: none;". I have also tried [not(contains(@style, "display: none;"))], but so far no luck.
Does anyone have any ideas?
Solution
If you want to ignore every div that has a style attribute:
"//div[@class='housetype js-filter-housetype' and not(@style)]"
Or, to exclude only that particular style, combine the two conditions with and:
"//div[@class='housetype js-filter-housetype' and not(contains(@style,'display: none;'))]"
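To see the difference between these expressions and the one in the question, here is a minimal sketch that runs them against a small HTML sample. It uses lxml directly (the library Scrapy's selectors are built on) so it runs outside a Scrapy project. The sample markup, the heading texts, and in particular the inner div wrapping each h2 are assumptions for illustration; the real page may be structured differently.

```python
from lxml import html

# Hypothetical markup modelled on the snippet in the question. The wrapper
# <div> around each <h2> and the heading texts are assumptions.
SAMPLE = """
<div class="houses-list js-scrollable js-filterable js-houselist">
  <div class="housetype js-filter-housetype">
    <div><h2 class="housetype__heading">Visible A</h2></div>
  </div>
  <div class="housetype js-filter-housetype">
    <div><h2 class="housetype__heading">Visible B</h2></div>
  </div>
  <div class="housetype js-filter-housetype" style="display: none;">
    <div><h2 class="housetype__heading">Hidden C</h2></div>
  </div>
</div>
"""

tree = html.document_fromstring(SAMPLE)

# Expression from the question: the predicate is tested on every descendant,
# so the unstyled wrapper <div> inside the hidden block also satisfies it.
original = tree.xpath(
    '//*[@class="houses-list js-scrollable js-filterable js-houselist"]'
    '//*[not(@style="display: none;")]/h2[@class="housetype__heading"]/text()'
)

# Variant 1: skip every housetype div that carries any style attribute.
no_style = tree.xpath(
    "//div[@class='housetype js-filter-housetype' and not(@style)]"
    "//h2[@class='housetype__heading']/text()"
)

# Variant 2: skip only the housetype divs whose style contains 'display: none;'.
not_hidden = tree.xpath(
    "//div[@class='housetype js-filter-housetype' "
    "and not(contains(@style, 'display: none;'))]"
    "//h2[@class='housetype__heading']/text()"
)

print(original)    # includes the heading from the hidden block
print(no_style)    # visible headings only
print(not_hidden)  # visible headings only
```

In a Scrapy spider the same expressions can be passed to response.xpath(...). Note that an expression starting with // searches the whole document even when it is called on sel inside the loop; start it with .// to search only below sel.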