Scrape data through XPath from a div that contains JavaScript in Scrapy (Python)


Problem description

I am working with Scrapy: I am scraping a site and using XPath to extract items. Some of the divs contain JavaScript. When my XPath goes down to the div that contains the JavaScript code, it returns an empty list; when I stop the XPath before that div, I can fetch the HTML data.

HTML code

<div class="subContent2">    
   <div id="contentDetails">
       <div class="eventDetails">
            <h2>
                <a href="javascript:;" onclick="jdevents.getEvent(117032)">Some data</a>
            </h2>
       </div>
   </div>
</div> 

Spider Code

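# Legacy Scrapy (0.x) imports for the classes used below
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class ExampleSpider(BaseSpider):
    name = "example"
    domain_name = "www.example.com"
    start_urls = ["http://www.example.com/jkl/index.php"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Selects the eventDetails <div> element itself, not the text inside it
        required_data = hxs.select('//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]')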

So how can I get the text ("Some data") from the anchor tag inside the h2 element shown above? Is there an alternative way to fetch data from elements that contain JavaScript in Scrapy?

Solution

<div class="subContent2">    
   <div id="contentDetails">
       <div class="eventDetails">
            <h2>
                <a href="javascript:;" onclick="jdevents.getEvent(117032)">Some data</a>
            </h2>
       </div>
   </div>
</div> 

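In this case the JavaScript is not what prevents you from getting the 'Some data' string. The text is ordinary content of the <a> element in the HTML, and the href and onclick attributes do not affect XPath matching. Your expression stops at the eventDetails div, so it selects that element rather than the text inside it.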

You need either to select the text node of the nested anchor:

required_data = hxs.select('//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]/h2/a/text()')

or to use the XPath string() function, which returns the concatenated text of the div and all of its descendants:

required_data = hxs.select('string(//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"])')
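To check the two approaches without running a spider, here is a minimal sketch that uses only Python's standard library (xml.etree.ElementTree) as a stand-in for Scrapy's selector. This is a substitution, not Scrapy's API: ElementTree supports only a subset of XPath, so `.text` plays the role of `text()` and `itertext()` plays the role of `string()`. It works here only because the fragment from the question happens to be well-formed XML; the `<root>` wrapper and the variable names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# The HTML fragment from the question. It happens to be well-formed XML,
# so the standard-library XML parser can read it; real pages usually
# need a proper HTML parser such as the one Scrapy uses.
HTML = """
<div class="subContent2">
   <div id="contentDetails">
       <div class="eventDetails">
            <h2>
                <a href="javascript:;" onclick="jdevents.getEvent(117032)">Some data</a>
            </h2>
       </div>
   </div>
</div>
"""

# Wrap the fragment in a hypothetical <root> element so the outer div
# is a descendant that ".//" can match.
root = ET.fromstring("<root>" + HTML + "</root>")

# Same path as in the question, down to the eventDetails div.
EVENT_DETAILS = (
    ".//div[@class='subContent2']"
    "/div[@id='contentDetails']"
    "/div[@class='eventDetails']"
)

# Option 1: walk down to the <a> element and read its text
# (the counterpart of .../h2/a/text() in the answer).
anchor = root.find(EVENT_DETAILS + "/h2/a")
anchor_text = anchor.text

# Option 2: concatenate every text node under the div
# (the counterpart of string(...) in the answer).
div = root.find(EVENT_DETAILS)
div_text = "".join(div.itertext())

print(repr(anchor_text))
print(repr(div_text.strip()))
```

Note that the second option also picks up the whitespace between the tags, which is why the result is stripped before printing. The onclick attribute is still present on the element and simply plays no part in either lookup. In current Scrapy releases the same XPath expressions from the answer would be passed to response.xpath(...) instead of hxs.select(...).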
