Scrapy XPath not extracting div containing special characters <%=


Problem description

I am new to Scrapy. I am trying to extract the h2 text from the following URL: 'https://www.tysonprop.co.za/agents/'

I have two questions:

  1. My XPath can reach the script element, but it cannot find the h2 or div elements inside the script tag. I even tried saving the HTML file to my machine and scraping that file, but the same problem occurs. I have triple-checked my XPath code, and everything seems to be in order.

  2. When the website is displayed in my browser, branch.branch_name resolves to "Tysen Properties Head Office". How would one get the value (i.e. "Tysen Properties Head Office") instead of the variable name (branch.branch_name)?

My Python code:

import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]')
        div = script.xpath('./div[contains(@class,"branch-container")]')
        h2 = div.xpath('/h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}


HTML extract below:

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>

Recommended answer

branch.branch_name looks like a field of a JSON object. Is there a request that loads the data you are looking for? Let's see.
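On the first question, a likely reason the XPath finds nothing is that HTML parsers treat the content of a script element as raw text rather than as child elements, so there is no div or h2 node under the script tag to select. The sketch below illustrates this with Python's standard-library html.parser as a stand-in (an assumption for illustration; Scrapy itself parses with lxml, which handles script content the same way). The TagRecorder class and the shortened PAGE string are hypothetical names introduced here.

```python
from html.parser import HTMLParser

# Shortened copy of the template from the question.
PAGE = """
<script type="text/html" id="id_branch_template">
  <div class="clearfix margin-top30 branch-container">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
  </div>
</script>
"""


class TagRecorder(HTMLParser):
    """Records the start tags the parser reports and the raw text it receives."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # Called only for markup the parser recognises as an element.
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Called for text, including everything inside <script>.
        self.text_chunks.append(data)


recorder = TagRecorder()
recorder.feed(PAGE)
recorder.close()

raw_text = "".join(recorder.text_chunks)
print(recorder.start_tags)   # the script tag is the only element reported
print("<h2" in raw_text)     # the h2 markup arrived as plain text instead
```

In a spider, the template text could be read with '//script[@id="id_branch_template"]/text()' and re-parsed with scrapy.Selector(text=...), but that would at best expose the <%= branch.branch_name %> placeholder, not the branch name itself, which is why the data request matters.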

If you open the Network tab in your browser's developer tools and search through the requests, you will find this AJAX call, which loads exactly the data you are looking for. So:

import json
import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        branch_name = json_data['branch']['branch_name']
        yield {'branchName': branch_name}
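The key lookup in parse raises an exception if the response is not the expected JSON. A minimal sketch of a more defensive lookup follows, assuming the payload nests the name under "branch" and "branch_name" as the spider above reads it. SAMPLE_BODY and extract_branch_name are hypothetical names, and the sample body is made up for illustration; it is not a captured response from the site.

```python
import json

# Hypothetical sample body: the nesting mirrors the keys the spider reads,
# and the value is the branch name quoted in the question.
SAMPLE_BODY = '{"branch": {"branch_name": "Tysen Properties Head Office"}}'


def extract_branch_name(body):
    """Return branch.branch_name from a JSON body, or None if it is absent."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # The body was not valid JSON (for example, an HTML error page).
        return None
    if not isinstance(data, dict):
        return None
    branch = data.get("branch")
    if not isinstance(branch, dict):
        return None
    return branch.get("branch_name")


print(extract_branch_name(SAMPLE_BODY))
print(extract_branch_name("{}"))
```

Inside the spider, the same helper could be called as extract_branch_name(response.text), skipping the item when it returns None.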
