Scrapy xpath not extracting div containing special characters <%=

Problem description

I am new to Scrapy. I am trying to extract the h2 text from the following URL: 'https://www.tysonprop.co.za/agents/'

I have 2 problems:

  1. My xpath can get to the script element, but it cannot find the h2 or the div elements inside the script tag. I've even tried saving the HTML file to my machine and scraping that file, but the same problem occurs. I have triple-checked my xpath code, and everything seems to be in order.

  2. When the website is displayed in my browser, branch.branch_name resolves to "Tysen Properties Head Office". How would one get the value (i.e. "Tysen Properties Head Office") instead of the variable name (branch.branch_name)?

My Python code:

import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]')
        div = script.xpath('./div[contains(@class,"branch-container")]')
        h2 = div.xpath('/h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}


HTML extract below:

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>

Solution

Doesn't branch.branch_name look like a key path into a JSON object? If so, there may be a request that loads the data you are looking for. Let's check.
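As for why the xpath finds nothing inside the script tag: an HTML parser treats the body of a script element as raw text, so the div and h2 in the template never become elements that an xpath step can reach. The sketch below illustrates this with Python's standard-library html.parser rather than Scrapy's own lxml-based selectors (a deliberate substitution so the example runs without extra packages). The trimmed SAMPLE string and the TagLogger and scan names are made up for the illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of the template from the question (attributes shortened).
SAMPLE = """<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="branch-container">
    <h2 class="branch-name"><%= branch.branch_name %></h2>
  </div>
</script>"""


class TagLogger(HTMLParser):
    """Hypothetical helper: records reported start tags and raw text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text_chunks.append(data)


def scan(html):
    """Parse html and return (start tags seen, all text joined together)."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.start_tags, "".join(parser.text_chunks)


start_tags, text = scan(SAMPLE)
print(start_tags)                    # only the script tag is reported
print("<h2" in text)                 # the h2 markup arrives as plain text
print("branch.branch_name" in text)  # and so does the placeholder
```

In Scrapy the same raw text should be reachable with response.xpath('//script[@id="id_branch_template"]/text()').get(), but it would still contain the placeholder <%= branch.branch_name %> and not the rendered value.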

If you open the Network tab in your browser's developer tools and search through the requests, you will find an AJAX call (the URL requested in the spider below) that loads exactly the data you are looking for. So:

import json
import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        branch_name = json_data['branch']['branch_name']
        yield {'branchName': branch_name}
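
To connect this back to the second question: the browser fills each <%= ... %> placeholder in the template with values from that JSON response. A minimal sketch of the substitution is shown below, assuming the payload has the shape {'branch': {'branch_name': ...}} used in the spider above. The render function, the PLACEHOLDER pattern, the "id" key, and the sample values are all hypothetical, and the site's real template engine may behave differently.

```python
import json
import re

# Matches placeholders such as <%= branch.branch_name %> (dotted key paths only).
PLACEHOLDER = re.compile(r"<%=\s*([\w.]+)\s*%>")


def render(template, context):
    """Replace each placeholder with the value found at its dotted key path."""

    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the path is missing
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Made-up payload; only the 'branch' -> 'branch_name' path comes from the answer.
payload = json.loads('{"branch": {"id": 25, "branch_name": "Example Branch"}}')
template = '<h2 class="branch-name"><%= branch.branch_name %></h2>'

print(render(template, payload))
```

For scraping, reading the value straight from the JSON as the spider does is simpler. The substitution is only there to show where the rendered text comes from.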
