Scrapy XPath with following-sibling between two h2 tags


Question


I have a poorly designed HTML page that I am trying to extract data from with Scrapy. This is the snippet I am interested in:

<html>
    <h2 class="schoolName">Graduate School of Business</h2>
        <ul title="Graduate School of Business departments - part 1"></ul>
        <ul title="Graduate School of Business departments - part 2"></ul>
        <ul title="Graduate School of Business departments - part 3"></ul>
   <h2 class="schoolName">School of Law</h2>
       <ul title="School of Law departments - part 1"></ul>
       <ul title="School of Law departments - part 2"></ul>
  <h2 class="schoolName">School of Medicine</h2>
      <ul title="School of Medicine departments - part 1"></ul>
</html>

I specifically want to know the number of schools and the number of departments under each school. So I find the list of all schools as follows:

>>> schools = response.xpath('//h2[@class="schoolName"]/text()').getall()
>>> schools
['Graduate School of Business', 'School of Law', 'School of Medicine']

Then, for each school, I look up the departments under it as follows:

>>> for school in schools:
...     print(school)
...     print(response.xpath(f'//h2[@class="schoolName"][text()[contains(.,"{school}")]]/following-sibling::ul/@title').extract())
...     print ("-----------------------------")
...
Graduate School of Business
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3', 'School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Law
['School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Medicine
['School of Medicine departments - part 1']
-----------------------------

This is obviously not working as expected, because the following-sibling axis selects every ul that comes after the matched h2, not just the ones between that h2 and the next. How do I restrict the selection to the ul tags between two h2 tags?

Solution

One technique is to pick a common divider element that marks the beginning of each new block of info (here, h2), use count() with the preceding-sibling axis to work out that divider's 1-based position among its fellow dividers, and then select all the data elements (here, ul) whose number of preceding divider siblings equals that position.
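To see why the count works as a block label, it can help to print, for every ul, how many h2 siblings come before it. The following is only a minimal sketch using the snippet from the question; `DOC` and `divider_counts` are names made up for this illustration, and it assumes the markup is well-formed enough for `etree.fromstring`.

```python
from lxml import etree

# The snippet from the question, which happens to be well-formed XML.
DOC = """<html>
    <h2 class="schoolName">Graduate School of Business</h2>
        <ul title="Graduate School of Business departments - part 1"></ul>
        <ul title="Graduate School of Business departments - part 2"></ul>
        <ul title="Graduate School of Business departments - part 3"></ul>
    <h2 class="schoolName">School of Law</h2>
        <ul title="School of Law departments - part 1"></ul>
        <ul title="School of Law departments - part 2"></ul>
    <h2 class="schoolName">School of Medicine</h2>
        <ul title="School of Medicine departments - part 1"></ul>
</html>"""

root = etree.fromstring(DOC)


def divider_counts(tree):
    """Pair every ul title with the number of h2 siblings that precede it."""
    pairs = []
    for ul in tree.xpath("//ul"):
        # XPath 1.0 count() returns a float, so cast it to int.
        n = int(ul.xpath("count(preceding-sibling::h2)"))
        pairs.append((ul.get("title"), n))
    return pairs


for title, n in divider_counts(root):
    print(n, title)
```

Every ul under the first school is labelled 1, every ul under the second school 2, and so on, which is the number the final query filters on.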

In an IPython shell:

In [1]: from lxml import etree

In [2]: string = '''<html>
   ...:     <h2 class="schoolName">Graduate School of Business</h2>
   ...:         <ul title="Graduate School of Business departments - part 1"></ul>
   ...:         <ul title="Graduate School of Business departments - part 2"></ul>
   ...:         <ul title="Graduate School of Business departments - part 3"></ul>
   ...:    <h2 class="schoolName">School of Law</h2>
   ...:        <ul title="School of Law departments - part 1"></ul>
   ...:        <ul title="School of Law departments - part 2"></ul>
   ...:   <h2 class="schoolName">School of Medicine</h2>
   ...:       <ul title="School of Medicine departments - part 1"></ul>
   ...: </html>'''

In [3]: root = etree.fromstring(string)

In [4]: schools = root.xpath('//h2[@class="schoolName"]/text()')

In [5]: schools
Out[5]: ['Graduate School of Business', 'School of Law', 'School of Medicine']

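In [6]: for school in schools:
   ...:     print(school)
   ...:     # 1-based position of this h2 among its h2 siblings
   ...:     position = int(root.xpath(f'count(//h2[text()="{school}"]/preceding-sibling::h2) + 1'))
   ...:     print(f"Position: {position}")
   ...:     # every ul preceded by exactly `position` h2 siblings belongs to this school
   ...:     print(root.xpath(f'//ul[count(preceding-sibling::h2) = {position}]/@title'))
   ...: 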
Graduate School of Business
Position: 1
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3']
School of Law
Position: 2
['School of Law departments - part 1', 'School of Law departments - part 2']
School of Medicine
Position: 3
['School of Medicine departments - part 1']
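
Since the original goal was the number of schools and the number of departments under each one, the same grouping can also be done without XPath at all, by walking the siblings once in document order. This is a different technique from the count() query above, shown only as a rough sketch: it uses the standard library's `xml.etree.ElementTree`, `HTML` and `group_departments` are hypothetical names, and it assumes well-formed markup and unique school names.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; a real page would likely need an HTML parser.
HTML = """<html>
    <h2 class="schoolName">Graduate School of Business</h2>
        <ul title="Graduate School of Business departments - part 1"></ul>
        <ul title="Graduate School of Business departments - part 2"></ul>
        <ul title="Graduate School of Business departments - part 3"></ul>
    <h2 class="schoolName">School of Law</h2>
        <ul title="School of Law departments - part 1"></ul>
        <ul title="School of Law departments - part 2"></ul>
    <h2 class="schoolName">School of Medicine</h2>
        <ul title="School of Medicine departments - part 1"></ul>
</html>"""


def group_departments(markup):
    """Map each school name to the ul titles found before the next h2."""
    groups = {}
    current = None  # name of the most recent h2 seen so far
    # Iterating an element yields its direct children in document order.
    for node in ET.fromstring(markup):
        if node.tag == "h2" and node.get("class") == "schoolName":
            current = (node.text or "").strip()
            groups[current] = []  # assumption: a repeated name would be reset here
        elif node.tag == "ul" and current is not None:
            groups[current].append(node.get("title"))
    return groups


departments = group_departments(HTML)
print(f"{len(departments)} schools")
for school, titles in departments.items():
    print(f"{school}: {len(titles)} departments")
```

The single pass avoids interpolating school names into XPath strings, so names containing quotes do not need escaping, at the cost of doing the grouping in Python rather than in the query.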
