XPath: Select Certain Child Nodes
Question
I'm using XPath
with Scrapy
to scrape data off of a movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one Xpath
string.
Depending on the movie web page from which I'm scraping data, sometimes the data I need is located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14000 movies, so this process needs to be automated.
Using this as an example, I will need the actor(s), director(s) and producer(s).
This is the Xpath
to the director. Note: the %s corresponds to a determined index where that information is found - in the Action Jackson example, director
is found at [1]
and actors
at [2]
.
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link to a page on the director exists, this would be the Xpath
:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit more tricky, as there are <br>
elements included for subsequent actors listed, which may be children of an /a
or children of the parent /font
, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
This gets almost all of the actors (except those with font/br
).
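To make the link/no-link difference concrete, here is a minimal self-contained sketch. It uses the stdlib's xml.etree.ElementTree (whose limited XPath support covers these child paths) and invented sample markup that only imitates the page, so treat the structure and names as assumptions:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating an "mp_box_content" block -- not the real page.
html = ('<div class="mp_box_content"><table>'
        '<tr><td>Director:</td><td><font><a href="/d">Craig R. Baxley</a></font></td></tr>'
        '<tr><td>Writer:</td><td><font>Vincent Patrick</font></td></tr>'
        '</table></div>')
root = ET.fromstring(html)

# When a link exists, the name sits under font/a ...
linked = [a.text for a in root.findall('table/tr[1]/td[2]/font/a')]
# ... when it does not, the name is the font element's own text.
plain = [f.text for f in root.findall('table/tr[2]/td[2]/font')]
print(linked)  # ['Craig R. Baxley']
print(plain)   # ['Vincent Patrick']
```

The same two-shapes problem is what forces the /font/text() versus /font/a/text() variants above.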
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"]
- everything I have works EXCEPT that I also end up getting some digits from other mp_box_content
. Also I have added numerous try:
, except:
statements in order to get everything (actors, directors, producers who both have and do not have links associated with them). For example, the following is my Scrapy
code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover for the facts that: the first actor may not have a link associated with him/her and subsequent actors do, the first actor may have a link associated with him/her but the rest may not.
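One way to avoid that branching entirely is to collect every descendant text node of the cell in one pass, whichever element happens to hold it. A minimal sketch using the stdlib's ElementTree (standing in for hxs here, with made-up sample markup):

```python
import xml.etree.ElementTree as ET

# Invented cell: the first actor is linked, the second is plain text after a <br/>.
td = ET.fromstring('<td><font><a href="/p1">Carl Weathers</a><br/>Sharon Stone</font></td>')

# itertext() yields all text nodes under the cell, linked or not,
# so one expression replaces both try/except branches.
names = "".join(td.itertext())
print(names)  # Carl WeathersSharon Stone
```

Note the names run together; splitting them apart again is a separate problem caused by the <br> separators.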
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposition using lxml.html
(and a bit of lxml.etree
) directly
First, I recommend you select the td[2]
cells by the text content of td[1]
, with expressions like .//tr[starts-with(td[1], "Director")]/td[2]
to account for "Director" or "Directors".
Second, testing various expressions with or without <font>
, with or without <a>
etc., makes the code difficult to read and maintain, and since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2])
to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode)
on the selected elements.
And for the <br>
issue with multiple names, what I generally do is modify the lxml
tree containing the targeted content to add a special formatting character to the <br>
elements' .text
or .tail
, for example a \n
, with one of lxml
's iter()
functions. This can be useful for other HTML block elements, like <hr>
for example.
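The marker trick can be shown in isolation before the full spider. A sketch with the stdlib's ElementTree (the answer uses lxml, but both expose iter() and .tail the same way; the <font> sample content is invented):

```python
import xml.etree.ElementTree as ET

MARKER = "|"
# Invented multi-name cell content with <br> separators between names.
font = ET.fromstring('<font>Carl Weathers<br/>Craig T. Nelson<br/>Sharon Stone</font>')

# Prepend the marker to the text that follows each <br>, i.e. its .tail ...
for br in font.iter("br"):
    br.tail = MARKER + (br.tail or "")

# ... then serialize to text and split on the marker to recover one name per entry.
names = "".join(font.itertext()).split(MARKER)
print(names)  # ['Carl Weathers', 'Craig T. Nelson', 'Sharon Stone']
```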
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]
    # locate the 2nd cell by the text content of the first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath(
        './/tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)
        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')
        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:
            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                    category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                     category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Composer")
            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)
            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
You can do similar things with HtmlXPathSelector
of course (with the string()
XPath function for example), but without modifying the tree for <br>
(how to do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']