XPath: Select Certain Child Nodes


Problem Description

I'm using XPath with Scrapy to scrape data from the movie website BoxOfficeMojo.com.

As a general question: I'm wondering how to select certain child nodes of one parent node, all in a single XPath string.

Depending on the movie web page from which I'm scraping data, the data I need is sometimes located in different child nodes, for example depending on whether or not there is a link. I will be going through about 14,000 movies, so this process needs to be automated.

Using this as an example, I will need the actor(s), director(s), and producer(s).

This is the XPath to the director. Note: the %s corresponds to a previously determined index where that information is found - in the Action Jackson example, the director is found at [1] and the actors at [2].

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()

However, if a link to a page on the director exists, this would be the XPath:

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()

Actors are a bit trickier, as there are <br> tags included for the subsequent actors listed, and these may be children of an /a or children of the parent /font, so:

//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()

gets almost all of the actors (except those that appear after a font/br).
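To make the problem concrete, here is a minimal, self-contained sketch using lxml. The HTML fragment below is made up, modelled on the markup described above (only the mp_box_content class and the actor names from the Action Jackson example further down are taken from the page); it shows what each of the two expressions returns when the first actor is linked and the rest are plain text after <br> tags:

import lxml.html

# made-up "Actors" row, modelled on the structure described in the question:
# the first actor is wrapped in a link, the others are plain text after <br>
doc = lxml.html.document_fromstring(
    '<div class="mp_box_content"><table><tr>'
    '<td>Actors:</td>'
    '<td><font><a href="#">Carl Weathers</a>'
    '<br>Craig T. Nelson<br>Sharon Stone</font></td>'
    '</tr></table></div>')

# the //a variant only sees the linked actor (Carl Weathers)
print doc.xpath('//div[@class="mp_box_content"]/table/tr[1]/td[2]/font//a/text()')

# the plain text() variant only sees the unlinked actors, i.e. the text
# nodes that follow each <br> (Craig T. Nelson, Sharon Stone)
print doc.xpath('//div[@class="mp_box_content"]/table/tr[1]/td[2]/font/text()')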

Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements - everything I have works, EXCEPT that I also end up getting some digits from the other mp_box_content divs. I have also added numerous try:/except: statements in order to get everything (actors, directors, and producers, both with and without links associated with them). For example, the following is my Scrapy code for actors:

 actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
 try:
     second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
     for n in second:
         actors.append(n)
 except:
     actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()

This is an attempt to cover the cases where the first actor may not have a link associated with him/her while subsequent actors do, or the first actor may have a link while the rest do not.

I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.

Solution

I am assuming you are only interested in textual content, not the links to actors' pages etc.

Here is a suggestion using lxml.html (and a bit of lxml.etree) directly:

  • First, I recommend you select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for both "Director" and "Directors".

  • Second, testing various expressions with or without <font>, with or without <a>, etc. makes the code difficult to read and maintain. Since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.

  • And for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content, adding a special formatting character to the <br> elements' .text or .tail (for example a \n) with one of lxml's iter() functions. This can also be useful for other HTML block elements, <hr> for example. (See the short standalone sketch right after this list.)
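As a small, isolated illustration of that last point (a sketch only: the fragment below is made up, and the full spider further down does the same thing inside parse() with a "|" marker), the <br> trick reduces to a few lines of lxml:

import lxml.html

MARKER = "|"

# made-up "Actors" cell with several names separated by <br> tags
doc = lxml.html.document_fromstring(
    '<table><tr><td><font><a href="#">Carl Weathers</a>'
    '<br>Craig T. Nelson<br>Sharon Stone</font></td></tr></table>')
cell = doc.xpath('//td')[0]

# put the marker on every <br> so the names stay separated in the text dump
for br in cell.iter("br"):
    br.text = MARKER

# serialize to plain text and split on the marker:
# yields the three names as separate strings
print lxml.html.tostring(cell, method="text", encoding=unicode).split(MARKER)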

You may see better what I mean with some spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"
def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)

        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')
    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:

            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items

The code can be simplified for sure, but I hope you get the idea.
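For completeness, the last comment in parse() mentions populating Scrapy items. A minimal sketch of what that could look like follows; the item class and its field names are my own assumptions, not part of the original answer:

from scrapy.item import Item, Field

class MovieItem(Item):
    # hypothetical field names -- adjust to whatever you actually store
    directors = Field()
    actors = Field()
    producers = Field()
    writers = Field()
    composers = Field()

# inside parse(), the print statements could then be replaced with e.g.:
#     item = MovieItem()
#     item["directors"] = directors
#     item["actors"] = actors
#     item["producers"] = producers
#     item["writers"] = writers
#     item["composers"] = composers
#     yield item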

You can do similar things with HtmlXPathSelector of course (with the string() XPath function for example), but without modifying the tree for the <br> elements (how would you do that with hxs?) it only works for single names in your case:

>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
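As a side note (my own untested sketch, not part of the original answer): if you only need the names one by one with hxs, selecting the individual text nodes instead of using string() should already return them separately, because the <br> elements split the text into separate nodes; whitespace-only results may still need stripping:

>>> hxs.select('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[starts-with(td[1], "Actor")]/td[2]//text()').extract()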
