XPath: Select Certain Child Nodes
Question
I'm using XPath
with Scrapy
to scrape data off of a movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one Xpath
string.
Depending on the movie web page from which I'm scraping data, sometimes the data I need is located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14000 movies, so this process needs to be automated.
Using this as an example, I will need the actor(s), director(s) and producer(s).
This is the Xpath
to the director. Note: the %s corresponds to a determined index where that information is found - in the Action Jackson example, director
is found at [1]
and actors
at [2]
.
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link to a page on the director exists, this would be the Xpath
:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit more tricky, as there are <br>
elements included for subsequent actors listed, which may be children of an /a
or children of the parent /font
, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
This gets almost all of the actors (except those with font/br
).
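To make the link/no-link difference concrete, here is a minimal self-contained sketch. It uses the stdlib's xml.etree.ElementTree (whose limited XPath support covers these child paths) and invented sample markup that only imitates the page, so treat the structure and names as assumptions:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating an "mp_box_content" block -- not the real page.
html = ('<div class="mp_box_content"><table>'
        '<tr><td>Director:</td><td><font><a href="/d">Craig R. Baxley</a></font></td></tr>'
        '<tr><td>Writer:</td><td><font>Vincent Patrick</font></td></tr>'
        '</table></div>')
root = ET.fromstring(html)

# When a link exists, the name sits under font/a ...
linked = [a.text for a in root.findall('table/tr[1]/td[2]/font/a')]
# ... when it does not, the name is the font element's own text.
plain = [f.text for f in root.findall('table/tr[2]/td[2]/font')]
print(linked)  # ['Craig R. Baxley']
print(plain)   # ['Vincent Patrick']
```

The same two-shapes problem is what forces the /font/text() versus /font/a/text() variants above.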
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"]
- everything I have works EXCEPT that I also end up getting some digits from other mp_box_content
. Also I have added numerous try:
, except:
statements in order to get everything (actors, directors, producers who both have and do not have links associated with them). For example, the following is my Scrapy
code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover for the facts that: the first actor may not have a link associated with him/her and subsequent actors do, the first actor may have a link associated with him/her but the rest may not.
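One way to avoid that branching entirely is to collect every descendant text node of the cell in one pass, whichever element happens to hold it. A minimal sketch using the stdlib's ElementTree (standing in for hxs here, with made-up sample markup):

```python
import xml.etree.ElementTree as ET

# Invented cell: the first actor is linked, the second is plain text after a <br/>.
td = ET.fromstring('<td><font><a href="/p1">Carl Weathers</a><br/>Sharon Stone</font></td>')

# itertext() yields all text nodes under the cell, linked or not,
# so one expression replaces both try/except branches.
names = "".join(td.itertext())
print(names)  # Carl WeathersSharon Stone
```

Note the names run together; splitting them apart again is a separate problem caused by the <br> separators.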
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposition using lxml.html
(and a bit of lxml.etree
) directly
First, I recommend you select the td[2]
cells by the text content of td[1]
, with expressions like .//tr[starts-with(td[1], "Director")]/td[2]
to account for "Director" or "Directors".
Second, testing various expressions with or without <font>
, with or without <a>
etc., makes the code difficult to read and maintain, and since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2])
to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode)
on the selected elements.
And for the <br>
issue with multiple names, what I generally do is modify the lxml
tree containing the targeted content to add a special formatting character to the <br>
elements' .text
or .tail
, for example a \n
, with one of lxml
's iter()
functions. This can be useful for other HTML block elements, like <hr>
for example.
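The marker trick can be shown in isolation before the full spider. A sketch with the stdlib's ElementTree (the answer uses lxml, but both expose iter() and .tail the same way; the <font> sample content is invented):

```python
import xml.etree.ElementTree as ET

MARKER = "|"
# Invented multi-name cell content with <br> separators between names.
font = ET.fromstring('<font>Carl Weathers<br/>Craig T. Nelson<br/>Sharon Stone</font>')

# Prepend the marker to the text that follows each <br>, i.e. its .tail ...
for br in font.iter("br"):
    br.tail = MARKER + (br.tail or "")

# ... then serialize to text and split on the marker to recover one name per entry.
names = "".join(font.itertext()).split(MARKER)
print(names)  # ['Carl Weathers', 'Craig T. Nelson', 'Sharon Stone']
```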
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]
    # locate the 2nd cell by the text content of the first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath(
        './/tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)
        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')
        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:
            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                    category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                     category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Composer")
            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)
            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
You can do similar things with HtmlXPathSelector
of course (with the string()
XPath function for example), but without modifying the tree for <br>
(how to do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']