How to extract a list of label/value pairs with Scrapy when HTML tags are missing


Problem description


I am currently processing a document with

<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
....

I can't figure out a clean XPath approach to this with Scrapy. Here is my best implementation:

hxs = HtmlXPathSelector(response)

section = hxs.select(..............)
values = section.select("text()[preceding-sibling::b/text()]")
labels = section.select("text()/preceding-sibling::b/text()")

but I am not comfortable with this approach, which matches the nodes of the two lists by index. I'd rather iterate through one list (values or labels) and query the matching node with a relative XPath, such as:

values = section.select("text()[preceding-sibling::b/text()]")
for value in values:
    value.select("/preceding-sibling::b/text()")

I have been tweaking this expression, but it always returns no matches.

UPDATE

I am looking for a robust method that will tolerate "noise", e.g.:

garbage1<br>
<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
garbage2<br>
<b> label3 </b>
value3 <br>
<div>garbage3</div>

Solution

Edit: sorry, I used lxml here, but it works the same with Scrapy's own selectors.

For the specific HTML you have given, this will work:

>>> s = """<b> label1 </b>
... value1 <br>
... <b> label2 </b>
... value2 <br>
... """
>>> 
>>> import lxml.html
>>> lxml.html.fromstring(s)
<Element span at 0x10fdcadd0>
>>> soup = lxml.html.fromstring(s)
>>> soup.xpath("//text()")
[' label1 ', '\nvalue1 ', ' label2 ', '\nvalue2 ']
>>> res = soup.xpath("//text()")
>>> for i in xrange(0, len(res), 2):
...     print res[i:i+2]
... 
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
>>> 
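The pairing step from the session above can be shown on its own. In the sketch below, the `//text()` result is hard-coded (copied from the session) so the positional grouping runs without lxml installed:

```python
# The //text() result from the session above, hard-coded for illustration.
texts = [' label1 ', '\nvalue1 ', ' label2 ', '\nvalue2 ']

# Group consecutive (label, value) pairs by stepping two at a time.
pairs = [texts[i:i + 2] for i in range(0, len(texts), 2)]
print(pairs)  # [[' label1 ', '\nvalue1 '], [' label2 ', '\nvalue2 ']]
```

Note that this relies on labels and values strictly alternating in document order, which is exactly why it breaks on the noisy input from the update.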

Edit 2:

>>> # soup re-parsed from the noisy HTML in the update
>>> bs = soup.xpath("//text()[preceding-sibling::b/text()]")
>>> for b in bs:
...     if b.getparent().tag == "b":
...         print [b.getparent().text, b]
... 
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
[' label3 ', '\nvalue3 ']
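The same idea as a self-contained script (a sketch assuming lxml is installed; `s` is the noisy sample from the update, and `pairs` is a name introduced here for illustration):

```python
import lxml.html

# The noisy sample from the update.
s = """garbage1<br>
<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
garbage2<br>
<b> label3 </b>
value3 <br>
<div>garbage3</div>
"""

soup = lxml.html.fromstring(s)

pairs = []
# Select text nodes that have a <b> sibling with text before them.
for value in soup.xpath("//text()[preceding-sibling::b/text()]"):
    # lxml returns "smart strings": for a tail text node, getparent()
    # is the element it trails. Keeping only <b> parents drops the
    # garbage (whose getparent() is e.g. a <br>) and exposes the
    # matching label as .text of that same <b>.
    if value.getparent().tag == "b":
        pairs.append([value.getparent().text, value])

print(pairs)
```

This is why the noise-tolerant filter works: each value is the tail of its own `<b>`, so the label/value association is structural rather than positional.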

Also, for what it's worth: if you are looping over selected elements, you want to use "./foo" in the XPath inside the for loop, not "/foo".
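To illustrate that tip, a minimal sketch (assuming lxml; the element names here are made up for the example). Inside the loop, "./p" searches below the current element, while "/p" is absolute and restarts from the document root:

```python
from lxml import etree

# Two <div>s, each with its own <p>.
root = etree.fromstring("<root><div><p>a</p></div><div><p>b</p></div></root>")

for div in root.xpath("//div"):
    relative = div.xpath("./p/text()")  # this div's own <p>: ['a'], then ['b']
    absolute = div.xpath("/p/text()")   # absolute: looks for a root named <p>, so []
    print(relative, absolute)
```

The empty result from the absolute path is the same silent "no matches" behaviour the question ran into with "/preceding-sibling::b/text()".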
