Fetching data using Python & lxml

Question

My HTML looks like the sample below. I would like to get the text inside each <span class="zzAggregateRatingStat">. For the example given below, I would get 3 and 5.

I am using Python 2.7 and lxml for this.

<div class="pp-meta-review">
<span class="zrvwidget" style="">
    <span g:inline="true" g:type="NumUsersFoundThisHelpful" g:hideonnoratings="true" g:entity.annotation.groups="maps"    g:entity.annotation.id="http://maps.google.com/?q=Central+Kia+of+Irving++(972)+659-2204+loc:+1600+East+Airport+Freeway,+Irving,+TX+75062&gl=US&sll=32.83624,-96.92526" g:entity.annotation.author="AIe9_BH8MR-1JD_4BhwsKrGCazUyU5siqCtjchckDcg5BAl5rOLd9nvhJJDTrtjL-xFI8D42bD_7">
        <span class="zzNumUsersFoundThisHelpfulActive" zzlabel="helpful">
            <span>
                <span class="zzAggregateRatingStat">3</span>
            </span>
            <span>
                <span>&nbsp;</span>
                      out of
                <span>&nbsp;</span>
            </span>
            <span>
                <span class="zzAggregateRatingStat">5</span>
            </span>
            <span>
                <span>&nbsp;</span>
                    people found this review helpful.
            </span>
       </span>
   </span>
</span>
</div>

Recommended answer

The following code works with your input:

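import lxml.html

# Parse the saved page with lxml's lenient HTML parser
root = lxml.html.parse('text.html').getroot()
# Select every span whose class attribute is exactly "zzAggregateRatingStat"
for span in root.xpath('//span[@class="zzAggregateRatingStat"]'):
    print(span.text)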

It prints:

3
5

I prefer lxml's XPath support over CSSSelector, though both can do the job.

ChrisP's example prints 3, but running it on your actual input raises an error:

$ python chrisp.py
Traceback (most recent call last):
  File "chrisp.py", line 6, in <module>
    doc = fromstring(text)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 3, column 210

The EntityRef error most likely comes from the unescaped & in the URL attribute (for example &gl=US), which a strict XML parser rejects. ChrisP's code can be changed to use lxml.html.fromstring, which is a more lenient parser, instead of lxml.etree.fromstring.

With that change, it prints 3.
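The difference between the two parsers can be sketched as follows. This is an illustration and not ChrisP's original code, which is not shown here. SNIPPET is a reduced stand-in for the question's markup that keeps a bare & inside an attribute, and the helper names are hypothetical.

```python
import lxml.etree
import lxml.html

# Reduced stand-in for the question's markup; the bare "&" in the URL
# (not written as "&amp;") is what a strict XML parser objects to.
SNIPPET = (
    '<div class="pp-meta-review">'
    '<span title="http://maps.google.com/?q=Central+Kia&gl=US">'
    '<span class="zzAggregateRatingStat">3</span> out of '
    '<span class="zzAggregateRatingStat">5</span>'
    '</span></div>'
)


def strict_parse_fails(markup):
    """Return True if the strict XML parser raises XMLSyntaxError."""
    try:
        lxml.etree.fromstring(markup)
    except lxml.etree.XMLSyntaxError:
        return True
    return False


def extract_ratings(markup):
    """Parse with the lenient HTML parser and return the rating texts."""
    root = lxml.html.fromstring(markup)
    return root.xpath('//span[@class="zzAggregateRatingStat"]/text()')


if __name__ == "__main__":
    print(strict_parse_fails(SNIPPET))
    print(extract_ratings(SNIPPET))
```

Run as a script, this should report that the strict parse fails and then list both rating values, since the XPath query here collects every matching span rather than only the first.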
