How to prevent lxml.etree.HTML(data) from crashing on certain types of data?


Problem description

I'm running etree.HTML(data), as shown below, on lots of different data contents. With one specific data content, however, lxml.etree.HTML will not parse it; instead it goes into an infinite loop and consumes 100% CPU.

Does anyone know exactly what in the data below is causing this? And more importantly, how can I prevent this from happening across an unbounded variety of random, broken data?

Edit: It turns out this is a bug in libxml2 version 2.7.8 and below (at least), the library that lxml is built on. After updating to libxml2 2.9.0, the bug is gone.

Edit: I know the script below is itself an infinite loop, but that's not the bad behaviour I'm getting. It runs fine (as an infinite loop) with healthy data content. With unhealthy data content, like the one below, the loop STOPS, RAM starts filling up, and when it's full, all CPU goes into WAIT state. See this question for the original debug.

#!/usr/bin/python
# -*- coding: utf-8 -*-
#

import sys
from lxml import etree



data = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta charset="UTF-8">
    <title>The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked -- Grub Street New York</title>

    <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://feedproxy.google.com/nymag/grubstreet" />



    <meta name="Headline" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
    <meta name="keywords" content="april bloomfield, el gordo, frank bruni, gordon ramsay, lawsuits, lists, marcus samuelsson, mario batali, shitlist, spotted pig, sued" />

    <meta name="description" content="Racism, fat-shaming, and vegetarian trickery." />

    <meta name="Byline" content="Sierra Tishgart" />
    <meta name="Type_of_Feature" content="" />
    <meta name="Issue_Date" content="March  8, 2013 12:50 PM" />
    <meta name="related_stories" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
    <meta name="document_type" content="Blog" />
    <meta name="category" content="Lists" />

    <link rel="image_src" href="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg" />
    <link rel="canonical" href="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" id="canonical" />

    <script>
        var canonicalUrl = "http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html";
    </script>



    <meta name="content.tags.primary" content=";network - Grub Street,;city - New York City,;tag - lists" />
    <meta name="content.tags" content=";tag - april bloomfield,;tag - el gordo,;tag - frank bruni,;tag - gordon ramsay,;tag - lawsuits,;tag - marcus samuelsson,;tag - mario batali,;tag - shitlist,;tag - spotted pig,;tag - sued" />
    <meta name="content.hierarchy" content="New York City:Grub Street" />
    <meta name="content.type" content="Blog" />
    <meta name="content.subtype" content="Blog Entry" />    


    <meta property="fb:app_id" content="206283005644" />
    <meta property="og:title" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
    <meta property="og:description" content="Racism, fat-shaming, and vegetarian trickery." /> 
    <meta property="og:image" content="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg"/>
    <meta property="og:url" content="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" />
    <meta property="og:type" content="article" />
    <meta property="og:site_name" content="Grub Street New York" />





    <meta name="viewport" content="width=1020">

<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/grubstreet-core.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/section/daily/slideshow.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/echo.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/loginRegister.css" media="all" />
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/advertising.css" media="all" />
<link rel="shortcut icon" href="http://images.nymag.com/gfx/grubst/favicon.ico" />

<style type="text/css">
#adsplashtop,#pushdown {padding:5px 5px;}
#pushdown {border-top:1px solid #737373}
</style>











    <!--[if IE 6]>
    <link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie6.css" type="text/css" media="screen, projection" />
<![endif]-->

<!--[if IE 7]>
    <link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie7.css" type="text/css" media="screen, projection" />
<![endif]-->




    <script type="text/javascript">
        var NYM = {};
        NYM.config = {};
        NYM.config.membership = {
            "service":"nym"
        };
        NYM.config.advertising = {
            "sitename":"nym.grubstreet"
        };

    </script>




<script type="text/javascript">
    var date = 'March 12, 2013 12:42:38';
    var currDate=new Date(date);
    var GRUBST = {};
    if (!NYM) {  
        var NYM = {};
        NYM.config = {};
        NYM.config.membership = {
            "service":"nym"
        };
        NYM.config.advertising = {
             "sitename":"nym.grubstreet"
        };
    }
</script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/modernizr-1.7.min.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/jquery-ui-1.8.2.custom.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/ad_manager.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/js/2/global.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/skinTakeover.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/grubstreet-controls.js"></scr
'''







n = 0
while True:
    n += 1

    tree = etree.HTML(data)
    m = tree.xpath("//meta[@property]")

    # print() with a single formatted string behaves the same on Python 2 and 3
    print("- %d" % n)
    for i in m:
        print(n)
        # print(i.attrib['property'], i.attrib['content'])

To quickly print the versions involved, you can use:

import sys
from lxml import etree

print("%-20s: %s" % ('Python',           sys.version_info))
print("%-20s: %s" % ('lxml.etree',       etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used',      etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled',  etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used',     etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

I've got:

OS                  : Ubuntu 12.10 (AWS)
Python              : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree          : (3, 1, 0, 0)
libxml used         : (2, 7, 8)
libxml compiled     : (2, 7, 8)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)
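
Since the hang appears tied to the libxml2 version rather than to lxml itself, a script could check the version tuple at startup. The following is only a minimal sketch: the helper name `libxml_older_than` is hypothetical, and the `(2, 9, 0)` threshold is an assumption taken from the edit above (the version where the problem went away for me), not a documented fix version.

```python
def libxml_older_than(version, minimum=(2, 9, 0)):
    """Return True if a libxml2 version tuple is below `minimum`.

    `minimum` defaults to (2, 9, 0), which is an assumption based on the
    report above, not an officially documented fix version.
    """
    return tuple(version[:3]) < tuple(minimum)


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        # LIBXML_VERSION is the libxml2 version in use at runtime
        if libxml_older_than(etree.LIBXML_VERSION):
            print("warning: libxml2 %s may hang on truncated HTML"
                  % (etree.LIBXML_VERSION,))
```

With the versions listed above, `libxml_older_than((2, 7, 8))` is true, so the warning would be printed.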

Solution

Here is a way to parse partial HTML using lxml's feed parser interface. It seems to work around the hanging problem, which seems to occur with libxml versions (2, 7, 8) or older:

    parser = LH.HTMLParser()
    parser.feed(data)
    root = parser.close()
    m = root.xpath('//meta[@property]')


import sys
import lxml.html as LH
import lxml.etree as ET

data = '''
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6"> <![endif]-->
<!--[if IE 7]>    <html class="ie7"> <![endif]-->
<!--[if IE 8]>    <html class="ie8"> <![endif]-->
<!--[if gt IE 8]><!--> <html> <!--<![endif]-->
<head profile="http://gmpg.org/xfn/11">
 <meta charset="UTF-8">
 <title>
     Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone: The Bureau of Investigative Journalism  </title>

 <meta name="description" content="Drone data has been wiped from the Air Force website.">

 <meta name="generator" content="Magicalia 2010" />
 <meta name="google-site-verification" content="bGFVI6kAZGjMNNiS6LGvBDWSGydwyWQI3gogCD4xP50" />

 <link href="http://cdn-images.mailchimp.com/embedcode/slim-081711.css" rel="stylesheet" type="text/css">
 <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/screen.css" type="text/css" media="screen, projection" />
 <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/print.css" type="text/css" media="print" />
 <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/style.css?3" type="text/css" media="screen, projection" />

 <!--[if IE]>
   <link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/lib/ie.css" type="text/css" media="screen, projection" />
 <![endif]-->

 <!--[if lt IE 7]>
   <script defer type="text/javascript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/pngfix.js"></script>
 <![endif]-->

 <!--[if gte IE 5.5]>
   <script language="javaScript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/dhtml.js" type="text/javaScript"></script>
 <![endif]-->

 <link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism RSS Feed" href="http://www.thebureauinvestigates.com/feed/" />
 <link rel="pingback" href="http://www.thebureauinvestigates.com/xmlrpc.php" />

 <link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism &raquo; Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone Comments Feed" href="http://www.thebureauinvestigates.com/2013/03/12/erased-us-data-shows-1-in-4-missiles-in-afghan-airstrikes-now-fired-by-drone/feed/" />
<link rel='stylesheet' id='mailchimp-css'  href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/mailchimp.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='donate-css'  href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/donate.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='tubepress-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/css/tubepress.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='NextGEN-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/css/nggallery.css?ver=1.0.0' type='text/css' media='screen' />
<link rel='stylesheet' id='shutter-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/shutter/shutter-reloaded.css?ver=1.3.4' type='text/css' media='screen' />
<link rel='stylesheet' id='stbCSS-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/wp-special-textboxes/css/wp-special-textboxes.css.php?ver=4.3.72' type='text/css' media='all' />
<link rel='stylesheet' id='grid-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/grid.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='reveal-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/reveal.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='app-css'  href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/app.css?ver=3.5.1' type='text/css' media='all' />
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/js/tubepress.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/jquery.cycle.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/search.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/superfish.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/supersubs.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/home.js?ver=3.5.1'></sc
'''

if __name__ == '__main__':

    print("%-20s: %s" % ('Python',           sys.version_info))
    print("%-20s: %s" % ('lxml.etree',       ET.LXML_VERSION))
    print("%-20s: %s" % ('libxml used',      ET.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled',  ET.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used',     ET.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', ET.LIBXSLT_COMPILED_VERSION))

    n = 0
    while True:
        n += 1
        print("- %d" % n)
        parser = LH.HTMLParser()
        parser.feed(data)
        root = parser.close()
        m = root.xpath('//meta[@property]')
        for i in m:
            print(n)

yields

% test.py
Python              : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree          : (2, 3, 0, 0)
libxml used         : (2, 7, 8)
libxml compiled     : (2, 7, 8)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)
- 1
- 2
- 3
- 4
- 5
...
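
For the more general worry in the question (arbitrary broken input that might hang some parser version), one defensive option is to parse in a separate process with a time limit, so a hang cannot take the main program down. This is a rough sketch, assuming Python 3.7 or newer; `run_isolated` and `CHILD_CODE` are hypothetical names introduced here, and the child program simply reuses the feed-parser calls shown above.

```python
import subprocess
import sys

# Child program: reads HTML from stdin, parses it with the feed-parser
# approach shown above, and prints the number of <meta property=...> tags.
CHILD_CODE = r"""
import sys
import lxml.html as LH
parser = LH.HTMLParser()
parser.feed(sys.stdin.read())
root = parser.close()
print(len(root.xpath('//meta[@property]')))
"""


def run_isolated(code, stdin_text, timeout_seconds):
    """Run `code` in a fresh Python process.

    Returns the child's stdout as a string, or None if the child timed out
    or exited with a non-zero status.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

A call such as `run_isolated(CHILD_CODE, data, 10)` would then return the count as text for healthy input and `None` if the parse did not finish within ten seconds. Starting a process per document is slow, so this fits best as a fallback for inputs that are already suspect.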
