How can I grab CData out of BeautifulSoup


Problem description

I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block.

I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down, as I'm a Python novice. Specifically, I want to get at the two different types of data hidden in the CData statement. The first is just text; I'm pretty sure I can throw a regex at it and get what I need. For the second type, if I could drop the data that has HTML elements into its own BeautifulSoup, I could parse that.

I'm just learning Python and BeautifulSoup, so I'm struggling to find the magical incantation that will give me just the CData by itself.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">  
<head>  
<title>
   Cows and Sheep
  </title>
</head>
<body>
 <div id="main">
  <div id="main-precontents">
   <div id="main-contents" class="main-contents">
    <script type="text/javascript">
       //<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
       <!--ts-->
       get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
       <!--yy-->
       <span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
       <!--?5695:5:40:45-->
       ';
        //]]>
      </script>
     </div>
     </div>
    </div>
 </body>
</html>

Solution

BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:

import BeautifulSoup  # BeautifulSoup 3 module (Python 2)

txt = '''<foobar>We have
       <![CDATA[some data here]]>
       and more.
       </foobar>'''

soup = BeautifulSoup.BeautifulSoup(txt)
# findAll(text=True) yields every string node; CData nodes are among them.
for cd in soup.findAll(text=True):
    if isinstance(cd, BeautifulSoup.CData):
        print 'CData contents: %r' % cd

In your case, of course, you could search only the subtree starting at the div with the 'main-contents' ID, rather than the whole document tree.
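
One caveat for the page in the question: the CDATA markers sit inside a <script> element behind a // comment, and HTML parsers commonly hand script contents back as one plain string rather than as a CData node. If that happens, a regular expression over the script text is a possible fallback. The following is only a minimal sketch using Python's standard re module. script_text is a shortened, hypothetical stand-in for the script body, and the names extract_cdata and parse_entries are made up for illustration. It assumes field values never contain } or escaped quotes.

```python
import re

# Hypothetical, shortened stand-in for the <script> body from the question.
script_text = """
       //<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b></td></tr></table>
       <!--ts-->
       ';
        //]]>
"""

# Everything between "<![CDATA[" and the first following "]]>".
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Plain-text entries of the form _[1234]={key:value,...}.
ENTRY_RE = re.compile(r"_\[(\d+)\]=\{(.*?)\}")
# key:'quoted string' or key:123 pairs inside one entry.
FIELD_RE = re.compile(r"(\w+):('[^']*'|\d+)")
# The quoted HTML string assigned to .cowmeat_enus (assumed field name).
HTML_RE = re.compile(r"\.cowmeat_enus\s*=\s*'(.*?)';", re.DOTALL)


def extract_cdata(text):
    """Return the payload of every CDATA section found in text."""
    return CDATA_RE.findall(text)


def parse_entries(payload):
    """Map each numeric id to a dict of its fields (all values as strings)."""
    entries = {}
    for key, body in ENTRY_RE.findall(payload):
        fields = FIELD_RE.findall(body)
        entries[int(key)] = {name: value.strip("'") for name, value in fields}
    return entries


payloads = extract_cdata(script_text)
entries = parse_entries(payloads[0])

# Second kind of data: the embedded HTML fragment, as a string of its own.
html_fragment = HTML_RE.search(payloads[0]).group(1).strip()

print(entries[7654]["cowname_enus"])
print(html_fragment)
```

The string in html_fragment can then be passed to BeautifulSoup as a document of its own, which covers the second kind of data the question asks about.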
