How can I grab CData out of BeautifulSoup


Problem description


I have a website that I'm scraping whose structure is similar to the following. I'd like to be able to grab the info out of the CData block.

I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down, as I'm a Python novice. Specifically, I want to get at the two different types of data hidden in the CData statement. The first is just text, and I'm pretty sure I can throw a regex at it and get what I need. For the second type, if I could drop the data that has HTML elements into its own BeautifulSoup object, I could parse that.

I'm just learning Python and BeautifulSoup, so I'm struggling to find the magical incantation that will give me just the CData by itself.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">  
<head>  
<title>
   Cows and Sheep
  </title>
</head>
<body>
 <div id="main">
  <div id="main-precontents">
   <div id="main-contents" class="main-contents">
    <script type="text/javascript">
       //<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
       <!--ts-->
       get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
       <!--yy-->
       <span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
       <!--?5695:5:40:45-->
       ';
        //]]>
      </script>
     </div>
     </div>
    </div>
 </body>
</html>

Solution

One thing you need to be careful of when grabbing CData with BeautifulSoup is not to use the lxml parser.

By default, the lxml parser will strip CDATA sections from the tree and replace them with their plain text content; this behavior is described in the lxml documentation.
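The contrast is easy to see side by side. Here is a small sketch (the `foo`/`bar` element names are made up for illustration, and the `"xml"` feature requires the optional lxml package to be installed):

```python
import bs4
from bs4 import BeautifulSoup

s = "<foo><bar><![CDATA[hello]]></bar></foo>"

# html.parser keeps the CDATA section as a distinct bs4.CData node:
html_soup = BeautifulSoup(s, "html.parser")
print(html_soup.find(text=lambda t: isinstance(t, bs4.CData)))  # hello

# lxml resolves the CDATA section into plain text, so the same
# search comes back empty even though the text itself survives:
xml_soup = BeautifulSoup(s, "xml")
print(xml_soup.find(text=lambda t: isinstance(t, bs4.CData)))  # None
print(xml_soup.bar.string)  # hello
```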

# Trying it with html.parser


>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>> 
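For the page in the question itself, note that html.parser treats everything inside a `<script>` tag as raw text rather than as a CData node, so the simplest route is to grab the whole script string and then split out the two kinds of data the way the question suggests: a regex for the plain values, and a second BeautifulSoup for the embedded HTML. A rough sketch, using a trimmed-down version of the sample page:

```python
import re
from bs4 import BeautifulSoup

# A trimmed-down version of the page from the question.
page = '''<html><body><div id="main">
<script type="text/javascript">
//<![CDATA[
var _ = g_cow;
_[39654].cowmeat_enus = '<b class="q4">cows rule!</b><span class="q0">Cow Bonus: +9 Cow Power</span>';
//]]>
</script>
</div></body></html>'''

soup = BeautifulSoup(page, "html.parser")
# The script's contents, CDATA markers and all, as one raw string:
script_text = soup.find("script").string

# Second data type: pull out the quoted HTML fragment with a regex,
# then hand it to its own BeautifulSoup object to parse normally.
match = re.search(r"cowmeat_enus = '(.*?)';", script_text, re.S)
inner = BeautifulSoup(match.group(1), "html.parser")
print(inner.find("span", class_="q0").get_text())  # Cow Bonus: +9 Cow Power
```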
