Python XML parsing not working for some sites


Problem description

I have a very basic XML parser based on the tutorial provided here, for the purpose of reading RSS feeds in Python.

import urllib
from xml.dom import minidom, Node

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    if (url_info):
        xmldoc = minidom.parse(url_info)
    if (xmldoc):
        # Only the direct children of the root element are inspected here.
        for item_node in xmldoc.documentElement.childNodes:
            if (item_node.nodeName == "item"):
                PrintNodeItems(item_node, ["title","link"])
    else:
        print "error"

def PrintNodeItems(XmlNode, items):
    for item_node in XmlNode.childNodes:
        if item_node.nodeName in items:
            PrintNodesText(item_node)

def PrintNodesText(XmlNode):
    text = ""
    for text_node in XmlNode.childNodes:
        if(text_node.nodeType == Node.TEXT_NODE):
            text = text_node.nodeValue
    if (len(text)>0):
        print text
        print ""

I have tested the GetRSS function on the address provided in the tutorial (http://rss.slashdot.org/Slashdot/slashdot), and it works just fine, giving me the correct output. However, my intention when learning how to write this module was to use it for reading the RSS feed at RedLetterMedia (http://redlettermedia.com/feed/). When I call GetRSS on that address in the Python shell, I get a blank line instead of the expected results. I also tested it on CNN's "World" RSS feed and got no results there either. I have used urllib.urlopen on all addresses, and they all appear to use the same format for their nodes and child nodes (<item><title><description><link></item>).

I figure, as was the case for my previous question, there is probably something very obvious I am missing. Does anybody know what that is?

For the record, my error message has not come up at all, but maybe that is because I integrated it into the code incorrectly; I would not put it past me.

Update: Rewrote the code from scratch using multiple answered questions on Stack Overflow. Works like a charm!

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    if (url_info):
        xmldoc = minidom.parse(url_info)
    if (xmldoc):
        # Search inside each enclosing node rather than the whole document,
        # so that every item link is printed once.
        for channel in xmldoc.getElementsByTagName('channel'):
            for item in channel.getElementsByTagName('item'):
                for a in item.getElementsByTagName('link'):
                    linktext = a.firstChild.data
                    print linktext

def main():
    GetRSS('http://redlettermedia.com/feed/')

Recommended answer

The error is here:

for item_node in xmldoc.documentElement.childNodes:
    if (item_node.nodeName == "item"):

There is no item element directly under the root, just a channel. The item elements sit one level further down, inside channel, so a loop over the root's direct children never reaches them. I found this out by just printing all the values of nodeName in the loop.
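The diagnosis and the fix described in the answer can be sketched as follows. This is a minimal illustration, not the asker's or the answerer's actual code: it is written for Python 3 (the code in the question is Python 2), it parses a small inline feed instead of fetching a URL, and the names SAMPLE_RSS, root_child_names, node_text, and item_fields are hypothetical helpers introduced only for this example. The sample feed is an assumption about what a typical RSS 2.0 document looks like, with item nested inside channel.

```python
from xml.dom import minidom

# Hypothetical sample feed, shaped like a typical RSS 2.0 document:
# <item> elements live inside <channel>, not directly under the root.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def root_child_names(xml_text):
    """Names of the element nodes directly under the root (the diagnostic step)."""
    doc = minidom.parseString(xml_text)
    return [n.nodeName for n in doc.documentElement.childNodes
            if n.nodeType == n.ELEMENT_NODE]


def node_text(node):
    """Concatenate the text and CDATA children of a node."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType in (c.TEXT_NODE, c.CDATA_SECTION_NODE))


def item_fields(xml_text, fields=("title", "link")):
    """Return one dict per <item>, found at any depth below the root."""
    doc = minidom.parseString(xml_text)
    results = []
    for item in doc.getElementsByTagName("item"):
        entry = {}
        for child in item.childNodes:
            if child.nodeType == child.ELEMENT_NODE and child.nodeName in fields:
                entry[child.nodeName] = node_text(child).strip()
        results.append(entry)
    return results


if __name__ == "__main__":
    # Shows why the original loop found nothing: the root has no <item> child.
    print(root_child_names(SAMPLE_RSS))
    for entry in item_fields(SAMPLE_RSS):
        print(entry.get("title"), entry.get("link"))
```

Because getElementsByTagName searches all descendants, item_fields does not depend on how deeply the items are nested, which is the property the original loop over documentElement.childNodes lacked. To try it on a live feed in Python 3, the bytes returned by urllib.request.urlopen(url).read() can be passed in place of SAMPLE_RSS.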