Parsing an XML/XHTML document but ignoring errors in C#


Question

I'm writing some little applications that parse the source of a few web pages, extract some data, and save it in another format. Specifically, some of my banks don't provide downloads of transactions/statements, but they do provide access to those statements on their websites.

I've done one fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example, there is whitespace before the <?xml?> declaration, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).

Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately, an exception).
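To make the failure concrete, here is a minimal sketch that reproduces it. The markup string is a made-up fragment imitating the two problems described above, not the bank's actual page source.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        // Hypothetical fragment: whitespace before the XML declaration
        // and a doubled '=' in an attribute, as described in the question.
        string source =
            " <?xml version=\"1.0\"?><ul><li class==\"lastItem\">Last</li></ul>";

        var doc = new XmlDocument();
        try
        {
            // XmlDocument requires well-formed XML and rejects this input.
            doc.LoadXml(source);
        }
        catch (XmlException ex)
        {
            // The exception reports where the parser gave up.
            Console.WriteLine(
                $"Line {ex.LineNumber}, position {ex.LinePosition}: {ex.Message}");
        }
    }
}
```

The parser stops at the first well-formedness error it meets, so fixing one problem by hand only exposes the next one.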

My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at source (that's absolutely my attitude too), but there's roughly zero chance HSBC would change their website, which already works in most browsers, just for little old me.

Recommended answer

Take a look at the HTML Agility Pack. It allows you to extract elements of a non-XHTML-compliant web page through XPath, as if it were a well-formed XHTML document.
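As a rough sketch of what that might look like, assuming the HtmlAgilityPack NuGet package is referenced: the markup below is a made-up fragment imitating the problems in the question, and the `//li` XPath is only an illustration, not a query against the real page.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

class Program
{
    static void Main()
    {
        // Hypothetical fragment; not the bank's actual page source.
        string html =
            " <?xml version=\"1.0\"?>\n<html><body><ul>" +
            "<li>First</li><li class==\"lastItem\">Last</li>" +
            "</ul></body></html>";

        var doc = new HtmlDocument();
        // LoadHtml is designed to tolerate malformed markup
        // instead of throwing the way XmlDocument does.
        doc.LoadHtml(html);

        // Problems the parser noticed are collected here rather than thrown.
        foreach (HtmlParseError error in doc.ParseErrors)
            Console.WriteLine($"Line {error.Line}: {error.Reason}");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items != null)
        {
            foreach (HtmlNode li in items)
                Console.WriteLine(li.InnerText.Trim());
        }
    }
}
```

How the library interprets the malformed `class==` attribute is not something to rely on without checking, which is why the sketch selects by element name rather than by attribute value.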

And for the love of Kleene, don't try to regexp an HTML page with any kind of complexity!
