Parsing an XML/XHTML document but ignoring errors in C#
Question
I'm writing some little applications that parse the source of a few web pages, extract some data, and save it into another format. Specifically, some of my banks don't provide downloads of transactions/statements, but they do provide access to those statements on their websites.
I've got one working fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example, there is whitespace before the <?xml?> tag, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).
Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately, an exception).
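For illustration, here is a minimal sketch of the failure. The markup is a made-up snippet that imitates the two problems described above, not the actual HSBC source, and IsWellFormed is a hypothetical helper name.

```csharp
using System;
using System.Xml;

// Made-up input: whitespace before the XML declaration and "==" in an attribute.
string source = "  <?xml version=\"1.0\"?>\n<ul><li class==\"lastItem\">Balance</li></ul>";

Console.WriteLine(IsWellFormed(source)
    ? "Parsed without errors"
    : "XmlDocument rejected the markup");

// Hypothetical helper: true if XmlDocument can load the text, false if it throws.
static bool IsWellFormed(string xml)
{
    try
    {
        new XmlDocument().LoadXml(xml);
        return true;
    }
    catch (XmlException)
    {
        // XmlDocument is a strict XML parser and stops at the first error.
        return false;
    }
}
```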
My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at source - that's absolutely my attitude too - but there's roughly zero chance HSBC would change their website, which already works in most browsers, just for little old me.
Recommended answer
Take a look at the HTML Agility Pack. It allows you to extract elements of a non-XHTML-compliant web page through XPath, as if it were a well-formed XHTML document.
And for the love of Kleene, don't try to regexp an HTML page of any real complexity!
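As a rough sketch of the suggested approach, assuming the HtmlAgilityPack NuGet package is referenced: the markup below is a made-up snippet with the same kinds of defects as in the question, and the //li query is only an example. How the parser interprets the malformed class=="lastItem" attribute is not something this sketch relies on, so it selects by element name rather than by class.

```csharp
using System;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

// Made-up input: leading whitespace before the XML declaration and "==" in an attribute.
string source = "  <?xml version=\"1.0\"?>\n<html><body><ul>" +
    "<li>First</li><li class==\"lastItem\">Last</li></ul></body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(source); // tolerant parser: problems are recorded in doc.ParseErrors

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
if (items != null)
{
    foreach (HtmlNode item in items)
    {
        // DeEntitize turns entities such as &amp; back into plain characters.
        Console.WriteLine(HtmlEntity.DeEntitize(item.InnerText).Trim());
    }
}
```

Checking for null before iterating matters here, since an XPath query with no matches would otherwise cause a NullReferenceException.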