Repair HTML code in a C# or VB program
Problem description
Hi,
I have a program that has to process web pages quickly and accurately.
The problem with the C# WebBrowser parser is that it does not respect the hierarchy or parent/child priority. That is why I wrote my own HTML parser, which first converts the whole HTML document into a tree that preserves the hierarchy of tags. The user can then easily reach each tag through its parents.
The user can obtain a tag like this:
myHTML["html"]["body"]["div", "id", "content"]["div"]["table", "id", "production"]["tbody"]["tr", 0]["td", "class", "num"].InsideContent
Tags are looked up by their name (the first token in the tag head) together with either the value of one of their properties or the index of their occurrence.
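To make the lookup style concrete, here is a minimal sketch of how such a node type could expose the three indexer forms. The question does not show the parser's code, so the class name `Tag` and the members `Attributes` and `Children` are assumptions made for illustration; only `InsideContent` comes from the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical tree node; not the asker's actual class.
public class Tag
{
    public string Name { get; }
    public Dictionary<string, string> Attributes { get; } = new Dictionary<string, string>();
    public List<Tag> Children { get; } = new List<Tag>();
    public string InsideContent { get; set; } = "";

    public Tag(string name) { Name = name; }

    // tag["div"]: first child with the given name.
    public Tag this[string name] => this[name, 0];

    // tag["tr", 0]: the child at the given zero-based position among
    // children with that name.
    public Tag this[string name, int index] =>
        Children.Where(c => c.Name == name).ElementAt(index);

    // tag["div", "id", "content"]: first child with the given name whose
    // attribute has the given value.
    public Tag this[string name, string attribute, string value] =>
        Children.First(c => c.Name == name
            && c.Attributes.TryGetValue(attribute, out var v) && v == value);
}

public static class TagDemo
{
    // Builds a tiny document tree by hand, standing in for parser output.
    public static Tag BuildSample()
    {
        var document = new Tag("#document");
        var html = new Tag("html");
        var body = new Tag("body");
        var div = new Tag("div") { InsideContent = "hello" };
        div.Attributes["id"] = "content";
        document.Children.Add(html);
        html.Children.Add(body);
        body.Children.Add(div);
        return document;
    }

    public static void Main()
    {
        Tag myHTML = BuildSample();
        Console.WriteLine(myHTML["html"]["body"]["div", "id", "content"].InsideContent); // prints "hello"
    }
}
```

In this sketch a lookup that finds nothing throws an exception from `First` or `ElementAt`; a real parser might prefer to return null instead.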
But there is a problem. Some websites contain HTML errors, and these cause problems for my parser. For example:
<table>
<thead>
...
</thead>
<tbody>
...
</table>
Here the <tbody> tag is opened but never closed.
The parser runs into a problem here.
Does anyone know of an HTML corrector (usable from C# or VB) or any other way to solve this problem?
Recommended answer
First off, there is no need to write a custom parser, as the XML tools that come with the Framework (at least since 2.0) can do that for you. There are a lot of examples on how to do this on the web. No need to reinvent the wheel from first principles. :)
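As a rough illustration of what the Framework's XML tools offer, the sketch below loads markup with `XmlDocument` and picks a cell with an XPath query. It assumes the input is already well-formed XML (every tag closed, attributes quoted); `XmlDocument.LoadXml` throws an `XmlException` otherwise. The class and method names `XmlToolsDemo` and `FirstNumCell` are made up for the example.

```csharp
using System;
using System.Xml;

public static class XmlToolsDemo
{
    // Returns the text of the first <td class="num"> in the first row,
    // or null when no such cell exists.
    public static string FirstNumCell(string markup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup); // throws XmlException if the markup is not well-formed
        XmlNode cell = doc.SelectSingleNode("/table/tbody/tr[1]/td[@class='num']");
        return cell == null ? null : cell.InnerText;
    }

    public static void Main()
    {
        string markup =
            "<table><tbody><tr><td class=\"num\">42</td></tr></tbody></table>";
        Console.WriteLine(FirstNumCell(markup)); // prints "42"
    }
}
```

The XPath expression plays the same role as the chained indexers in the question: each step names a tag, and a predicate such as `[1]` or `[@class='num']` selects by position or by attribute value.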
I think this will also handle the missing tags problem; I'm not sure because I've never tried to do what you are doing. If not, you will need to apply an "implied end tag" rule manually. You know that <tbody> is inside a <table> block. When you get to that block's end tag, assume that there is a </tbody> just before it. You can use this rule to close any inner block when you reach the end tag of an outer block.
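The implied end tag rule can be sketched with a stack of open tags. This is one possible way to apply it, not a full HTML corrector: it finds tags with a simple regular expression (so attribute values containing `>` would confuse it), treats a small hard-coded set of void elements as never needing a close, and silently drops end tags that have no matching open tag. The names `ImpliedEndTagDemo` and `CloseUnclosedTags` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ImpliedEndTagDemo
{
    // Group 1: "/" for an end tag. Group 2: tag name. Group 3: "/" for <x/>.
    private static readonly Regex TagPattern =
        new Regex(@"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>");

    // Elements that never take an end tag (partial list, for the sketch only).
    private static readonly HashSet<string> VoidElements =
        new HashSet<string> { "br", "hr", "img", "input", "link", "meta" };

    public static string CloseUnclosedTags(string html)
    {
        var open = new Stack<string>();
        var output = new StringBuilder();
        int position = 0;

        foreach (Match m in TagPattern.Matches(html))
        {
            // Copy the text between the previous tag and this one unchanged.
            output.Append(html, position, m.Index - position);
            position = m.Index + m.Length;

            string name = m.Groups[2].Value.ToLowerInvariant();
            bool isEndTag = m.Groups[1].Value == "/";
            bool isSelfClosing = m.Groups[3].Value == "/";

            if (!isEndTag)
            {
                if (!isSelfClosing && !VoidElements.Contains(name))
                    open.Push(name);
                output.Append(m.Value);
            }
            else if (open.Contains(name))
            {
                // Outer end tag reached: close every inner tag still open.
                while (open.Peek() != name)
                    output.Append("</").Append(open.Pop()).Append('>');
                open.Pop();
                output.Append(m.Value);
            }
            // Otherwise: stray end tag with no open tag, dropped in this sketch.
        }

        output.Append(html, position, html.Length - position);

        // Close whatever is still open at the end of the input.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append('>');
        return output.ToString();
    }

    public static void Main()
    {
        string broken = "<table><thead></thead><tbody></table>";
        Console.WriteLine(CloseUnclosedTags(broken));
        // prints "<table><thead></thead><tbody></tbody></table>"
    }
}
```

Applied to the example from the question, the missing `</tbody>` is inserted just before `</table>`, so the tags are balanced again before the tree is built.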