Repair HTML code in a C# or VB program


Problem description




Hi,
I have a program that has to process web pages quickly and accurately.
The problem with the C# WebBrowser parser is that it does not respect the tag hierarchy or parent/child precedence. That is why I wrote my own HTML parser, which first converts the whole HTML document into a tree that mirrors the hierarchy of tags. The user can then reach any tag easily through its parents.
A tag can be obtained like this:

myHTML["html"]["body"]["div", "id", "content"]["div"]["table", "id", "production"]["tbody"]["tr", 0]["td", "class", "num"].InsideContent










Tags are looked up by name (the first word in the opening tag), and then either by the value of one of their attributes or by the index of their occurrence.
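The asker's parser class is not shown in the post, so the following is only a minimal sketch of how such a lookup could be exposed through C# indexers. The `Tag` type, its `Add` and `With` helpers, and `Demo.BuildSample` are all hypothetical names introduced for illustration; only `InsideContent` and the indexer shapes come from the line above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node type; the real class from the question is not shown.
public class Tag
{
    public string Name { get; }
    public string InsideContent { get; set; } = "";
    public Dictionary<string, string> Attributes { get; } = new Dictionary<string, string>();
    public List<Tag> Children { get; } = new List<Tag>();

    public Tag(string name) { Name = name; }

    // Helpers that return this tag so a small tree can be built fluently.
    public Tag Add(Tag child) { Children.Add(child); return this; }
    public Tag With(string attribute, string value) { Attributes[attribute] = value; return this; }

    // tag["tbody"]: first child with the given name.
    public Tag this[string name] => this[name, 0];

    // tag["tr", 0]: the child with the given name at a zero-based occurrence index.
    public Tag this[string name, int index] =>
        Children.Where(c => NameMatches(c, name)).ElementAtOrDefault(index)
        ?? throw new KeyNotFoundException($"No <{name}> child at index {index}.");

    // tag["td", "class", "num"]: first child with the given name and attribute value.
    public Tag this[string name, string attribute, string value] =>
        Children.FirstOrDefault(c => NameMatches(c, name)
            && c.Attributes.TryGetValue(attribute, out var v) && v == value)
        ?? throw new KeyNotFoundException($"No <{name}> child with {attribute}=\"{value}\".");

    // HTML tag names are case-insensitive.
    private static bool NameMatches(Tag tag, string name) =>
        string.Equals(tag.Name, name, StringComparison.OrdinalIgnoreCase);
}

public static class Demo
{
    // Builds <table id="production"><tbody><tr><td class="num">42</td></tr></tbody></table>.
    public static Tag BuildSample() =>
        new Tag("table").With("id", "production")
            .Add(new Tag("tbody")
                .Add(new Tag("tr")
                    .Add(new Tag("td") { InsideContent = "42" }.With("class", "num"))));
}
```

With this sketch, `Demo.BuildSample()["tbody"]["tr", 0]["td", "class", "num"].InsideContent` walks the tree one level per indexer call, in the same style as the chained expression in the question. Throwing on a missing child is a design choice made here; returning null instead would let callers probe for optional tags.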

But there is a problem. Some web sites contain HTML errors, and these break my parser. For example:

<table>
    <thead>
    ...
    </thead>
    <tbody>
    ...
</table>  








Here the <tbody> tag is opened but never closed.
The parser runs into trouble at this point.
Does anyone know of an HTML corrector (usable from C# or VB), or any other way to solve this problem?

Recommended answer

First off, there is no need to write a custom parser, as the XML tools that come with the .NET Framework (at least since 2.0) can do that for you. There are plenty of examples of how to do this on the web. No need to reinvent the wheel from first principles. :)
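As a rough illustration of the framework XML tools, here is a sketch using `System.Xml.XmlDocument` with an XPath query. The `XmlToolsSketch` class and `TryRead` method are hypothetical names. One caveat: `XmlDocument.LoadXml` only accepts well-formed XML and throws `XmlException` otherwise, so this route works for pages that are valid XHTML but rejects markup with unclosed tags instead of repairing it.

```csharp
using System;
using System.Xml;

public static class XmlToolsSketch
{
    // Returns the text of the first node matching the XPath expression, or
    // null when the markup is not well-formed XML or nothing matches.
    public static string TryRead(string markup, string xpath)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
        }
        catch (XmlException)
        {
            // LoadXml rejects markup whose tags do not nest and close properly.
            return null;
        }
        return doc.SelectSingleNode(xpath)?.InnerText;
    }
}
```

For example, `TryRead("<table><tbody><tr><td class=\"num\">42</td></tr></tbody></table>", "/table/tbody/tr[1]/td[@class='num']")` reads the cell text, while passing `"<table><thead></thead><tbody></table>"` returns null because the `<tbody>` start tag does not match the `</table>` end tag.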

I think this will also handle the missing tags problem, but I'm not sure because I've never tried to do what you are doing. If not, you will need to apply an "implied end tag" rule manually. You know that <tbody> is inside a <table> block. When you reach that block's end tag, assume that there is a </tbody> just before it. You can use this rule to close any inner block when you reach the end tag of an outer block.
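The "implied end tag" rule can be sketched with a stack of open tag names. This is a minimal illustration under simplifying assumptions, not a full HTML repair tool: the `ImpliedEndTags` class and `Repair` method are hypothetical names, tags are found with a simple regular expression, and comments, scripts, void elements such as <br>, and attribute values containing ">" are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ImpliedEndTags
{
    // Matches <name ...>, </name> and <name ... />.
    private static readonly Regex TagPattern =
        new Regex(@"<(?<close>/)?(?<name>[A-Za-z][A-Za-z0-9]*)[^>]*?(?<self>/)?>");

    public static string Repair(string html)
    {
        var open = new Stack<string>();   // names of tags opened but not yet closed
        var output = new StringBuilder();
        int position = 0;

        foreach (Match m in TagPattern.Matches(html))
        {
            // Copy the text between the previous tag and this one unchanged.
            output.Append(html, position, m.Index - position);
            position = m.Index + m.Length;
            string name = m.Groups["name"].Value.ToLowerInvariant();

            if (!m.Groups["close"].Success)
            {
                // Start tag: emit it and remember it unless it is self-closing.
                output.Append(m.Value);
                if (!m.Groups["self"].Success) open.Push(name);
            }
            else if (open.Contains(name))
            {
                // End tag of an outer block: first close every inner tag
                // that is still open, then emit the end tag itself.
                while (open.Peek() != name)
                    output.Append("</").Append(open.Pop()).Append('>');
                open.Pop();
                output.Append(m.Value);
            }
            // An end tag with no matching start tag is dropped.
        }

        output.Append(html, position, html.Length - position);
        while (open.Count > 0)            // close whatever is still open at the end
            output.Append("</").Append(open.Pop()).Append('>');
        return output.ToString();
    }
}
```

Calling `ImpliedEndTags.Repair("<table><thead></thead><tbody></table>")` yields `<table><thead></thead><tbody></tbody></table>`: when `</table>` arrives, `tbody` is still on top of the stack, so its end tag is emitted first.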

