Parsing HTML document: Regular expression or LINQ?


Problem description

I'm trying to parse an HTML document and extract some elements (any links to text files).

The current strategy is to load an HTML document into a string. Then find all instances of links to text files. It could be any file type, but for this question, it's a text file.

The end goal is to have an IEnumerable<string> holding the link targets. That part is easy; parsing the data is the hard part.

<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>

The initial approaches are:

  • load the string into an XML document, and attack it in a LINQ to XML fashion.
  • create a regex, to look for a string starting with href=, and ending with .txt
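
For comparison, here is a minimal sketch of what the first approach might look like. It assumes the markup is well-formed XML, which the sample above is not (its <a> tags are never closed), so XDocument.Parse would throw on it. The class and method names (XmlLinkSketch, TxtLinks, IsWellFormed) are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class XmlLinkSketch
{
    // Returns href values ending in ".txt" from well-formed XML/XHTML markup.
    // XDocument.Parse throws XmlException when the markup is not well-formed.
    // Note: XML attribute names are case-sensitive, so "HREF" would not be found.
    public static List<string> TxtLinks(string markup)
    {
        XDocument doc = XDocument.Parse(markup);
        return doc.Descendants()
                  .Where(e => e.Name.LocalName.Equals("a", StringComparison.OrdinalIgnoreCase))
                  .Select(e => (string)e.Attribute("href"))
                  .Where(href => href != null &&
                                 href.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
                  .ToList();
    }

    // Reports whether the markup can be loaded as XML at all.
    public static bool IsWellFormed(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling XmlLinkSketch.IsWellFormed on a page before choosing this route shows quickly whether it is viable; real-world HTML often fails that check.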

The questions are:

  • what would that regex look like? I am a regex newbie, and this is part of my regex learning.
  • which method would you use to extract a list of tags?
  • which would be the most performant way?
  • which method would be the most readable/maintainable?


Update: Kudos to Matthew on the HTML Agility Pack suggestion. It worked just fine! The XPath suggestion works as well. I wish I could mark both answers as 'The Answer', but I obviously cannot. They are both valid solutions to the problem.

Here's a C# console app using the regex suggested by Jeff. It reads the string fine and will not include any href that does not end with .txt. With the given sample, it correctly does NOT include the .txt.snarg file in the results (as provided in the HTML string function).

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace ParsePageLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            // Print every .txt link found in the sample HTML.
            foreach (string link in GetAllLinksFromStringByRegex())
            {
                Console.WriteLine(link);
            }
        }

        static List<string> GetAllLinksFromStringByRegex()
        {
            string myHtmlString = BuildHtmlString();
            // The regex href="([^\"]*\.txt)" written as an escaped C# string literal.
            string txtFileExp = "href=\"([^\\\"]*\\.txt)\"";

            List<string> foundTextFiles = new List<string>();

            MatchCollection textFileLinkMatches = Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
            foreach (Match m in textFileLinkMatches)
            {
                foundTextFiles.Add(m.Groups[1].Value); // this is your captured group
            }

            return foundTextFiles;
        }

        // Sample page; the third link ends in .txt.snarg and must not be matched.
        static string BuildHtmlString()
        {
            return @"<html><head><title>Blah</title></head><body><br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div></body></html>";
        }
    }
}

Solution

I would recommend regex. Why?

  • Flexible (case-insensitivity, easy to add new file extensions, elements to check, etc.)
  • Fast to write
  • Fast to run

Regular expressions will not be hard to read, as long as you can WRITE regexes.

Use this as the regular expression:

href="([^"]*\.txt)"

Explanation:

  • It has parentheses around the filename, which will result in a "captured group" which you can access after each match has been found.
  • It has to escape the "." by using the regex escape character, a backslash.
  • It has to match any character EXCEPT double-quotes: [^"] until it finds the ".txt"

The pattern translates into an escaped C# string like this:

string txtExp = "href=\"([^\\\"]*\\.txt)\"";
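
The escaping is the easiest part to get wrong, so here is a small sketch comparing two ways of writing the plain pattern href="([^"]*\.txt)" in C#. The class name PatternLiterals and its members are hypothetical. The form given above puts an extra backslash before the quote inside the character class, which the regex engine reads as an escaped quote, so it should match the same inputs.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternLiterals
{
    // Regular string literal: \" is a quote and \\ is a single backslash.
    public const string Escaped = "href=\"([^\"]*\\.txt)\"";

    // Verbatim string literal: "" is a quote and backslashes are taken literally.
    public const string Verbatim = @"href=""([^""]*\.txt)""";

    // Both literals produce the same pattern text.
    public static bool SamePattern()
    {
        return Escaped == Verbatim;
    }

    // Returns the first captured link, or an empty string when nothing matches.
    public static string FirstCapture(string input)
    {
        return Regex.Match(input, Verbatim, RegexOptions.IgnoreCase).Groups[1].Value;
    }
}
```

The verbatim form keeps the backslashes readable, at the cost of doubled quotes; which one reads better is a matter of taste.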

Then you can iterate over your Matches:

MatchCollection txtMatches = Regex.Matches(input, txtExp, RegexOptions.IgnoreCase);
foreach (Match m in txtMatches) {
  string filename = m.Groups[1].Value; // this is your captured group
}
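
Since the question asks for an IEnumerable of strings, and the answer mentions that new file extensions are easy to add, the loop above could be folded into something like the following sketch. LinkFinder, FindLinks and the extensions parameter are hypothetical names, and the code assumes href values are always wrapped in double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class LinkFinder
{
    // Returns href values that end in one of the given extensions
    // (passed without the leading dot, e.g. "txt").
    public static IEnumerable<string> FindLinks(string html, params string[] extensions)
    {
        if (extensions == null || extensions.Length == 0)
        {
            throw new ArgumentException("At least one extension is required.", "extensions");
        }

        // Regex.Escape guards against regex metacharacters in an extension name.
        string alternatives = string.Join("|", extensions.Select(e => Regex.Escape(e)));

        // Same shape as the pattern above, with the extension as a non-capturing alternation.
        string pattern = "href=\"([^\"]*\\.(?:" + alternatives + "))\"";

        return Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value);
    }
}
```

A call such as LinkFinder.FindLinks(html, "txt", "csv") would then cover both extensions without touching the rest of the code.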
