How to get the value of the content attribute from a meta tag?
Problem description
I'm working on getting values from meta tags. So far it has worked, but I'm stuck on a meta tag like the one below:
<meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image">
From this tag I'm not able to extract the URL string, which is in the content attribute of the meta tag.
What I have tried:
Regex meta = new Regex(@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
    @"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>");
WebClient web = new WebClient();
web.UseDefaultCredentials = true;
string page = web.DownloadString(url);
WebClient client = new WebClient();
// Add a user agent header in case the
// requested URI contains a query.
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
Stream data = client.OpenRead(url);
StreamReader reader = new StreamReader(data);
string s = reader.ReadToEnd();
//Console.WriteLine(s);
data.Close();
reader.Close();
MatchCollection mc = meta.Matches(s);
int mIdx = 0;
foreach (Match m in mc)
{
    for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
    {
        metadata.Add(m.Groups[gIdx].Value);
    }
    mIdx++;
}
Any solution?
Recommended answer
Use a regex debugger to see where the match fails:
Debuggex: Online visual regex tester. JavaScript, Python, and PCRE.
Paste your regex.
Paste the data to match.
Use the cursor to see where it fails.
When you have a valid regex, use the Code Snippet button at the top.
You will see that the problem is not what you think.
perlre - perldoc.perl.org
[Update]
Note: there is more than one regex dialect. JavaScript regex is not C# regex; the differences are in the details.
Find out which dialect C# uses and what the differences are.
By the way, JavaScript and C# strings do not handle special characters the same way.
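As a rough illustration of the advice above, here is a minimal sketch of a narrower C# pattern that only captures the content attribute of a meta tag, using .NET named groups. It is not the asker's code or a general HTML solution: the names MetaContentExtractor and ExtractContent are hypothetical, and the pattern assumes the attribute value is quoted and the tag contains no '>' inside attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaContentExtractor
{
    // Hypothetical helper pattern: finds a <meta ...> tag and captures the quoted
    // value of its content attribute in the named group "value".
    // Assumption: values are quoted and no attribute value contains '>'.
    private static readonly Regex MetaContent = new Regex(
        @"<meta\b[^>]*?\scontent\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)')[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the content value of the first matching meta tag,
    // or an empty string when no tag matches.
    public static string ExtractContent(string html)
    {
        Match m = MetaContent.Match(html);
        return m.Success ? m.Groups["value"].Value : "";
    }

    public static void Main()
    {
        // The tag from the question, written as a regular C# string literal.
        string html = "<meta content=\"/images/branding/googleg/1x/googleg_standard_color_128dp.png\" itemprop=\"image\">";
        Console.WriteLine(ExtractContent(html));
    }
}
```

Reading the named group directly avoids looping over every group of every match, which is what filled the metadata list with unrelated fragments in the original attempt.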
Instead of using Regex, you can use an HTML parser. I would recommend HTML Agility Pack, first of all: Html Agility Pack - Home.
-SA
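A minimal sketch of the parser approach, assuming the HtmlAgilityPack NuGet package is referenced by the project. The class and method names (MetaTagReader, GetItempropContent) are hypothetical, and the XPath is built by simple concatenation, so it assumes the itemprop value contains no quote characters.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class MetaTagReader
{
    // Returns the content attribute of the first <meta> tag whose itemprop
    // equals the given value, or an empty string when no such tag exists.
    public static string GetItempropContent(string html, string itemprop)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//meta[@itemprop='" + itemprop + "']");

        return node == null ? "" : node.GetAttributeValue("content", "");
    }

    public static void Main()
    {
        // The tag from the question, wrapped in a tiny document.
        string html = "<html><head>" +
            "<meta content=\"/images/branding/googleg/1x/googleg_standard_color_128dp.png\" itemprop=\"image\">" +
            "</head><body></body></html>";
        Console.WriteLine(GetItempropContent(html, "image"));
    }
}
```

With a parser, attribute order and quoting style stop mattering, which is the main reason it tends to be more robust than a hand-written pattern for this kind of task.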