Link extractor without unrelated links
Problem description
I have the following code for a link extractor, which extracts all internal links for a given URL:
SearchEngines Search = SearchEngines.Google;
LinksExtractor extractor = new LinksExtractor("http://yahoo.com/", Search, 10);
for (int i = 0; i < extractor.Links.Count; i++)
{
    Console.Write(extractor.Links[i].Href.ToString());
    //Console.ReadKey();
    Console.ReadLine();
}
This code gives me all the links inside yahoo.com, such as
yahoo.com/sports
yahoo.com/business
But it also returns unwanted links. For example, if there is an advertisement on yahoo.com for shadi.com, it returns shadi.com's link as well, which I don't want.
Please help.
Recommended answer
Is it that hard to ignore the links you don't want? For instance, anything that doesn't start with "http://yahoo.com/"?
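As a rough sketch of that prefix check, something like the following might do. The class and method names (PrefixFilter, KeepWithPrefix) are made up for illustration and are not part of LinksExtractor; the sketch also assumes each link has already been turned into a string, for example with Href.ToString() as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixFilter
{
    // Hypothetical helper: keeps only the hrefs that begin with one of the
    // accepted prefixes. The comparison ignores case.
    public static List<string> KeepWithPrefix(IEnumerable<string> hrefs, params string[] prefixes)
    {
        return hrefs
            .Where(href => prefixes.Any(p => href.StartsWith(p, StringComparison.OrdinalIgnoreCase)))
            .ToList();
    }
}
```

Passing several prefixes (say "http://yahoo.com/" and "http://www.yahoo.com/") is one way to cover the common variants, though a plain prefix test will still miss other subdomains and the https scheme unless they are listed too.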
I wonder if you can make use of Google's Advanced filtering capabilities in creating your WebRequest?
For example, this Google search shows you only sites within Yahoo.com, and only sites in English.
But perhaps you've already eliminated that as a strategy, so:
If extractor.Links is a collection of type IEnumerable<Link>, then you should be able to use a relatively simple LINQ filter operation like:
string matchStr = "yahoo.com";
var filteredMatches = extractor.Links.Where(link => link.Href.ToString().Contains(matchStr)).ToList<Link>();
Disclaimer: this code fragment is off the top of my head and may not work for you as is; it is not tested and may be flawed. It is intended only to suggest a strategy to you.
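One caveat with a substring test such as Contains("yahoo.com") is that it would also accept a URL that merely mentions yahoo.com somewhere, for example in a query string. A possible refinement, sketched below, is to parse each link with System.Uri and compare the host. HostFilter, IsOnDomain and KeepOnDomain are illustrative names, not part of LinksExtractor, and the sketch assumes the hrefs are absolute URLs (relative links are simply rejected here).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostFilter
{
    // Returns true when href parses as an absolute URL whose host is
    // exactly `domain` or a subdomain of it (e.g. sports.yahoo.com).
    public static bool IsOnDomain(string href, string domain)
    {
        if (!Uri.TryCreate(href, UriKind.Absolute, out Uri uri))
            return false; // not an absolute URL, so the host cannot be checked

        string host = uri.Host;
        return host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }

    // Hypothetical convenience wrapper: filters a sequence of href strings.
    public static List<string> KeepOnDomain(IEnumerable<string> hrefs, string domain)
    {
        return hrefs.Where(h => IsOnDomain(h, domain)).ToList();
    }
}
```

If extractor.Links really is enumerable and Href converts to a string as in the question, the same predicate should drop into the Where call above, along the lines of extractor.Links.Where(link => HostFilter.IsOnDomain(link.Href.ToString(), "yahoo.com")). That is untested against the actual LinksExtractor class.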