Link extractor without unrelated links

Problem description

I have the following code for a link extractor, which extracts all internal links for a given URL:

SearchEngines Search = SearchEngines.Google;
LinksExtractor extractor = new LinksExtractor("http://yahoo.com/",Search,10);
          
for (int i = 0; i < extractor.Links.Count; i++)
{
    Console.Write(extractor.Links[i].Href.ToString());
    //Console.ReadKey();
    Console.ReadLine();
}



This code gives me all the links inside yahoo.com, such as yahoo.com/sports and yahoo.com/business.
But it also returns unwanted links: for example, if there is an advertisement for shadi.com on Yahoo, it returns shadi.com's link as well, which I don't want.
Please help.

Recommended answer

Is it that hard to ignore the links you don't want? For instance, anything that doesn't start with "http://yahoo.com/"?
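As a minimal sketch of that suggestion, the check can be a plain prefix test. The question does not show the type behind extractor.Links, so this example works on URL strings; the names PrefixFilter and KeepWithPrefix are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixFilter
{
    // Hypothetical helper: returns only the URLs that begin with the
    // given prefix, compared case-insensitively.
    public static List<string> KeepWithPrefix(IEnumerable<string> urls, string prefix)
    {
        var kept = new List<string>();
        foreach (string url in urls)
        {
            if (url.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                kept.Add(url);
            }
        }
        return kept;
    }
}
```

With the question's code, this would presumably be called with the values of extractor.Links[i].Href.ToString() and the prefix "http://yahoo.com/". A strict prefix test also drops links such as "http://www.yahoo.com/" or "https://yahoo.com/", so the prefix has to match how the site actually writes its links.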


I wonder if you can make use of Google's advanced filtering capabilities when creating your WebRequest?

For example, this Google search shows you only sites within Yahoo.com, and only sites in English.

But perhaps you've already eliminated that as a strategy, so:

If extractor.Links is a collection of type IEnumerable<Link>, then you should be able to use a relatively simple LINQ filter operation (this needs a using directive for System.Linq) like:
string matchStr = "yahoo.com";

var filteredMatches = extractor.Links.Where(link => link.Href.ToString().Contains(matchStr)).ToList<Link>();

Disclaimer: this code fragment is off the top of my head and may not work for you as is. It is not tested and may be flawed; it is intended only to suggest a strategy to you.
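One caveat with a substring test is that Contains("yahoo.com") would also accept a URL such as "http://shadi.com/?ref=yahoo.com" or a host like "notyahoo.com". A variation on the same strategy, sketched below, compares the host part of each URL instead. It assumes the links are available as absolute URL strings; HostFilter, IsInternal, and KeepInternal are illustrative names, not part of the LinksExtractor library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostFilter
{
    // Hypothetical helper: true when the URL's host is the site itself
    // or one of its subdomains (e.g. sports.yahoo.com for yahoo.com).
    public static bool IsInternal(string url, string siteHost)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return false; // relative or malformed URLs are skipped in this sketch
        }

        string host = uri.Host;
        return host.Equals(siteHost, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + siteHost, StringComparison.OrdinalIgnoreCase);
    }

    // Keeps only the URLs whose host belongs to the given site.
    public static List<string> KeepInternal(IEnumerable<string> urls, string siteHost)
    {
        return urls.Where(u => IsInternal(u, siteHost)).ToList();
    }
}
```

Applied to the question's data, this might look like HostFilter.KeepInternal(extractor.Links.Select(l => l.Href.ToString()), "yahoo.com"), provided extractor.Links can be enumerated that way.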

