Anybody got any C# code to parse robots.txt and evaluate URLs against it


Question


Short question:

Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not?

Long question:

I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode.

The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.

I would have to assume that someone might publish the /oprah link on their blog or somewhere. We don't actually want this 'mini-Oprah site' to be indexed, because it would result in non-Oprah viewers being able to find the special Oprah offers.

So at the same time I was creating the sitemap, I also added URLs such as /oprah to our robots.txt file to be excluded.

Then (and this is the actual question) I thought 'wouldn't it be nice to be able to show on the sitemap whether or not files are indexed and visible to robots'. This would be quite simple - just parse robots.txt and then evaluate a link against it.

However this is a 'bonus feature' and I certainly don't have time to go off and write it (even though it's probably not that complex) - so I was wondering if anyone has already written any code to parse robots.txt?

Solution

Hate to say it, but just Google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a class Searcharoo.Indexer.RobotsTxt, described as:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules
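In case that link goes stale, the core idea from the question - parse robots.txt once, then evaluate each link against it - can be sketched in a few lines of C#. This is a minimal illustration with a hypothetical class name (not the Searcharoo code): it handles only the simple `User-agent: *` group and prefix-style `Disallow:` rules, ignoring wildcards, `Allow:` lines, and per-bot groups.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal robots.txt checker - a sketch, not a full
// implementation of the Robots Exclusion Protocol.
public class SimpleRobotsTxt
{
    private readonly List<string> _disallowed = new List<string>();

    public SimpleRobotsTxt(string robotsTxtContent)
    {
        bool inWildcardGroup = false;
        foreach (var rawLine in robotsTxtContent.Split('\n'))
        {
            // Strip comments and surrounding whitespace.
            var line = rawLine.Split('#')[0].Trim();
            if (line.Length == 0) continue;

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            var field = line.Substring(0, colon).Trim().ToLowerInvariant();
            var value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
                inWildcardGroup = value == "*";   // only honour the "*" group
            else if (field == "disallow" && inWildcardGroup && value.Length > 0)
                _disallowed.Add(value);           // empty Disallow means "allow all"
        }
    }

    // True if the given path (e.g. "/oprah/specials") matches
    // any Disallow prefix rule.
    public bool IsExcluded(string path)
    {
        foreach (var prefix in _disallowed)
            if (path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                return true;
        return false;
    }
}
```

With this, the sitemap page could instantiate one `SimpleRobotsTxt` from the downloaded file and call `IsExcluded("/oprah")` for each link to flag which ones are hidden from robots.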
