Tell bots apart from human visitors for stats?


Question

I am looking to roll my own simple web stats script.

The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).

Is there any open service that does that, like Akismet does for spam? Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?

To clarify: I'm not looking to block bots. I don't need 100% watertight results; I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.


Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.


Answer

Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful: if a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
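A minimal sketch of that UA check, assuming a locally cached pattern list (the patterns shown are illustrative; in practice you would auto-pull a maintained list and refresh it periodically):

```python
import re

# Illustrative bot UA substrings; in practice, auto-pull a maintained
# list and refresh it with a cron job. Even a stale copy still catches
# the major crawlers.
BOT_UA_PATTERNS = [r"googlebot", r"bingbot", r"slurp", r"crawler", r"spider"]
_BOT_UA_RE = re.compile("|".join(BOT_UA_PATTERNS), re.IGNORECASE)

def ua_looks_like_bot(user_agent: str) -> bool:
    """True if the User-Agent string matches a known bot pattern."""
    return bool(_BOT_UA_RE.search(user_agent or ""))
```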

Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
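Server-side, the invisible-link trap can be as simple as flagging any IP that requests the hidden URL. A sketch, assuming a hypothetical `/stats-trap` path and an in-memory set (the answer suggests a database table of bot IPs in practice):

```python
# Flag any client that follows the invisible link. TRAP_PATH and the
# in-memory set are illustrative assumptions, not part of the answer.
TRAP_PATH = "/stats-trap"

bot_ips: set = set()  # in production: a DB table, possibly with timestamps

def record_hit(ip: str, path: str) -> None:
    """Call for every request; requesting the hidden URL gives the bot away."""
    if path == TRAP_PATH:
        bot_ips.add(ip)

def is_known_bot(ip: str) -> bool:
    return ip in bot_ips
```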

Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, so we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page into our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a table (probably in-memory) of loads by IP and do a not-contained-in match, but that should be a really solid tell.
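The CSS comparison reduces to a set difference per IP. A sketch, with illustrative file names (`real.css` linked normally, `trap.css` also linked but disallowed in robots.txt):

```python
# Sketch of the robots.txt CSS trick. A human's browser fetches both
# stylesheets; a polite bot fetches only real.css because trap.css is
# disallowed in robots.txt. File names and in-memory sets are
# illustrative stand-ins for the per-IP load table the answer describes.
real_css_loaders: set = set()
trap_css_loaders: set = set()

def note_css_load(ip: str, filename: str) -> None:
    if filename == "real.css":
        real_css_loaders.add(ip)
    elif filename == "trap.css":
        trap_css_loaders.add(ip)

def polite_bot_ips() -> set:
    """IPs that fetched the real stylesheet but honoured robots.txt."""
    return real_css_loaders - trap_css_loaders
```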

So, to use all this: maintain a database table of bots by ip address, possibly with timestamp limitations. Add anything that follows your invisible link, add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.
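The combined pipeline might look like the following sketch; every path name and the UA pattern are illustrative assumptions, and a real deployment would back the sets with a database table:

```python
import re
from dataclasses import dataclass, field

# Illustrative UA pattern; in practice, auto-pull a maintained list.
_BOT_UA = re.compile(r"googlebot|bingbot|crawler|spider|slurp", re.IGNORECASE)

@dataclass
class BotFilter:
    """Combines the answer's signals; all path names are illustrative."""
    flagged: set = field(default_factory=set)   # trap-link / robots.txt hits
    real_css: set = field(default_factory=set)
    trap_css: set = field(default_factory=set)

    def observe(self, ip: str, path: str) -> None:
        if path in ("/stats-trap", "/robots.txt"):
            self.flagged.add(ip)                # invisible link or robots.txt fetch
        elif path == "/real.css":
            self.real_css.add(ip)
        elif path == "/trap.css":
            self.trap_css.add(ip)

    def is_bot(self, ip: str, ua: str = "") -> bool:
        if ip in self.flagged:
            return True
        if ip in self.real_css and ip not in self.trap_css:
            return True                         # obeyed the robots.txt CSS exclusion
        return bool(_BOT_UA.search(ua))         # UA check as the last step
```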
