Tell bots apart from human visitors for stats?

Problem Description

I am looking to roll my own simple web stats script.

The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).

Is there any open service that does that, like Akismet does for spam? Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?

To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.

Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.

Recommended Answer

Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
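
A minimal sketch of that User-Agent filter in PHP (the language the question asks about), assuming a locally cached pattern file; the file name bot_patterns.txt and its one-substring-per-line format are my assumptions, not part of the original answer:

    <?php
    // Match the visitor's UA string against a cached list of bot substrings
    // ("Googlebot", "bingbot", ...), one per line. Refresh the file however
    // rarely you like; a stale list still catches the obvious crawlers.
    function is_bot_user_agent($userAgent, $patternFile = 'bot_patterns.txt')
    {
        $patterns = @file($patternFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        if ($patterns === false) {
            return false; // list unavailable: count the hit rather than drop it
        }
        foreach ($patterns as $pattern) {
            if (stripos($userAgent, trim($pattern)) !== false) {
                return true;
            }
        }
        return false;
    }

    // Usage: skip recording the hit when the UA already gives the bot away.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (is_bot_user_agent($ua)) {
        exit;
    }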

Some have already mentioned JavaScript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
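
One way that trap could look in PHP, under my own assumptions about names and storage (a bot-trap.php endpoint and an SQLite bots table); the hidden link itself is whatever markup your pages can keep out of a human's sight:

    <?php
    // bot-trap.php (hypothetical name). Link to it invisibly from every page, e.g.
    //   <a href="/bot-trap.php" style="display:none" rel="nofollow">&nbsp;</a>
    // Humans never see or click the link, so anything requesting this URL gets
    // recorded as a bot by IP. SQLite is an assumption; use whatever storage
    // your stats script already has.
    $db = new PDO('sqlite:/var/www/stats.db');
    $db->exec('CREATE TABLE IF NOT EXISTS bots (ip TEXT PRIMARY KEY, first_seen INTEGER)');
    $db->prepare('INSERT OR IGNORE INTO bots (ip, first_seen) VALUES (?, ?)')
       ->execute(array($_SERVER['REMOTE_ADDR'], time()));
    header('HTTP/1.1 404 Not Found'); // give the crawler nothing useful back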

Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a (probably in-memory) table of loads by IP and do a not-contained-in match, but that should be a really solid tell.
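
A sketch of that CSS trap, again with hypothetical names (log-css.php, real.css, dummy.css) and SQLite instead of the in-memory table; robots.txt only has to disallow the dummy variant:

    <?php
    // log-css.php (hypothetical): serve both stylesheets through PHP so every load
    // is recorded per IP. robots.txt excludes only the dummy one, e.g.
    //   User-agent: *
    //   Disallow: /log-css.php?name=dummy
    // Browsers ignore robots.txt and fetch both; a polite bot fetches only "real".
    $name = (isset($_GET['name']) && $_GET['name'] === 'dummy') ? 'dummy' : 'real';
    $db = new PDO('sqlite:/var/www/stats.db');
    $db->exec('CREATE TABLE IF NOT EXISTS css_loads (ip TEXT, name TEXT, PRIMARY KEY (ip, name))');
    $db->prepare('INSERT OR IGNORE INTO css_loads (ip, name) VALUES (?, ?)')
       ->execute(array($_SERVER['REMOTE_ADDR'], $name));
    header('Content-Type: text/css');
    readfile($name === 'dummy' ? 'dummy.css' : 'real.css');

    // The not-contained-in match afterwards, as SQL over the same table:
    //   SELECT ip FROM css_loads WHERE name = 'real'
    //     AND ip NOT IN (SELECT ip FROM css_loads WHERE name = 'dummy');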

So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.
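
Putting it together, a sketch of that last filtering step as it might sit in the asker's stats script; the bots and hits tables and the is_bot_user_agent() helper are the hypothetical pieces from the sketches above, included or copied in:

    <?php
    // record-hit.php (hypothetical): the last-step filter described above. An IP
    // already flagged (trap link, robots.txt fetch, CSS mismatch) or carrying a
    // bot-looking User-Agent is excluded; everything else is counted as human.
    $db = new PDO('sqlite:/var/www/stats.db');
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    $known = $db->prepare('SELECT 1 FROM bots WHERE ip = ?');
    $known->execute(array($ip));

    if ($known->fetchColumn() || is_bot_user_agent($ua)) {
        exit; // known or suspected bot: leave it out of the stats
    }

    $db->prepare('INSERT INTO hits (ip, path, ts) VALUES (?, ?, ?)')
       ->execute(array($ip, $_SERVER['REQUEST_URI'], time()));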
