User agent header - abbreviation for MySQL storage

Problem description

According to this thread, and especially this post: https://stackoverflow.com/a/6595973/1125465, Microsoft, as always, shows off. A user agent string can be really, really huge.

I'm working on a small visitors library in PHP, and I want to store user agent information. I cannot decide on the data type and length.

So my question is: do you have any ideas on how to shorten the user agent to some "normal" size (for example, 256 characters)?


Note: Developers use user agents to detect the user's browser and operating system. So, according to the linked example, all the stupid numbers from M$ are just... just there. As always, getting on our nerves. So the idea is to make a function that shortens the user agent string without losing the important information.

I think that such a function should:

  • Not depend on future updates and new browsers (no hardcoded strings).
  • Have a simple mechanism that decides what to delete (for example, a run of number, comma, number, comma, number, ... is not interesting and can be deleted).
  • At the end, if the user agent is still too long after all these operations (let's say over 256 characters), there is nothing more to do, so just cut off the rest. This happens once in a million cases, so that data can be lost.

Additional note: I know that I could write a function that extracts the browser and OS type from the user agent and save only those values. But such functions always have hardcoded names, and if the browser isn't recognized they return something like "Unrecognized browser". So in the future everyone must remember to update these functions. If we save the shortened user agent instead, the information isn't lost (only the script that reads the database needs a new recognition system), and the entries in the database stay reliable and consistent, as they should be.


UPDATE: Since there should be some code, and the problem is with the idea rather than with existing code, here is the minimal code I have written so far ;) :

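<?php
    function shorten($useragent, $maxsize = 256) {
        $shorten = $useragent;
        // ... ? (the shortening logic is the open question)
        $shorten = substr($shorten, 0, $maxsize); // the "last hope" cut
        return $shorten;
    }
    // The header may be absent (e.g. CLI or stripped requests), so default to ''.
    echo shorten($_SERVER['HTTP_USER_AGENT'] ?? '');
?>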

Solution

There are no strictly followed rules for User-Agent strings, so there is no way to create a completely correct and future-proof parser. There is a general pattern though:

User-Agent: <engine-string> <engine-string> ...

Where engine-string has the form:

<agent-name> (<comment>; <comment>; ...)

Each engine string (that is just my own name for it, which may not be the correct term) may or may not have comments.

For example:

Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) ↲
AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e ↲
Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

(This is a single string; I just broke it into lines.) It seems that whenever someone forks a browser engine, they just append their own token to the end. So we have some abstract "Mozilla" browser (a legacy of the "First Browser War") which thinks it is on an iPhone. Then we see WebKit (which remembers that it was born as KHTML a long time ago). Then there is some Version/6.0 modification, which was then modified into Mobile/10A5376e, which became Safari/8536.25, which finally reveals the secret that it is actually a mobile Google bot.

Another example:

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.4; ↲
InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; ↲
.NET CLR 3.5.30729; .NET CLR 1.1.4322)

This is a single engine, but it has much to say in parentheses.
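The pattern described above can be turned into a small tokenizer. The following is only a sketch: the function name `ua_parse_sections` and the regular expression are my own assumptions, and it assumes that comments are never nested inside each other, which real-world strings do not guarantee.

```php
<?php
/**
 * Split a User-Agent string into "engine sections".
 * Each section is ['name' => string, 'comments' => string[]].
 * Hypothetical helper; assumes parentheses are not nested.
 */
function ua_parse_sections(string $ua): array
{
    $sections = [];
    // An agent name is a run of characters without whitespace or parentheses,
    // optionally followed by one parenthesized, semicolon-separated comment list.
    preg_match_all('/([^\s()]+)(?:\s*\(([^()]*)\))?/', $ua, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $comments = [];
        // With PREG_SET_ORDER the comment group is absent when it did not match.
        if (isset($m[2]) && trim($m[2]) !== '') {
            $comments = array_map('trim', explode(';', $m[2]));
        }
        $sections[] = ['name' => $m[1], 'comments' => $comments];
    }
    return $sections;
}

// The MSIE example from above, written as one string.
$ua = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.4; '
    . 'InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; '
    . '.NET CLR 3.5.30729; .NET CLR 1.1.4322)';
$sections = ua_parse_sections($ua);
echo count($sections), ' section(s), ', count($sections[0]['comments']), " comment(s)\n";
```

For the MSIE string this reports one section with eleven comments, which matches the "single engine with much to say" reading.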

So the general observation is:

  • last engine strings are most important,
  • last comments in parentheses are less important.

With that in mind, my idea would be to parse the string into these engine and comment tokens, then throw away the comments in each engine section starting from, say, the fifth. Then, if that is still not enough, throw away engine sections starting from the second (the first is often an abstract "Mozilla", but it often has useful comments; sometimes it is actually something concrete, especially for web crawlers).

When parsing, we need to take into account that occasionally there may be strings that do not follow this format. They can be saved to a log file for later inspection and then simply cut to the needed length to fit in the database.
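The two steps just described could look roughly like this. It is a sketch, not a tested production routine: the name `shorten_useragent`, the `$keepComments` parameter and the regular expression are assumptions of mine, nested parentheses are not handled, and I chose to always keep the last section because the last engine strings were called the most important.

```php
<?php
/**
 * Shorten a User-Agent string: trim comment lists first, then drop sections.
 * Hypothetical helper; $maxsize and $keepComments are illustrative defaults.
 */
function shorten_useragent(string $useragent, int $maxsize = 256, int $keepComments = 4): string
{
    if (strlen($useragent) <= $maxsize) {
        return $useragent; // already short enough, leave it untouched
    }

    // Tokenize into "name (comment; comment; ...)" sections (no nested parentheses).
    preg_match_all('/([^\s()]+)(?:\s*\(([^()]*)\))?/', $useragent, $matches, PREG_SET_ORDER);

    // Step 1: in every section keep only the first $keepComments comments.
    $sections = [];
    foreach ($matches as $m) {
        $section = $m[1];
        if (isset($m[2]) && trim($m[2]) !== '') {
            $comments = array_slice(array_map('trim', explode(';', $m[2])), 0, $keepComments);
            $section .= ' (' . implode('; ', $comments) . ')';
        }
        $sections[] = $section;
    }

    // Step 2: while still too long, drop sections starting from the second one.
    // The first and the last section are always kept (my own design choice).
    while (count($sections) > 2 && strlen(implode(' ', $sections)) > $maxsize) {
        array_splice($sections, 1, 1);
    }

    // The "last hope" cut from the question.
    return substr(implode(' ', $sections), 0, $maxsize);
}

$msie = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.4; '
    . 'InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; '
    . '.NET CLR 3.5.30729; .NET CLR 1.1.4322)';
echo shorten_useragent($msie, 100), "\n";
// prints: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
```

Note that rebuilding the string normalizes whitespace between sections, so the stored value is not guaranteed to be a literal prefix of the original header.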
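The fallback for unparsable strings might be sketched as follows. The function names, the log line format and the conformance check are assumptions of mine; the check simply strips everything that looks like `name (comments)` and treats any non-whitespace leftover (for example a stray parenthesis) as a sign that the string does not follow the pattern.

```php
<?php
/**
 * True when the string consists only of "name (comment; ...)" sections.
 * Hypothetical helper; nested parentheses count as non-conforming here.
 */
function ua_follows_pattern(string $ua): bool
{
    $rest = preg_replace('/[^\s()]+(?:\s*\([^()]*\))?/', '', $ua);
    // preg_replace() returns null on error; treat that as non-conforming too.
    return $rest !== null && trim($rest) === '';
}

/**
 * For a non-conforming string: append the original to $logFile and return it
 * cut to $maxsize bytes. Returns null for well-formed strings, which are left
 * to the regular shortening logic.
 */
function ua_cut_unparsable(string $ua, string $logFile, int $maxsize = 256): ?string
{
    if (ua_follows_pattern($ua)) {
        return null;
    }
    // Message type 3 makes error_log() append the message to the given file.
    error_log(date('c') . ' ' . $ua . PHP_EOL, 3, $logFile);
    return substr($ua, 0, $maxsize);
}
```

A caller would try `ua_cut_unparsable()` first and only run the section-based shortening when it returns null. `substr()` counts bytes, which is adequate as long as the header is plain ASCII.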
