How to reverse engineer Google's entity IDs

Problem description

Google uses entities everywhere nowadays, and their IDs are usually prefixed with /m/ or /g/ (though I have also seen some /t/ lately).

I am wondering how the numbering works. For /m/ there is a scheme similar to what a URL shortener would do: define an alphabet (for /m/ this is the 32 characters "0123456789bcdfghjklmnpqrstvwxyz_") and convert a number to a "short URL".

e.g. /m/0 4swd <-> 156524 ("/m/0" seems to be a kind of prefix)

I am stuck with /g/ IDs, though. I created a reasonable alphabet from the IDs I have seen, "0123456789bcdfghjklmnpqrstvwxyz_", but I cannot get it to work.

Since Google does some of the conversion itself, I have one real example: /g/11b6377dzp <-> 576462201963131861

from this: Google Search

But I still cannot figure this out.

I am mostly interested in the process: how does one get a handle on a reverse engineering problem like this (and, of course, what is the result)? Any ideas?

Solution

You provided the same alphabet for both cases, but your question implies that they are different. That aside, here's a description of the two encoding schemes.

Quoting from the Freebase developer wiki, here's the encoding for a machine ID:

The keys of machine-generated ids are short variable-length sequences of characters consisting of digits, lower-case letters excluding vowels, and underscore. ... (By avoiding vowels, we hope to avoid accidently [sic] generating offensive identifiers.) Mids are also URL-safe, i.e. they don't require any escaping or unescaping to be used in URLs.
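As a minimal sketch of the /m/ scheme, here is how the conversion from the question could look in Python. It assumes the 32-character alphabet given in the question, most-significant-digit-first ordering, and a fixed "/m/0" prefix; all three are inferred from the question's single example, not from official documentation. The names `decode_mid` and `encode_mid` are made up for illustration.

```python
# Assumed alphabet: digits, consonants (no vowels), underscore = 32 symbols.
MID_ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz_"
MID_PREFIX = "/m/0"  # assumption: treated as a fixed prefix, as in the question


def decode_mid(mid: str) -> int:
    """Interpret the part after the prefix as a base-32 number."""
    if not mid.startswith(MID_PREFIX):
        raise ValueError(f"not a /m/ ID: {mid!r}")
    value = 0
    for ch in mid[len(MID_PREFIX):]:
        # str.index raises ValueError for characters outside the alphabet.
        value = value * len(MID_ALPHABET) + MID_ALPHABET.index(ch)
    return value


def encode_mid(number: int) -> str:
    """Inverse of decode_mid for non-negative integers."""
    if number < 0:
        raise ValueError("number must be non-negative")
    digits = ""
    while number:
        number, rem = divmod(number, len(MID_ALPHABET))
        digits = MID_ALPHABET[rem] + digits
    return MID_PREFIX + digits


print(decode_mid("/m/04swd"))  # 156524
print(encode_mid(156524))      # /m/04swd
```

The round trip reproduces the example pair from the question (4·32³ + 24·32² + 27·32 + 12 = 156524), which is the kind of check that makes a guessed alphabet and digit order credible.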

The Google Knowledge Graph IDs are in a separate namespace with the prefix "/g/1", as you noticed, and their format, according to the relevant Wikidata property page, is

\/g\/1[0-9a-np-z][0-9a-np-z_]{6,8}

so the radix varies by position (no leading underscore allowed), and they chose to exclude only the confusable letter 'o', not all vowels, apparently preferring more encoding space despite the risk of "naughty words."
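The pattern above can be used directly to check whether a string has the shape of a /g/ ID. The following Python sketch only validates the format and derives the per-position alphabets from the character classes; it does not attempt to map an ID to a number, since the actual numeric encoding is not documented here. The helper name `looks_like_g_id` is hypothetical.

```python
import re
import string

# Format quoted from the Wikidata property page (slashes need no escaping in Python).
G_ID_PATTERN = re.compile(r"/g/1[0-9a-np-z][0-9a-np-z_]{6,8}")

# Alphabets implied by the character classes: every lowercase letter except 'o'.
FIRST_CHAR_ALPHABET = string.digits + string.ascii_lowercase.replace("o", "")
OTHER_CHAR_ALPHABET = FIRST_CHAR_ALPHABET + "_"


def looks_like_g_id(candidate: str) -> bool:
    """Return True if the whole string matches the /g/ ID format."""
    return G_ID_PATTERN.fullmatch(candidate) is not None


print(len(FIRST_CHAR_ALPHABET), len(OTHER_CHAR_ALPHABET))  # 35 36
print(looks_like_g_id("/g/11b6377dzp"))  # True
print(looks_like_g_id("/g/1_b6377dz"))   # False (leading underscore)
```

So the first position after "/g/1" has 35 possible symbols and the remaining positions have 36, which is why a single fixed 32-character alphabet cannot fit these IDs.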
