Elasticsearch find subdomains
Question
I am trying to find the subdomains of a main domain in Elasticsearch. I indexed a few domains:
$domains = [
    'site.com',
    'ns1.site.com',
    'ns2.site.com',
    'test.main.site.com',
    'sitesite.com',
    'test-site.com',
];
foreach ($domains as $domain) {
    $params = [
        'index' => 'my_index',
        'type' => 'my_type',
        'body' => ['domain' => $domain],
    ];
    $client->index($params);
}
Then I try to search:
$params = [
    'index' => 'my_index',
    'type' => 'my_type',
    'body' => [
        'query' => [
            'wildcard' => [
                'domain' => [
                    'value' => '.site.com',
                ],
            ],
        ],
    ],
];
$response = $client->search($params);
But it found nothing. :(
My mapping is: https://pastebin.com/raw/k9MzjJUM
Any ideas on how to solve this?
Thanks
Recommended answer
You're almost there; just a couple of things are missing.
It is enough to add * to your query (that is why this query is called wildcard):
POST my_index/my_type/_search
{
  "query": {
    "wildcard" : { "domain" : "*.site.com" }
  }
}
This will give you the following result:
{
  ...
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "RoE8VGMBRuo1XmkIXhp0",
        "_score": 1,
        "_source": {
          "domain": "test.main.site.com"
        }
      }
    ]
  }
}
This seems to work, but we only get one of the results (not all of them).
Returning to your mapping, the field domain has type text:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "domain": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
This means that the content of that field is tokenized and lowercased (by the standard analyzer). You can see which tokens are actually searchable by using the _analyze API, like this:
POST _analyze
{
  "text": "test.main.site.com"
}

{
  "tokens": [
    {
      "token": "test.main.site.com",
      "start_offset": 0,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
That is why the wildcard query could match test.main.site.com: the whole name is indexed as a single token, and that token ends with .site.com.
What if we take n1.site.com?
POST _analyze
{
  "text": "n1.site.com"
}

{
  "tokens": [
    {
      "token": "n1",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "site.com",
      "start_offset": 3,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
As you can see, there is no token that ends with .site.com (note the . before site.com), so the wildcard query has nothing to match for this value.
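The effect can be reproduced locally with a minimal sketch. It is not Elasticsearch: it uses Python's fnmatchcase as a rough stand-in for wildcard matching, and the token lists are copied by hand from the two _analyze responses shown earlier. The names TOKENS and wildcard_hits are made up for this illustration.

```python
from fnmatch import fnmatchcase

# Tokens copied from the two _analyze responses (text field, standard analyzer).
TOKENS = {
    "test.main.site.com": ["test.main.site.com"],
    "n1.site.com": ["n1", "site.com"],
}


def wildcard_hits(pattern, tokens_by_value):
    """Return the original values that have at least one token matching the pattern.

    This mimics the idea that a wildcard query on a text field is compared
    against individual tokens, not against the original string.
    """
    return [
        value
        for value, tokens in tokens_by_value.items()
        if any(fnmatchcase(token, pattern) for token in tokens)
    ]


print(wildcard_hits("*.site.com", TOKENS))
```

Only test.main.site.com is selected, because neither "n1" nor "site.com" ends with ".site.com".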
Fortunately, your mapping can already return all the results.
You can use the keyword sub-field (domain.keyword), which keeps the exact, unanalyzed value for querying:
POST my_index/my_type/_search
{
  "query": {
    "wildcard" : { "domain.keyword" : "*.site.com" }
  }
}
This will give you the following result:
{
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "RoE8VGMBRuo1XmkIXhp0",
        "_score": 1,
        "_source": {
          "domain": "test.main.site.com"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "Q4E8VGMBRuo1XmkIFRpy",
        "_score": 1,
        "_source": {
          "domain": "ns1.site.com"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "RYE8VGMBRuo1XmkIORqG",
        "_score": 1,
        "_source": {
          "domain": "ns2.site.com"
        }
      }
    ]
  }
}
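To see why exactly these three documents come back, here is a small local sketch that applies the same pattern to the whole, unanalyzed values from the question. It only approximates wildcard semantics with Python's fnmatchcase and does not talk to Elasticsearch; DOMAINS and keyword_wildcard are illustrative names.

```python
from fnmatch import fnmatchcase

# The six domains indexed in the question, as exact (keyword-like) values.
DOMAINS = [
    "site.com",
    "ns1.site.com",
    "ns2.site.com",
    "test.main.site.com",
    "sitesite.com",
    "test-site.com",
]


def keyword_wildcard(pattern, values):
    """Match the pattern against each whole value, as with a keyword field."""
    return [value for value in values if fnmatchcase(value, pattern)]


print(keyword_wildcard("*.site.com", DOMAINS))
```

The bare site.com, sitesite.com, and test-site.com are left out because none of them ends with a dot followed by site.com. The order printed here is simply the list order, not Elasticsearch's hit order.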
Is this the best way to do "ends with" queries?
Actually, no. A wildcard query can be slow:
"Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?."
To achieve the best performance in your case, I would suggest creating another field, higherLevelDomains, and manually extracting the higher-level domains from the original value at index time. The document might look like:
POST my_index/my_type
{
  "domain": "test.main.site.com",
  "higherLevelDomains": [
    "main.site.com",
    "site.com",
    "com"
  ]
}
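The manual extraction could look something like the following sketch. The helper names higher_level_domains and build_document are hypothetical, and lowercasing the input is an assumption made here so that exact-match lookups behave consistently; adapt it to however you normalize domains.

```python
def higher_level_domains(domain):
    """Return every parent domain of `domain`, from the closest up to the TLD.

    Assumption: the input is a plain dot-separated host name; it is
    lowercased and stripped of leading/trailing dots before splitting.
    """
    labels = domain.lower().strip(".").split(".")
    # Drop one leading label at a time: a.b.c -> ["b.c", "c"]
    return [".".join(labels[i:]) for i in range(1, len(labels))]


def build_document(domain):
    """Build the document body to index, mirroring the example document."""
    return {
        "domain": domain,
        "higherLevelDomains": higher_level_domains(domain),
    }


print(build_document("test.main.site.com"))
```

Note that site.com itself only gets ["com"], so a lookup for "site.com" in higherLevelDomains returns subdomains but not the main domain.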
This will allow you to use a term query:
POST my_index/my_type/_search
{
  "query": {
    "term" : { "higherLevelDomains.keyword" : "site.com" }
  }
}
This is probably the most efficient query you can get with Elasticsearch for such a task.
Hope that helps!