Elasticsearch: find subdomains

Problem description

I'm trying to find the subdomains of a main domain in Elasticsearch. I indexed a few domains:

$domains = [
    'site.com',
    'ns1.site.com',
    'ns2.site.com',
    'test.main.site.com',
    'sitesite.com',
    'test-site.com',
];
foreach ($domains as $domain) {
    $params = [
        'index' => 'my_index',
        'type' => 'my_type',
        'body' => ['domain' => $domain],
    ];
    $client->index($params);
}

Then I try to search:

$params = [
    'index' => 'my_index',
    'type' => 'my_type',
    'body' => [
        'query' => [
            'wildcard' => [
                'domain' => [
                    'value' => '.site.com',
                ],
            ],
        ],
    ],
];
$response = $client->search($params);

But it finds nothing. :(

My mapping is: https://pastebin.com/raw/k9MzjJUM

Any ideas how to solve this?

Thanks

Recommended answer

You're almost there, just a couple of things are missing.

Adding * to your query is enough (that is why this query is called wildcard):

POST my_index/my_type/_search
{
    "query": {
        "wildcard" : { "domain" : "*.site.com" }
    }
}

This will give you the following result:

{
  ...
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "RoE8VGMBRuo1XmkIXhp0",
        "_score": 1,
        "_source": {
          "domain": "test.main.site.com"
        }
      }
    ]
  }
}

This seems to work, but we only get one of the results (not all of them).

Going back to your mapping, the field domain has type text:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "domain": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

This means that the content of that field is tokenized and lowercased (by the standard analyzer). You can see which tokens will actually be searchable by using the _analyze API, like this:

POST _analyze
{
  "text": "test.main.site.com"
}

{
  "tokens": [
    {
      "token": "test.main.site.com",
      "start_offset": 0,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

That's why the wildcard query could match test.main.site.com: the whole name is kept as a single token, and that token ends with .site.com.

What if we take n1.site.com?

POST _analyze
{
  "text": "n1.site.com"
}

{
  "tokens": [
    {
      "token": "n1",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "site.com",
      "start_offset": 3,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

As you can see, there is no token that ends with .site.com (note the . before site.com).
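
To make the mismatch concrete, here is a small PHP sketch that replays the "ends with .site.com" test against the tokens reported by _analyze above. The token lists are copied by hand from those responses rather than produced by a real analyzer, and the helper names endsWith and anyTokenMatches are illustrative only; they are not part of any Elasticsearch client.

```php
<?php
// Tokens copied from the _analyze responses above (not computed here).
$tokensByDomain = [
    'test.main.site.com' => ['test.main.site.com'],
    'n1.site.com'        => ['n1', 'site.com'],
];

// Hypothetical helper: true when $value ends with $suffix, which is
// what the pattern "*.site.com" asks of a single indexed term.
function endsWith(string $value, string $suffix): bool
{
    return $suffix === '' || substr($value, -strlen($suffix)) === $suffix;
}

// Hypothetical helper: on a text field the wildcard is compared with each
// token separately, so a document matches only if one of its tokens does.
function anyTokenMatches(array $tokens, string $suffix): bool
{
    foreach ($tokens as $token) {
        if (endsWith($token, $suffix)) {
            return true;
        }
    }
    return false;
}

foreach ($tokensByDomain as $domain => $tokens) {
    $verdict = anyTokenMatches($tokens, '.site.com') ? 'match' : 'no match';
    echo $domain . ': ' . $verdict . PHP_EOL;
}
```

Only the single-token name passes the check; for n1.site.com neither n1 nor site.com carries the leading dot the pattern requires.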

Fortunately, your mapping is already capable of returning all the results.

You can use the keyword sub-field, which stores the exact, untokenized value for querying:

POST my_index/my_type/_search
{
    "query": {
        "wildcard" : { "domain.keyword" : "*.site.com" }
    }
}

This will give you the following result:

{
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "RoE8VGMBRuo1XmkIXhp0",
        "_score": 1,
        "_source": {
          "domain": "test.main.site.com"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "Q4E8VGMBRuo1XmkIFRpy",
        "_score": 1,
        "_source": {
          "domain": "ns1.site.com"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "RYE8VGMBRuo1XmkIORqG",
        "_score": 1,
        "_source": {
          "domain": "ns2.site.com"
        }
      }
    ]
  }
}
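
Since the question uses the PHP client, the same request could be sketched as a params array in the style of the question's own search call. The helper name subdomainSearchParams is hypothetical; the index and type names are taken from the question, and passing the result to $client->search() is assumed to behave as in the question's code.

```php
<?php
// Hypothetical helper: builds search params for every subdomain of $mainDomain
// by running the wildcard against the untokenized domain.keyword sub-field.
function subdomainSearchParams(string $mainDomain): array
{
    return [
        'index' => 'my_index',
        'type'  => 'my_type',
        'body'  => [
            'query' => [
                'wildcard' => [
                    'domain.keyword' => [
                        // Leading "*." so only names ending in ".<main domain>" match.
                        'value' => '*.' . $mainDomain,
                    ],
                ],
            ],
        ],
    ];
}

$params = subdomainSearchParams('site.com');
echo json_encode($params['body'], JSON_PRETTY_PRINT) . PHP_EOL;
// With a configured client, as in the question:
// $response = $client->search($params);
```

Because the pattern starts with "*.", the names site.com, sitesite.com and test-site.com are not matched, which agrees with the three hits shown above.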

Is this the best way to run an "ends with" query?

Actually, no. wildcard queries can be slow:

Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?.

To achieve the best performance in your case, I would suggest creating another field, higherLevelDomains, and manually extracting the higher-level domains from the original value. The document might look like this:

POST my_index/my_type
{
  "domain": "test.main.site.com",
  "higherLevelDomains": [
    "main.site.com",
    "site.com",
    "com"
  ]
}
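
The manual extraction can be done on the client before indexing. The following is a minimal PHP sketch, assuming domain names are plain dot-separated labels; the function name higherLevelDomains is hypothetical (it simply mirrors the field name), and lowercasing the input is an added assumption so that exact term matching stays case-consistent.

```php
<?php
// Hypothetical helper: lists every parent of a domain name,
// e.g. the parents of "a.b.c" are "b.c" and "c".
function higherLevelDomains(string $domain): array
{
    // Assumption: lowercase first, since term queries compare exact values.
    $labels = explode('.', strtolower($domain));
    $parents = [];
    // Drop one leading label at a time; start at 1 to skip the domain itself.
    for ($i = 1; $i < count($labels); $i++) {
        $parents[] = implode('.', array_slice($labels, $i));
    }
    return $parents;
}

// Document body in the shape shown above, usable as 'body' when indexing.
$domain = 'test.main.site.com';
$body = [
    'domain'             => $domain,
    'higherLevelDomains' => higherLevelDomains($domain),
];
echo json_encode($body, JSON_PRETTY_PRINT) . PHP_EOL;
```

In the question's indexing loop, this array would replace ['domain' => $domain] as the body passed to $client->index().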

This will allow you to use a term query:

POST my_index/my_type/_search
{
    "query": {
        "term" : { "higherLevelDomains.keyword" : "site.com" }
    }
}

This is probably the most efficient query you can get with Elasticsearch for such a task.

Hope that helps!
