How to escape a URL for Elasticsearch?


Question


In one of my fields in Elasticsearch I'm storing the URL of my documents (e.g. http://techcrunch.com/something-great).

When I don't escape the URL, the document is found correctly, but I get an EOF error on some URLs.

When I escape the URL with:

String escapedString = QueryParser.escape(e.getKey().getUrl());

The document is not found - I get zero hits.

So how should I do this? Here is a sample document:


{
    _index: "crawlbot",
    _type: "article",
    _id: "AVFaaFu4w49jUzVInKS5",
    _score: 1,
    _source: {
        job: {
            id: 65,
            name: "wikipedia_en",
            max_pages: 300000,
            crawl_depth: 0,
            processing_patterns: "-Category,-User,-Wikipedia:,-Topic,-Special:,-Talk:,-Portal:,-MOS",
            status: 0,
            days: 0,
            url: [
                "https://en.wikipedia.org"
            ],
            ajax: false,
            min_description: 0
        },
        article: {
            url: "https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Parliament_of_Romania",
            provider_url: "https://en.wikipedia.org",
            provider_name: "",
            provider_display: "en.wikipedia.org",
            favicon_url: "http://www.google.com/s2/u/0/favicons?domain=https://en.wikipedia.org",
            language: "en",
            metadata: {
                authors: []
            },
            entities: [],
            keywords: [],
            videos: [],
            unfilteredKeywords: [],
            published: "",
            published_long: 0
        }
    }
}

I would like to retrieve the document by its article.url.

This is the query:

SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
String escapedString = QueryParser.escape(e.getKey().getUrl());
queryBuilder.must(QueryBuilders.queryStringQuery(escapedString).defaultField("article.url"));
queryBuilder.must(QueryBuilders.queryStringQuery(e.getKey().getJob().getId() + "").defaultField("job.id"));

This is the error I get if I don't escape:

Exception in thread "main" org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], all shards failed; shardFailures {[9_T8APppReyWKppSNZWmXw][crawlbot][0]: SearchParseException[[crawlbot][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
(the same failure is reported for shards [1] through [4] of [crawlbot])
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
    at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Solution

I suggest you change the mapping of your article.url field to:

"url": {
    "type": "string",
    "index": "not_analyzed"
}

Otherwise the field is analyzed, and because the standard analyzer breaks the URL up into several tokens, it becomes very hard to query for an exact URL.
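To make the difference concrete, here is a minimal plain-Java sketch. It is only a rough approximation: the real standard analyzer uses Unicode word segmentation and may split at slightly different places, and the class and method names are hypothetical. It just shows that an analyzed URL ends up as several small terms, while a not_analyzed URL stays one term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this does not call Lucene or Elasticsearch.
final class UrlTokenSketch {

    // Rough stand-in for an analyzed field: lowercase the value,
    // then split on every run of characters that are not letters or digits.
    static List<String> analyzedTokens(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Stand-in for a not_analyzed field: the whole value is indexed as one term.
    static List<String> notAnalyzedTokens(String url) {
        List<String> tokens = new ArrayList<>();
        tokens.add(url);
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://techcrunch.com/something-great";
        System.out.println("analyzed:     " + analyzedTokens(url));
        System.out.println("not_analyzed: " + notAnalyzedTokens(url));
    }
}
```

With the field split into pieces like this, no single indexed term equals the full URL, which is why an exact lookup needs the not_analyzed mapping.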

Then, instead of using a query_string query, you can use a term query in order to query your documents.

SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
queryBuilder.must(QueryBuilders.termQuery("article.url", e.getKey().getUrl()));
...                                 ^
                                    |
                        use a term query instead
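For background on why the query_string route is fragile: the query_string query parses its input with the Lucene query syntax, where characters such as : ? and / have special meaning. A likely cause of the EOF error is that / delimits a regular expression in that syntax, so a URL with an odd number of slashes leaves one unterminated. The sketch below is a plain-Java re-implementation of the kind of escaping QueryParser.escape performs, as far as I understand it (the class name is hypothetical and the character list is my assumption of what Lucene escapes), to show what the escaped string looks like. A term query skips the parser entirely, so none of this is needed.

```java
// Hypothetical re-implementation for illustration; prefer the real
// QueryParser.escape if you do have to build a query_string query.
final class UrlEscapeSketch {

    // Characters assumed to be special in the Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefix every special character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "http://www.zeit.de/wirtschaft/2015-11/"
                + "griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4";
        System.out.println(escape(url));
    }
}
```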

UPDATE

Following up on Evaldas' comment (kudos Evaldas!), in the end the idea is to create a custom analyzer in order to make sure that the URL will be lowercased as well.

When creating your index, you can add a new analyzer in the settings and then use it as the analyzer of your article.url field:

PUT /crawlbot
{
    "settings": {
        "analysis": {
            "analyzer": {
                "url_analyzer": {
                    "type":         "custom",
                    "tokenizer":    "keyword",
                    "filter":       [ "lowercase" ]
                }
            }
        }
    },
    "mappings": {
        "article": {
            "properties": {
                "article": {
                    "properties": {
                        "url": {
                            "type": "string",
                            "analyzer": "url_analyzer"
                        }
                    }
                }
            }
        }
    }
}
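One consequence worth keeping in mind: a term query does not analyze its input, so once the field is lowercased at index time, the value you look up has to be lowercased too (or you can use a match query, which runs the field's analyzer on the input). A minimal sketch of the client-side normalization, assuming ASCII URLs and using a hypothetical helper name:

```java
import java.util.Locale;

// Hypothetical helper: mirrors what the keyword tokenizer plus lowercase
// filter would produce for a plain ASCII URL (one lowercased term).
final class UrlTermNormalizer {

    static String toIndexedTerm(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String raw = "HTTP://TechCrunch.com/Something-Great";
        // The result is what you would pass as the value of the term query.
        System.out.println(toIndexedTerm(raw));
    }
}
```

In the query above, that would mean passing the normalized string as the second argument to QueryBuilders.termQuery instead of the raw URL.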
