How to escape a URL for Elasticsearch?
In one of my fields in Elasticsearch I'm storing the URL of my documents (e.g. http://techcrunch.com/something-great).
When I don't escape the URL, the document is found correctly - but I get the EOF error on some URLs.
When I escape the URL with:
String escapedString = QueryParser.escape(e.getKey().getUrl());
The document is not found - I get zero hits.
So how should I do this?
{
  _index: "crawlbot",
  _type: "article",
  _id: "AVFaaFu4w49jUzVInKS5",
  _score: 1,
  _source: {
    job: {
      id: 65,
      name: "wikipedia_en",
      max_pages: 300000,
      crawl_depth: 0,
      processing_patterns: "-Category,-User,-Wikipedia:,-Topic,-Special:,-Talk:,-Portal:,-MOS",
      status: 0,
      days: 0,
      url: [
        "https://en.wikipedia.org"
      ],
      ajax: false,
      min_description: 0
    },
    article: {
      url: "https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Parliament_of_Romania",
      provider_url: "https://en.wikipedia.org",
      provider_name: "",
      provider_display: "en.wikipedia.org",
      favicon_url: "http://www.google.com/s2/u/0/favicons?domain=https://en.wikipedia.org",
      language: "en",
      metadata: {
        authors: []
      },
      entities: [],
      keywords: [],
      videos: [],
      unfilteredKeywords: [],
      published: "",
      published_long: 0
    }
  }
}
I would like to retrieve the document by its article.url.
This is the query:
SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
String escapedString = QueryParser.escape(e.getKey().getUrl());
queryBuilder.must(QueryBuilders.queryStringQuery(escapedString).defaultField("article.url"));
queryBuilder.must(QueryBuilders.queryStringQuery(e.getKey().getJob().getId() + "").defaultField("job.id"));
Error if I don't escape:
Exception in thread "main" org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], all shards failed; shardFailures {[9_T8APppReyWKppSNZWmXw][crawlbot][0]: SearchParseException[[crawlbot][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][1]: SearchParseException[[crawlbot][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][2]: SearchParseException[[crawlbot][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][3]: SearchParseException[[crawlbot][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][4]: SearchParseException[[crawlbot][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I suggest you change the mapping of your article.url field to:
"url": {
  "type": "string",
  "index": "not_analyzed"
}
Otherwise the field is analyzed and becomes very hard to query, because the standard analyzer breaks the URL up into several tokens.
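To illustrate the point about tokens, here is a rough, stdlib-only Java sketch. It is not Lucene's standard tokenizer (the real one follows Unicode word-boundary rules and produces somewhat different tokens). The class `UrlTokenDemo` and its helpers are hypothetical names used only for this illustration. It shows that an analyzed field ends up holding several small terms, none of which equals the full URL, while a not_analyzed field holds the URL as one term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Crude stand-in for an analyzed string field. This is NOT Lucene's standard
// tokenizer; it only shows that one URL ends up as many separate terms.
final class UrlTokenDemo {

    // Hypothetical helper: lowercase the input and split on every run of
    // characters that are neither letters nor digits.
    static List<String> crudeTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Hypothetical helper: a not_analyzed field keeps the value as a single term.
    static List<String> singleTerm(String text) {
        List<String> terms = new ArrayList<>();
        terms.add(text);
        return terms;
    }

    public static void main(String[] args) {
        String url = "http://techcrunch.com/something-great";
        System.out.println(crudeTokens(url)); // [http, techcrunch, com, something, great]
        System.out.println(singleTerm(url));  // [http://techcrunch.com/something-great]
        // An exact lookup for the full URL only succeeds against the single term.
        System.out.println(crudeTokens(url).contains(url)); // false
        System.out.println(singleTerm(url).contains(url));  // true
    }
}
```

The exact token boundaries in a real index will differ, but the consequence is the same: an exact-match lookup on the whole URL needs the whole URL to be stored as one term.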
Then, instead of using a query_string query, you can use a term query in order to query your documents.
SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
// use a term query instead of a query_string query
queryBuilder.must(QueryBuilders.termQuery("article.url", e.getKey().getUrl()));
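For context on why the query_string approach was fragile: a term query takes the URL verbatim, whereas query_string runs it through the Lucene query parser, where characters such as `:`, `?`, `-` and `/` are syntax. In that syntax a pair of `/` delimits a regular expression, so one plausible reading of the EOF error (my assumption, not confirmed in the thread) is that a URL with an odd number of slashes leaves the last regular expression unterminated. The sketch below is a stdlib-only illustration. `QueryStringEscapeDemo` and its methods are hypothetical names, and the character list reflects the reserved characters of the Lucene query syntax as I understand them, so please check it against your Lucene version.

```java
// Stdlib-only sketch; it does not call Lucene. It backslash-escapes the
// characters that the Lucene query syntax reserves, and counts unescaped
// slashes to show why some URLs parse and others do not.
final class QueryStringEscapeDemo {

    // Assumed list of reserved characters: \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefix every reserved character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Count '/' characters that are not preceded by an escaping backslash.
    static int unescapedSlashes(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // skip the character that the backslash escapes
            } else if (c == '/') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String failing = "http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4";
        String parsing = "https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Parliament_of_Romania";
        System.out.println(unescapedSlashes(failing));         // 5 (odd)
        System.out.println(unescapedSlashes(parsing));         // 4 (even)
        System.out.println(unescapedSlashes(escape(failing))); // 0
    }
}
```

With a term query against a not_analyzed field none of this is needed, because the value never reaches the query parser.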
UPDATE
Following up on Evaldas' comment (kudos Evaldas!): in the end, the idea is to create a custom analyzer so that the URL is also lowercased.
When creating your index, you can add a new analyzer in the settings and then use it as the analyzer of your article.url field:
PUT /crawlbot
{
  "settings": {
    "analysis": {
      "analyzer": {
        "url_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "article": {
          "properties": {
            "url": {
              "type": "string",
              "analyzer": "url_analyzer"
            }
          }
        }
      }
    }
  }
}
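One thing to keep in mind with this analyzer: a term query is not analyzed, so the value you pass has to match the indexed term exactly, which is now the lowercased URL. A minimal sketch of normalizing the value on the client side follows. `UrlTermNormalizer` and `toIndexedTerm` are hypothetical names, and `toLowerCase(Locale.ROOT)` is only an approximation of what the lowercase token filter does.

```java
import java.util.Locale;

// Mimics "keyword tokenizer + lowercase filter": the whole URL stays one
// token and is lowercased. Approximation only; it does not call Lucene.
final class UrlTermNormalizer {

    // Hypothetical helper: produce the term expected in the index so that
    // it can be passed to a term query on article.url.
    static String toIndexedTerm(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String raw = "HTTP://TechCrunch.com/Something-Great";
        System.out.println(toIndexedTerm(raw)); // http://techcrunch.com/something-great
    }
}
```

The result could then be passed in place of the raw URL, for example `QueryBuilders.termQuery("article.url", UrlTermNormalizer.toIndexedTerm(e.getKey().getUrl()))`. Alternatively, a match query would apply the field's analyzer to the query text for you.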