How to index a dump of HTML files into Elasticsearch?


Problem description

I am totally new to Elasticsearch, so my knowledge comes only from the Elasticsearch site, and I need help. My task is to index a large amount of raw data in HTML format into Elasticsearch. I have already crawled my data and stored it on disk (200 000 HTML files). My question is: what is the simplest way to index all the HTML files into Elasticsearch? Should I do it manually, making a PUT request to Elasticsearch for each document? For example:

curl -XPUT 'http://localhost:9200/registers/tomas/1' -d '{
    "user" : "tomasko",
    "post_date" : "2009-11-15T14:12:12",
    "field 1" : "field data",
    "field 2" : "field 2 data"
}'

My second question: do I have to parse each HTML document to retrieve the data for JSON field 1, as in the example code above?

And finally, after indexing, may I delete all the HTML documents? Thanks to all.

Solution

I'd look at the bulk API, which allows you to send more than one document in a single request, in order to speed up your indexing process. You can send batches of 10, 20 or more documents, depending on how big they are.
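As a rough illustration of the batching idea, here is a minimal Python sketch that only builds the request bodies and sends nothing. The bulk API expects newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The helper names `batches` and `build_bulk_body`, the sample documents, and the batch size of 2 are made up for this example; the index name `registers` is taken from the question. Older Elasticsearch versions may also expect a `_type` in the action line.

```python
import json
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_bulk_body(docs, index_name):
    """Build an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Illustrative documents; real ones would come from the parsed HTML files.
docs = [(str(i), {"user": "tomasko", "field 1": f"field data {i}"}) for i in range(1, 6)]

# One request body per batch of 2 documents (tune the size to your documents).
bodies = [build_bulk_body(chunk, "registers") for chunk in batches(docs, 2)]
```

Each body would then be sent as a POST to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header, for example with curl's `--data-binary` option so the newlines are preserved.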

Depending on what you want to index, you might need to parse the HTML, unless you want to index the whole HTML as a single field (in that case you might want to use the html strip char filter to strip the HTML tags out of the indexed text).
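If you do decide to parse the files yourself, something along these lines could turn one HTML page into a JSON-ready document. This is only a sketch using Python's standard `html.parser` module; the class `TextExtractor`, the function `html_to_doc`, and the field names `title` and `body` are assumptions chosen for illustration, not anything Elasticsearch requires.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> text and the visible text, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif not self._skip_depth:
            self.chunks.append(text)


def html_to_doc(html):
    """Return a dict with hypothetical 'title' and 'body' fields for indexing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "body": " ".join(parser.chunks)}


sample = (
    "<html><head><title>Register 1</title><style>p{}</style></head>"
    "<body><h1>Hello</h1><p>field <b>data</b></p></body></html>"
)
doc = html_to_doc(sample)
```

For messy real-world pages a more forgiving parser may be a better fit; the point here is only that each file ends up as a small dictionary that can be serialized to JSON and indexed.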

After indexing, I'd suggest making sure the mapping is correct and that you can find what you're looking for. You can always reindex using the _source special field that Elasticsearch stores under the hood, but if you have already written your indexer code you might want to use it again to reindex when needed (with the same HTML documents, of course). In practice, you never index your data just once, so be careful :) Even so, Elasticsearch always helps you out with the _source field: it's just a matter of querying the existing index and reindexing all its documents into another index.
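On Elasticsearch versions that provide the `_reindex` endpoint, copying documents from one index to another via `_source` can be expressed as a single request instead of a hand-written query-and-index loop. The sketch below only builds that request and does not send it; the function name `reindex_request` and the destination index `registers_v2` are hypothetical, and it assumes `_source` is enabled on the source index.

```python
import json


def reindex_request(source_index, dest_index):
    """Return (method, path, JSON body) for a _reindex call; nothing is sent."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return "POST", "/_reindex", json.dumps(body)


# 'registers' comes from the question; 'registers_v2' is a made-up target index
# that would be created with the corrected mapping before reindexing.
method, path, payload = reindex_request("registers", "registers_v2")
```

If the mapping change depends on data that was never indexed in the first place, `_source` cannot supply it, which is one reason to keep the original HTML files and the indexer code around.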
