Can I store the HTML content of a webpage in StormCrawler?


Question

I am using storm-crawler-elastic. I can see the fetched URLs and their status. Changing the configuration in the ES_IndexInit.sh file only gives url, title, host, and text. Can I also store the entire HTML content, with its tags?

Answer

The ES IndexerBolt receives the content of pages but does not do anything with it. One option would be to modify the code so that it pulls the content field from the incoming tuples and indexes it.

Alternatively, you could implement a custom ParseFilter which copies the content of the page into a metadata key/value, and then configure that field to be indexed via indexer.md.mapping in the configuration file.
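The ParseFilter approach might look roughly like the sketch below. It is illustrative only: the class name and the `html.content` metadata key are my own choices, and it assumes StormCrawler's `ParseFilter`/`ParseResult`/`Metadata` API, so check it against the version you are running.

```java
package com.example.filters;

import java.nio.charset.StandardCharsets;

import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

/**
 * Hypothetical ParseFilter sketch: copies the raw page bytes into a
 * metadata key so the indexer can pick it up via indexer.md.mapping.
 */
public class HtmlContentFilter extends ParseFilter {

    @Override
    public void filter(String url, byte[] content, DocumentFragment doc,
            ParseResult parse) {
        // Store the full HTML (tags included) under a metadata key.
        // Assumes the content is UTF-8 encoded; adapt if needed.
        Metadata md = parse.get(url).getMetadata();
        md.setValue("html.content",
                new String(content, StandardCharsets.UTF_8));
    }
}
```

You would then register the class in your parse filters configuration and map the key in the crawler config, e.g. an `indexer.md.mapping` entry such as `html.content=content`, so that the indexer sends it to the `content` field in ES.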

Either way, you'd need to modify ES_IndexInit.sh so that the field in ES gets indexed and/or stored the way you want it.
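For the ES_IndexInit.sh side, the change could look like the fragment below. This is a hedged sketch, not the script's actual contents: the `content` field name, the `$ESHOST` variable, and the index name are assumptions, and the mapping syntax shown is for a recent Elasticsearch version. Here the field is stored verbatim but not made searchable; drop `"index": false` if you want full-text search over the raw HTML.

```shell
# Illustrative fragment for ES_IndexInit.sh — adapt names and the
# existing mapping to your setup before using.
curl -s -XPUT "$ESHOST/index" -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "index": false,
        "store": true
      }
    }
  }
}'
```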

