How to parse content located in specific HTML tags using a Nutch plugin?
Problem description
I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages it crawls. For example:
<h><title> title to search </title></h>
<div id="abc">
content to search
</div>
<div class="efg">
other content to search
</div>
I want to parse the div elements with id="abc" and class="efg", and so on.
I know that I have to create a plugin for customized parsing, because the htmlparser plugin provided by Nutch removes all HTML tags, CSS and JavaScript content and leaves only the text content. I referred to this blog post http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html but found that it covers parsing by HTML tag, whereas I want to parse HTML tags whose attributes have specific values. I found that Jericho has been mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.
I need some guidance on how to devise a strategy for parsing HTML pages on the basis of tags whose attributes have specific values.
You can use this plugin to extract data from your pages based on CSS rules:
https://github.com/BayanGroup/nutch-custom-search
For your example, you can configure it like this:
<config>
  <fields>
    <field name="custom_content" />
  </fields>
  <documents>
    <document url=".+" engine="css">
      <extract-to field="custom_content">
        <text>
          <expr value="#abc" />
        </text>
        <text>
          <expr value=".efg" />
        </text>
      </extract-to>
    </document>
  </documents>
</config>
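If you would rather write the matching logic yourself in a custom parse filter, the idea behind the two selectors (#abc matches by id, .efg matches by class) can be sketched in plain Java. This is only a minimal sketch under stated assumptions, not the plugin's implementation and not Nutch's API: the class name DivTextExtractor and the method extract are hypothetical, and the sample is parsed as well-formed XML with the JDK's own parser. Real pages are rarely well-formed, so inside Nutch you would apply the same tree walk to whatever DOM your HTML parser gives you.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch or the plugin.
class DivTextExtractor {

    /**
     * Returns the trimmed text of every element whose id equals {@code id}
     * or whose class attribute contains {@code cls} as one of its tokens.
     * Assumes the input is well-formed markup with a single root element.
     */
    static List<String> extract(String wellFormedMarkup, String id, String cls) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), id, cls, out);
        return out;
    }

    // Depth-first walk in document order.
    private static void collect(Node node, String id, String cls, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            boolean idMatches = id.equals(el.getAttribute("id"));
            boolean classMatches = Arrays
                    .asList(el.getAttribute("class").trim().split("\\s+"))
                    .contains(cls);
            if (idMatches || classMatches) {
                out.add(el.getTextContent().trim());
                return; // nested matches are already included in this text
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), id, cls, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "<div class=\"other\"> ignored </div>"
                + "</body></html>";
        for (String text : extract(page, "abc", "efg")) {
            System.out.println(text);
        }
    }
}
```

Run against the sample above, this prints the text of the two matching div elements and skips the third. A CSS-selector library would replace the hand-written walk, which is essentially what the plugin's css engine does for you through configuration.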