Lucene crawler (it needs to build Lucene index)


Question

I am looking for an Apache Lucene web crawler, written in Java if possible or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is why Nutch, for example, is ruled out...

Does anybody know whether such a web crawler exists, and if so, where can I find it? Thanks...

Answer

What you're asking for is two components:

  1. A web crawler
  2. A Lucene-based automated indexer

First, a word of encouragement: been there, done that. I'll tackle both components individually from the point of view of making your own, since I don't believe you could use Lucene to do what you've requested without really understanding what's going on underneath.

So you have a web site/directory you want to "crawl" through to collect specific resources. Assuming it's a common web server that lists directory contents, making a web crawler is easy: just point it to the root of the directory and define rules for collecting the actual files, such as "ends with .txt". Very simple stuff, really.

The actual implementation could be something like this: use HttpClient to fetch the actual web pages/directory listings, and parse them in whatever way you find most efficient, such as using XPath to select all the links from the fetched document, or simply parsing it with regex using Java's readily available Pattern and Matcher classes. If you decide to go the XPath route, consider using JDOM for DOM handling and Jaxen for the actual XPath.
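As a rough illustration only, here is a minimal sketch of that fetch-and-parse step, assuming Apache HttpClient 4.x and the plain regex route; the class and method names (SimpleCrawler, fetchLinks) are made up for this example, and the href pattern is deliberately naive:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical crawler skeleton: fetch a directory listing and pull out links
// that match a simple rule such as "ends with .txt".
public class SimpleCrawler {

    // Naive href extractor; a real crawler would handle relative URLs, encodings, etc.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public List<String> fetchLinks(String url, String suffix) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String html = EntityUtils.toString(
                    client.execute(new HttpGet(url)).getEntity());

            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (link.endsWith(suffix)) {   // the "ends with .txt" rule
                    links.add(link);
                }
            }
            return links;
        }
    }
}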

Once you get the actual resources you want, such as a bunch of text files, you need to identify the type of data so you know what to index and what you can safely ignore. For simplicity's sake I'm assuming these are plaintext files without any fields, and I won't go deeper into that; but if you have multiple fields to store, I suggest you make your crawler produce 1..n specialized beans with accessors and mutators (bonus points: make the bean immutable, don't allow accessors to mutate the internal state of the bean, and create a copy constructor for the bean) to be used in the other component.
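For example, an immutable bean along those lines (YourBean and its two fields are placeholders chosen for this sketch, not part of any existing API) might look like this:

// Hypothetical immutable bean carrying one crawled document.
public final class YourBean {

    private final String url;
    private final String content;

    public YourBean(String url, String content) {
        this.url = url;
        this.content = content;
    }

    // Copy constructor, as suggested above.
    public YourBean(YourBean other) {
        this(other.url, other.content);
    }

    // Accessors only; no mutators, so the internal state cannot be changed.
    public String getUrl()     { return url; }
    public String getContent() { return content; }
}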

In terms of API calls, you should have something like HttpCrawler#getDocuments(String url) which returns a List<YourBean> to use in conjunction with the actual indexer.
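Expressed as code, that contract could be a small interface; again, HttpCrawler is an illustrative name from this answer, not an existing library type:

import java.io.IOException;
import java.util.List;

// Hypothetical crawler contract: fetch everything under a URL and
// return it as beans ready to be handed to the indexer.
public interface HttpCrawler {
    List<YourBean> getDocuments(String url) throws IOException;
}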

Beyond the obvious stuff with Lucene, such as setting up a directory and understanding its threading model (only one write operation is allowed at any time, while multiple reads can exist even when the index is being updated), you of course want to feed your beans into the index. The five-minute tutorial I already linked to basically does exactly that; look into the example addDoc(..) method and just replace the String with YourBean.
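Mirroring that addDoc(..) idea with YourBean in place of the String, a sketch could look like the following; the field names are arbitrary, and it assumes a Lucene release where TextField and StringField exist (4.0 or later):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

import java.io.IOException;

public class BeanIndexer {

    // Turns one crawled bean into a Lucene Document and adds it to the index.
    public void addDoc(IndexWriter writer, YourBean bean) throws IOException {
        Document doc = new Document();
        // Store the URL as-is so it can be shown in search results.
        doc.add(new StringField("url", bean.getUrl(), Field.Store.YES));
        // Analyze the content so it becomes searchable full text.
        doc.add(new TextField("content", bean.getContent(), Field.Store.YES));
        writer.addDocument(doc);
    }
}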

Note that Lucene IndexWriter does have some cleanup methods which are handy to execute in a controlled manner. For example, calling IndexWriter#commit() only after a bunch of documents have been added to the index is good for performance, and calling IndexWriter#optimize() to make sure the index isn't getting hugely bloated over time is a good idea too. Always remember to close the index as well, to avoid unnecessary LockObtainFailedExceptions being thrown; as with all IO in Java, such an operation should of course be done in the finally block.
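That writer lifecycle (batching, committing once, and closing in a finally block) might be wired up roughly as below, reusing the YourBean and BeanIndexer sketches from above. This assumes Lucene 5.x or later for the FSDirectory.open(Path) and IndexWriterConfig(Analyzer) signatures; note that IndexWriter#optimize() exists only in older Lucene releases and was later replaced by forceMerge, so it is omitted here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.List;

public class IndexingJob {

    public void indexAll(String indexDir, List<YourBean> beans) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get(indexDir)),
                new IndexWriterConfig(new StandardAnalyzer()));
        try {
            BeanIndexer indexer = new BeanIndexer();
            for (YourBean bean : beans) {
                indexer.addDoc(writer, bean);
            }
            // Commit once after the whole batch rather than per document.
            writer.commit();
        } finally {
            // Closing releases the write lock and avoids LockObtainFailedException
            // the next time a writer is opened on the same directory.
            writer.close();
        }
    }
}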

  • You need to remember to expire your Lucene index's contents every now and then too; otherwise you'll never remove anything, and it will get bloated and eventually just die because of its own internal complexity.
  • Because of the threading model, you most likely need to create a separate read/write abstraction layer for the index itself to ensure that only one instance can write to the index at any given time.
  • Since the source data acquisition is done over HTTP, you need to consider validation of the data and possible error situations, such as the server not being available, to avoid any kind of malformed indexing and client hangups.
  • You need to know what you want to search from the index to be able to decide what you are going to put into it. Note that indexing by date must be done so that you split the date into, say, year, month, day, hour, minute and second instead of a millisecond value, because when doing range queries against a Lucene index, [0 to 5] actually gets transformed into +0 +1 +2 +3 +4 +5, which means the range query dies out very quickly because there is a maximum number of query sub-parts (see the sketch after this list).
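As for the date-splitting point in the last bullet, one hedged way to do it is to store each component as its own numeric field; the field names are arbitrary, and IntPoint is the numeric field type in Lucene 6+, so older versions would need a different field class:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;

import java.time.LocalDateTime;

public class DateFields {

    // Splits a timestamp into separate fields instead of indexing one
    // millisecond value, so range queries stay small.
    public static void addDateFields(Document doc, LocalDateTime when) {
        doc.add(new IntPoint("year",   when.getYear()));
        doc.add(new IntPoint("month",  when.getMonthValue()));
        doc.add(new IntPoint("day",    when.getDayOfMonth()));
        doc.add(new IntPoint("hour",   when.getHour()));
        doc.add(new IntPoint("minute", when.getMinute()));
        doc.add(new IntPoint("second", when.getSecond()));
    }
}

With fields like these, a query such as "days 1 to 5" can be expressed with IntPoint.newRangeQuery("day", 1, 5) instead of a huge range over raw millisecond values.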

With this information, I do believe you could make your own special Lucene indexer in less than a day, or three if you want to test it rigorously.

