lucene索引html文件 [英] lucene indexing of html files

查看:194
本文介绍了lucene索引html文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

尊敬的用户:我正在研究apache lucene的索引和搜索功能. 我必须索引存储在计算机本地光盘上的html文件.我必须对HTML文件的文件名和内容建立索引.我能够将文件名存储在lucene索引中,但不能存储html文件内容,该文件不仅应索引数据,而且还应索引由图像链接和url组成的整个页面,我如何从这些索引文件访问内容 用于索引我正在使用以下代码:

Dear Users I am working on apache lucene for indexing and searching . I have to index html files stored on the local disc of computer . I have to make indexing on filename and contents of the html files . I am able to store the file names in the lucene index but not the html file contents which should index not only the data but the entire page consisting images link and url and how can i access the contents from those indexed files for indexing i am using the following code:

    File indexDir = new File(indexpath);
    File dataDir = new File(datapath);
    String suffix = ".htm";
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    indexDirectory(indexWriter, dataDir, suffix);

    numIndexed = indexWriter.maxDoc();
    indexWriter.optimize();
    indexWriter.close();


private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
    try {
        for (File f : dataDir.listFiles()) {
            if (f.isDirectory()) {
                indexDirectory(indexWriter, f, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, f, suffix);
            }
        }
    } catch (Exception ex) {
        System.out.println("exception 2 is" + ex);
    }
}

private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
    String suffix) throws IOException {
    try {
        if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
            return;
        }
        if (suffix != null && !f.getName().endsWith(suffix)) {
            return;
        }
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("filename", f.getFileName(),
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    } catch (Exception ex) {
        System.out.println("exception 4 is" + ex);
    }
}

预先感谢

推荐答案

此行代码是不存储您的内容的原因:

This line of code is the reason why your contents is not being stored:

doc.add(new Field("contents", new FileReader(f)));

此方法不存储要编制索引的内容.

This method DOES NOT STORE the contents being indexed.

如果您试图索引HTML文件,请尝试使用 JTidy .它将使该过程更加容易.

If you are trying to index HTML files, try using JTidy. It will make the process much easier.

示例代码:

public class JTidyHTMLHandler {

    public org.apache.lucene.document.Document getDocument(InputStream is) throws DocumentHandlerException {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        org.w3c.dom.Document root = tidy.parseDOM(is, null);
        Element rawDoc = root.getDocumentElement();

        org.apache.lucene.document.Document doc =
                new org.apache.lucene.document.Document();

        String body = getBody(rawDoc);

        if ((body != null) && (!body.equals(""))) {
            doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));
        }

        return doc;
    }

    protected String getTitle(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String title = "";

        NodeList children = rawDoc.getElementsByTagName("title");
        if (children.getLength() > 0) {
            Element titleElement = ((Element) children.item(0));
            Text text = (Text) titleElement.getFirstChild();
            if (text != null) {
                title = text.getData();
            }
        }
        return title;
    }

    protected String getBody(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String body = "";
        NodeList children = rawDoc.getElementsByTagName("body");
        if (children.getLength() > 0) {
            body = getText(children.item(0));
        }
        return body;
    }

    protected String getText(Node node) {
        NodeList children = node.getChildNodes();
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE:
                    sb.append(getText(child));
                    sb.append(" ");
                    break;
                case Node.TEXT_NODE:
                    sb.append(((Text) child).getData());
                    break;
            }
        }
        return sb.toString();
    }
}

要从URL获取InputStream:

To get an InputStream from a URL:

URL url = new URL(htmlURLlocation);
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();

要从文件获取InputStream:

To get an InputStream from a File:

InputStream stream = new FileInputStream(new File (htmlFile));

这篇关于lucene索引html文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆