Oracle Text - Index a BLOB Field (which contains PDF data)

Question

Does anyone have experience using Oracle Text to search the content of PDF files?

I have a table with a BLOB column called FILEDATA.

I would like to run the following query:

SELECT id FROM ttc.contract_attachment WHERE CONTAINS(filedata, 'EXAMPLE') > 0;

However, I'm not sure what type of index to add to it.

I found the following code:

begin 
  ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER'); 
  ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-'); 
end; 
/ 

create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context
  parameters ('lexer doc_lexer sync (on commit)');

Reference: http://www.devx.com/dbzone/Article/21563/1954

I have no idea what BASIC_LEXER is, and I'm at a bit of a loss. I'll keep searching for an answer, but any help would be great.

Thanks.

Recommended Answer

I've used Oracle Text to index not only PDFs but also other data, such as XML structures. Oracle Text has the concept of lexers, which take content, parse and tokenize it, and index the resulting tokens. BASIC_LEXER handles English words; there are other lexers for Chinese, Japanese, Korean, etc. The printjoins attribute lets you index characters that are normally excluded, such as hyphens, quotes, etc., as part of a token, so with the '_-' setting above a hyphenated word is indexed as a single token.
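To make the effect of printjoins concrete, here is a small sketch against the index from the question. It assumes the doc_lexer preference shown earlier is in place and that some document contains the hypothetical token ABC-123; neither the token nor any matching row comes from the original post.

```sql
-- Assumes the index was built with "lexer doc_lexer" (printjoins '_-'),
-- so the hypothetical token ABC-123 is stored as one token.
-- The braces escape the hyphen, which CONTAINS would otherwise
-- parse as the MINUS operator.
SELECT COUNT(*)
FROM   CMDEMO.CONTENT_INVENTORY
WHERE  CONTAINS(text, '{ABC-123}') > 0;
```

Without the printjoins setting, the same text would typically be split into the two tokens ABC and 123, and the escaped term would behave like a phrase search for those two words.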

The index you have defined above will work. Keep in mind that Oracle Text indexing is asynchronous by default: the commit occurs, and the document is indexed only when the index is next synchronized, so you would normally need to synchronize the index as part of a scheduled job or the like. With the option "sync (on commit)" on your index, the document is instead indexed as soon as the transaction commits, and the commit does not return until that synchronization finishes. The extra commit time is noteworthy only if you are indexing sizable PDF documents.
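If you leave out "sync (on commit)", the scheduled synchronization mentioned above could look roughly like the following sketch. The job name and the five-minute interval are assumptions chosen for illustration, and the account running it needs the CREATE JOB privilege and execute rights on CTX_DDL.

```sql
-- Sketch: synchronize the index every 5 minutes when it was created
-- WITHOUT sync (on commit). The job name is hypothetical.
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'SYNC_CONTENT_TEXT_IDX',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN ctx_ddl.sync_index(''idxContentMgmtBinary''); END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
    enabled         => TRUE);
END;
/

-- Rows committed but not yet indexed show up here until the next sync.
SELECT pnd_index_name, COUNT(*) AS pending_rows
FROM   ctx_user_pending
GROUP  BY pnd_index_name;
```

Until a sync runs, newly committed documents are simply not returned by CONTAINS queries, which is usually the first symptom people notice.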

I would recommend using progressive relaxation for any search you may want to run: it begins with a restrictive search and expands to progressively more general ones, so the user gets results in decreasing order of relevance. For instance:

<query>
  <textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
    <progression>
      <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
      <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
      <seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
    </progression>
  </textquery>
  <score datatype="INTEGER" algorithm="COUNT"/>
</query>

The above query tokenizes the search keywords "cat dog" and first attempts to find them as a phrase, then any documents containing cat AND dog (not necessarily beside each other), then any documents containing cat or dog (ACCUM), where documents containing both words score higher than documents containing just one. Furthermore, the structure automatically de-duplicates the results as it returns them.
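A query template like this is passed as the query string of CONTAINS. The sketch below runs it against the table from the question; it assumes a CONTEXT index already exists on filedata, and the label 1 is only there to tie SCORE to this CONTAINS call.

```sql
-- Sketch: progressive relaxation against the asker's table.
-- Assumes a CONTEXT index exists on ttc.contract_attachment(filedata).
SELECT id, SCORE(1) AS relevance
FROM   ttc.contract_attachment
WHERE  CONTAINS(filedata,
  '<query>
     <textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
       <progression>
         <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
         <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
         <seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
       </progression>
     </textquery>
     <score datatype="INTEGER" algorithm="COUNT"/>
   </query>', 1) > 0
ORDER  BY SCORE(1) DESC;
```

In application code the keywords would normally come in through a bind variable holding the whole template, rather than being concatenated into the SQL text.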

All of that being said, you could simply define your index as:

create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context
  parameters ('sync (on commit)');

and it would probably work very well for your needs. You would only need to change the lexer's behavior if you have a specific need to do so. I hope this helps.
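Applied to the table from the question, that might look like the sketch below. The index name is hypothetical, and the explicit filter clause is an assumption added to make the PDF handling visible: Oracle Text normally selects a document filter for binary columns on its own, and CTXSYS.AUTO_FILTER is the predefined preference in recent releases (older releases used INSO_FILTER instead).

```sql
-- Sketch: CONTEXT index on the BLOB column holding the PDFs.
-- contract_attachment_ctx is a hypothetical index name.
CREATE INDEX contract_attachment_ctx
  ON ttc.contract_attachment (filedata)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter ctxsys.auto_filter sync (on commit)');

-- The query from the question then works as written.
SELECT id
FROM   ttc.contract_attachment
WHERE  CONTAINS(filedata, 'EXAMPLE') > 0;

-- Documents that could not be filtered or indexed are logged here.
SELECT err_textkey, err_text
FROM   ctx_user_index_errors
WHERE  err_index_name = 'CONTRACT_ATTACHMENT_CTX';
```

Checking the error view after the initial build is worthwhile, since encrypted or image-only PDFs tend to fail filtering without stopping index creation.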
