What program could let me quickly search through 20000 pages of text across 700 pdfs?
Problem description
Hello! I am a researcher and I have a database of about 20,000 pages across 700+ pdfs. The pdfs are searchable on a rudimentary level, but I'd need a coding tool (crawler?) that could quickly search through them. Additionally, it'd be great if the software could filter out noise, visualize results and aggregate data. Any suggestions of pre-existing software, or where to get something custom?
What I have tried:
I've looked into OCR but it doesn't seem like quite what I'm looking for? I'm wanting something more like Kibana.
Recommended answer
You're talking "ETL" (extract; transform; load).
You're still only at the "extract" phase; the rest (filter, aggregate, visualize) only comes "after".
You need to be more specific about the "content".
A "simple" "text" scanner can take a few minutes to develop; and even less to run.
(pdfs can contain "text")
http://www.antlr.org/
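As a rough illustration of the "simple" text scanner mentioned above, here is a minimal Python sketch. It assumes the text has already been pulled out of each pdf into a plain .txt file (for example with a tool such as pdftotext, which separates pages with a form feed character). The function names `search_text` and `search_directory` and the one-.txt-per-pdf layout are assumptions made for this example, not something from the original answer.

```python
import re
from pathlib import Path

# Assumption: pages are separated by a form feed, as in pdftotext output.
PAGE_BREAK = "\f"


def search_text(text, pattern, flags=re.IGNORECASE):
    """Return (page_number, line) pairs for every line that matches pattern."""
    regex = re.compile(pattern, flags)
    hits = []
    for page_number, page in enumerate(text.split(PAGE_BREAK), start=1):
        for line in page.splitlines():
            if regex.search(line):
                hits.append((page_number, line.strip()))
    return hits


def search_directory(directory, pattern):
    """Scan every .txt file in directory and yield (file_name, page, line)."""
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for page_number, line in search_text(text, pattern):
            yield path.name, page_number, line
```

Calling `list(search_directory("extracted_text", r"some keyword"))` would then list every matching line together with its file and page, where `extracted_text` is a hypothetical folder holding the extracted files.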
Once you've gotten at the (correct) "raw" data, you can start "transforming" / filtering.
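To give an idea of what the filter and aggregate steps could look like once the raw hits exist, here is a small sketch. It assumes each hit is a (file_name, page_number, line) tuple; the helper names and the length-based noise rule are hypothetical choices for this example, and real noise filtering would depend on the actual content.

```python
from collections import Counter


def filter_hits(hits, min_length=20):
    """Drop very short lines, a crude stand-in for noise such as
    running headers or stray page numbers (assumed rule)."""
    return [hit for hit in hits if len(hit[2]) >= min_length]


def count_by_file(hits):
    """Aggregate: count how many hits each file contributed."""
    return Counter(file_name for file_name, _page, _line in hits)
```

`count_by_file(filter_hits(hits)).most_common(10)` would give the ten files with the most matches, which could then be fed to whatever charting tool is used for the visualization step.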