What program could let me quickly search through 20,000 pages of text across 700 PDFs?


Problem description


Hello! I am a researcher with a collection of about 20,000 pages across 700+ PDFs. The PDFs are searchable at a rudimentary level, but I need a tool (a crawler?) that can search through all of them quickly. It would also be great if the software could filter out noise, visualize results, and aggregate data. Any suggestions for existing software, or for where to get something custom built?

What I have tried:

I've looked into OCR, but it doesn't seem to be quite what I'm looking for. I want something more like Kibana.

Recommended answer

You're talking about "ETL" (extract, transform, load).

You're still only at the "extract" phase; the rest (filter, aggregate, visualize) comes only after that.

You need to be more specific about the "content".

A simple text scanner can take a few minutes to develop, and even less time to run.

(pdfs can contain "text")
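To give a rough idea of what such a scanner could look like, here is a minimal Python sketch. It assumes the third-party `pypdf` package is installed and that the PDFs already carry a text layer; the function names (`extract_pages`, `load_folder`, `scan_pages`) and the folder name in the usage note are made up for illustration, not taken from any existing tool.

```python
import re
from pathlib import Path


def extract_pages(pdf_path):
    """Return the text of each page of one PDF (assumes pypdf is installed)."""
    # Third-party import kept local so the search code below runs without it.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # Scanned, image-only pages may yield no text at all; those would need OCR.
    return [page.extract_text() or "" for page in reader.pages]


def load_folder(folder):
    """Extract every PDF in a folder once, so repeated searches stay fast."""
    return {p.name: extract_pages(p) for p in sorted(Path(folder).glob("*.pdf"))}


def scan_pages(pages_by_file, pattern, context=40):
    """Yield (filename, page_number, snippet) for every regex match."""
    regex = re.compile(pattern, re.IGNORECASE)
    for filename, pages in pages_by_file.items():
        for page_number, text in enumerate(pages, start=1):
            for match in regex.finditer(text):
                start = max(match.start() - context, 0)
                end = min(match.end() + context, len(text))
                # Collapse whitespace so the snippet prints on one line.
                snippet = " ".join(text[start:end].split())
                yield filename, page_number, snippet
```

Something like `pages = load_folder("my_pdfs")` followed by `for hit in scan_pages(pages, r"kibana"): print(hit)` would list every match with its file and page number. Extraction is the slow part, so it is done once up front; each search afterwards only touches in-memory strings.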

http://www.antlr.org/

Once you've gotten at the (correct) raw data, you can start transforming / filtering.
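As a hedged illustration of that later step, the sketch below counts search terms per file, drops low-count noise, and totals the results. It assumes the page text has already been extracted into a dict mapping each filename to a list of page strings; that layout and the names `count_terms`, `filter_noise`, and `totals` are assumptions made for this example.

```python
import re
from collections import Counter


def count_terms(pages_by_file, terms):
    """Count whole-word, case-insensitive occurrences of each term per file."""
    patterns = {
        t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms
    }
    counts = {}
    for filename, pages in pages_by_file.items():
        per_file = Counter()
        for text in pages:
            for term, regex in patterns.items():
                per_file[term] += len(regex.findall(text))
        counts[filename] = per_file
    return counts


def filter_noise(counts, min_hits=2):
    """Keep only the (file, term) pairs with at least min_hits matches."""
    return {
        filename: {t: n for t, n in per_file.items() if n >= min_hits}
        for filename, per_file in counts.items()
    }


def totals(counts):
    """Aggregate the per-file counts into one corpus-wide Counter."""
    total = Counter()
    for per_file in counts.values():
        total.update(per_file)
    return total
```

The resulting dictionaries are plain data, so they can be written out as CSV or JSON and handed to whatever charting or dashboard tool you prefer for the visualization part.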

