Retrieving content from a PDF

Question

Hi guys,

I am doing a university project based on "Extracting Information from PDF".

My idea is to:

1. Search for and find the correct PDF using the text I input.
e.g. there are lots of jumbled PDFs on the hard disk and I want to select the PDFs about "Artificial Intelligence".

2. My search query is also "Artificial Intelligence", and I need to extract the content about artificial intelligence from inside the PDF.

3. Finally, the content relevant to my input query is displayed in the interface.

Can anyone help me sort this out, including help with the coding?

Would hOOt, the full-text search engine, help me with indexing?

I am looking forward to your replies.

Student

Recommended answer

You will need to create a full-text search over the PDF files.
I have added the code below; I hope it makes sense.
At step 4, you need to run the query with your own search text.

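Step 1: Create Full Text Catalog

-- Enable full-text indexing for the current database.
EXEC sp_fulltext_database 'enable'
GO

-- Create the catalog only if it does not exist yet.
-- (In current SQL Server versions sp_fulltext_catalog is deprecated;
-- CREATE FULLTEXT CATALOG is the documented replacement.)
IF NOT EXISTS ( SELECT * FROM sys.fulltext_catalogs
            WHERE name = 'Ducuments_Catalog' )
BEGIN
    EXEC sp_fulltext_catalog 'Ducuments_Catalog' , 'create' ;
END
GO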


Step 2: Create a Table

CREATE TABLE [dbo].[T_Document](
    [ID] [bigint] IDENTITY(1,1) NOT NULL,
    [FileName] [varchar](100) NULL,
    [FileType] [varchar](50) NULL,
    [Content] [varbinary](max) NULL,
CONSTRAINT [PK_T_Document] PRIMARY KEY CLUSTERED
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

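Step 3: Create Full Text Index on Table Columns

-- FileType must hold the file extension (for example '.pdf') so that
-- SQL Server can pick the matching filter for the binary Content column.
-- A PDF filter (IFilter) has to be installed on the server for this to work.
CREATE FULLTEXT INDEX ON [dbo].[T_Document]
(
    [Content] TYPE COLUMN [FileType] LANGUAGE 1033,
    [FileName] LANGUAGE 1033
)
KEY INDEX PK_T_Document
ON Ducuments_Catalog WITH CHANGE_TRACKING AUTO;
GO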


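Step 4: Run the Query

-- Replace 'Borrower Name' with your own search text.
-- FREETEXT matches on the meaning of the words, including inflected forms.
SELECT * FROM T_Document WHERE FREETEXT (Content, 'Borrower Name')

-- CONTAINS with a double-quoted string matches the exact phrase.
SELECT * FROM T_Document WHERE CONTAINS (Content, '"Borrower Name"')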


First off, no one here is going to write your university project for you, and neither should they. It's supposed to be about what you've learned and what you can do.

However, here are some pointers.

You're going to need code that can read a PDF and parse the format, so that you get the plain text separated from all the other stuff that's in a PDF.

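All searching these days is based on some form of indexing. It sounds like you're going to want full-text search, so you'll want to investigate the kind of techniques that Google uses. I gather one of their key storage systems is called Bigtable. I can't imagine why :-) The data structure to read up on for full-text search itself is the inverted index, which maps each word to the documents that contain it.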

You need to consider how to build the index (all at once or incrementally), how to keep it up to date when PDFs are added and deleted, or whether to throw it away and rebuild it for each search.
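To make the incremental option concrete, here is a minimal Python sketch of an in-memory inverted index that can be updated as documents come and go. It is only an illustration: the class name `InvertedIndex`, the `tokenize` helper and the file names are hypothetical, the text is assumed to have already been extracted from each PDF, and a real project would also have to persist the index to disk.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each word to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> {doc_id, ...}
        self.doc_words = {}               # doc_id -> words in that document

    def add(self, doc_id, text):
        """Index a document, replacing any earlier version of it."""
        self.remove(doc_id)
        words = set(tokenize(text))
        self.doc_words[doc_id] = words
        for word in words:
            self.postings[word].add(doc_id)

    def remove(self, doc_id):
        """Drop a document from the index; unknown ids are ignored."""
        for word in self.doc_words.pop(doc_id, set()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def search(self, query):
        """Return the ids of documents containing every word of the query."""
        words = tokenize(query)
        if not words:
            return set()
        result = None
        for word in words:
            docs = self.postings.get(word, set())
            result = set(docs) if result is None else result & docs
        return result


# Hypothetical usage with text that was already extracted from two PDFs.
index = InvertedIndex()
index.add("ai_notes.pdf", "Artificial intelligence studies intelligent agents.")
index.add("cooking.pdf", "Slow cooking with no artificial flavours.")
print(index.search("artificial intelligence"))  # only ai_notes.pdf has both words
index.remove("ai_notes.pdf")                    # the PDF was deleted from disk
print(index.search("artificial intelligence"))  # now an empty set
```

Keeping a per-document word set (`doc_words`) is what makes deletion cheap: removing a PDF touches only the words it contained instead of scanning the whole index.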

You need to decide what to allow in the user-entered query: a single word with no punctuation, multiple words, an exact phrase for matching, search codes like +intelligence -Einstein, or even full regular expressions.
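As one possible reading of those options, the Python sketch below parses a query that mixes plain words, +required and -excluded words, and double-quoted exact phrases, then checks it against a piece of text. The names `Query`, `parse_query` and `matches` are made up for this illustration, and the rules (plain words are treated as required, matching is case-insensitive) are assumptions rather than a standard.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Query:
    required: list = field(default_factory=list)  # plain or +prefixed words
    excluded: list = field(default_factory=list)  # -prefixed words
    phrases: list = field(default_factory=list)   # "double-quoted" phrases


# Either a quoted phrase, or a word with an optional + or - in front.
TOKEN = re.compile(r'"([^"]+)"|([+-]?)(\w+)')


def parse_query(text):
    """Split the raw query string into required words, excluded words and phrases."""
    query = Query()
    for phrase, sign, word in TOKEN.findall(text):
        if phrase:
            # Normalise the phrase the same way document text is normalised.
            normalized = " ".join(re.findall(r"\w+", phrase.lower()))
            if normalized:
                query.phrases.append(normalized)
        elif sign == "-":
            query.excluded.append(word.lower())
        else:
            query.required.append(word.lower())
    return query


def matches(query, text):
    """Return True if the text satisfies every part of the parsed query."""
    words = re.findall(r"\w+", text.lower())
    word_set = set(words)
    padded = " " + " ".join(words) + " "  # padding avoids partial-word hits
    if any(word not in word_set for word in query.required):
        return False
    if any(word in word_set for word in query.excluded):
        return False
    return all(f" {phrase} " in padded for phrase in query.phrases)


# Hypothetical usage.
q = parse_query('+intelligence -Einstein "neural network"')
print(matches(q, "A neural network is one approach to machine intelligence."))
```

Because the sketch treats any `-` directly before a word as an exclusion, a hyphenated term such as state-of-the-art would be misread; a fuller parser would only honour the sign at the start of a whitespace-separated token.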

Once you have all of this 'specification' stuff nailed down, and all the low-level technologies (like actually reading a PDF file) working in test cases, you're ready to write your application and also to write it up properly.

I don't know what it's like where you are, but in my day most of the credit for university programming projects was for the write-up. As long as the program worked, they didn't dig much deeper into usability criteria or the quality of the source code, but skipped straight to the documents.

If you get stuck with the code parts, then by all means post more questions. CP is a great source.

