Is it possible to create a website with a search engine that only searches the PDFs I upload?
Question
I am trying to create the user interface for an educational video library. The videos are hosted elsewhere, and I want to create a user-friendly site with an extensive search engine that covers only the content of the videos. At the moment I am manually tagging each video link with 20-30 keywords. However, I am hoping that if I can figure out how to use the PDF transcript of each video as searchable text, the tagging will be automatic and the search will be better. I know there are many OCR websites out there, but I haven't found any personal sites with custom OCR search engines. Is this possible?
Recommended answer
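OCR? Sounds like you need iTextSharp. OCR is only necessary when the PDFs are scanned images; if the transcripts contain real text, it can be extracted directly. Check out their SourceForge page and do some reading up on how to use it. Here's a simple snippet to get you started with extracting some text from a PDF file: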
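using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// The class name is a placeholder; the original snippet did not show one.
public class PdfTextReader
{
    // Returns the text of every page in the PDF, one page after another.
    public string ParsePdf(string fileName)
    {
        if (!File.Exists(fileName))
            throw new FileNotFoundException("PDF file not found.", fileName);

        using (PdfReader reader = new PdfReader(fileName))
        {
            StringBuilder sb = new StringBuilder();
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page; a reused one keeps the text of earlier pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    // .NET strings are already Unicode, so no re-encoding is needed.
                    // AppendLine keeps words on adjacent pages from running together.
                    sb.AppendLine(text);
                }
            }
            return sb.ToString();
        }
    }
}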
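Once the transcript text has been extracted, it still has to be made searchable. The following is only a minimal sketch of one way to do that, assuming the whole library fits in memory: a small inverted index that maps each word to the videos whose transcripts contain it. The `TranscriptIndex` class and its `Add` and `Search` methods are hypothetical names made up for this illustration; they are not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp: a tiny in-memory inverted index.
public class TranscriptIndex
{
    // Maps each lowercased word to the set of video IDs whose transcript contains it.
    private readonly Dictionary<string, HashSet<string>> _postings =
        new Dictionary<string, HashSet<string>>();

    // Splits text into lowercase word tokens made of letters and digits.
    private static IEnumerable<string> Tokenize(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{Nd}]+")
                    .Cast<Match>()
                    .Select(m => m.Value);
    }

    // Indexes one transcript under the given video ID (for example, the video's URL).
    public void Add(string videoId, string transcriptText)
    {
        foreach (string word in Tokenize(transcriptText))
        {
            HashSet<string> ids;
            if (!_postings.TryGetValue(word, out ids))
            {
                ids = new HashSet<string>();
                _postings[word] = ids;
            }
            ids.Add(videoId);
        }
    }

    // Returns the IDs of videos whose transcripts contain every word in the query.
    public HashSet<string> Search(string query)
    {
        HashSet<string> result = null;
        foreach (string word in Tokenize(query))
        {
            HashSet<string> ids;
            if (!_postings.TryGetValue(word, out ids))
                return new HashSet<string>(); // a query word appears in no transcript
            if (result == null)
                result = new HashSet<string>(ids);
            else
                result.IntersectWith(ids);
        }
        return result ?? new HashSet<string>();
    }
}
```

In use, each transcript would be added once with something like `index.Add(videoUrl, ParsePdf(pdfPath))`, and the site's search box would call `index.Search(userQuery)` to get the matching video links. This replaces the manual keyword tagging with every word in the transcript, but it does no ranking, stemming, or phrase matching, so treat it as a starting point rather than a finished search engine.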