How would you get a count of a given word in a given PDF?

Views: 112 Posted: 2020/5/25 4:32:49 Tag: pdf

Question

Interview question

I was asked this question in an interview, and the answer doesn't have to be specific to a programming language, platform, or tool.

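The question was worded as follows:

How would you get the count of instances of a given word in a PDF? The answer doesn't have to be programming-, platform-, or tool-specific. Please let me know how you would do this in a memory- and speed-efficient way.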

I am posting this question for the following reasons:

To better understand the context - I still fail to understand the context of this question. What might the interviewer be looking for by asking it?
To get diverse opinions - I tend to answer such questions based on my skills in one programming language (C#), but there might be other valid ways to get this done.

Thank you for your interest.

Recommended answer

If I had to write a program to do it, I'd find a PDF rendering library capable of extracting text from PDF files, such as Xpdf, and then count the words. If this were a one-off task or something that needed to be automated for a non-production-quality task, I'd just feed the file into the pdftotext program and then parse the output file with Python, splitting it into words, putting them in a dictionary and counting the number of occurrences.
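The one-off approach could be sketched roughly as follows. This is a minimal sketch, not a production implementation: it assumes pdftotext is installed and on the PATH, and it assumes a "word" is any run of letters, digits or underscores, compared case-insensitively. The function names are made up for illustration.

```python
import re
import subprocess
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lower-cased word in text to its count."""
    # Assumption: a word is a run of \w characters, so "don't" counts as "don" and "t".
    return Counter(re.findall(r"\w+", text.lower()))


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text."""
    # Passing "-" as the output file asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_word_in_pdf(pdf_path, word):
    """Count occurrences of word in the text extracted from pdf_path."""
    return count_words(extract_text(pdf_path))[word.lower()]
```

Calling `count_word_in_pdf("report.pdf", "invoice")` (a hypothetical file name) would return the number of times that word appears in the extracted text. Words hyphenated across line breaks in the output would be counted as two fragments, which is one of the details worth raising in an interview discussion.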

If I were asking this interview question, I'd be looking for a couple of things:

understanding the difference between the possible settings for this task: a one-off script vs production code
not attempting to implement a PDF renderer yourself, and trying to find a library instead.

Now, I wouldn't expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what a PDF is and what a "word" is. You see, PDF stores text as a bunch of strings with coordinates. Each string is not necessarily a word. Oftentimes, a word will be split into a couple of completely separate strings that are absolutely positioned in the document to make up a single word. This is why searching for words in a PDF document sometimes gives strange-looking results. So to implement word searching in a document, you'd have to glue these strings back together (pdftotext takes care of that for you).
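To make the "gluing" idea concrete, here is a heavily simplified sketch. The fragment layout is hypothetical: each fragment is an `(x, y, width, text)` tuple, with y assumed to be measured from the top of the page, and the two tolerance values are arbitrary guesses. Real PDF text extraction also has to deal with fonts, encodings, rotation and transforms, which this ignores.

```python
def glue_fragments(fragments, gap_tolerance=1.0, line_tolerance=2.0):
    """Join positioned text fragments into lines of text.

    fragments: iterable of (x, y, width, text) tuples (hypothetical model).
    Fragments on the same line that touch are joined directly; a horizontal
    gap wider than gap_tolerance is treated as a space between words.
    """
    # Group fragments into lines by similar y coordinate.
    lines = []  # each entry is [y_of_first_fragment, [fragments]]
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1][0] - frag[1]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f[0])  # left to right within the line
        pieces = []
        prev_end = None
        for x, _, width, text in frags:
            if prev_end is not None and x - prev_end > gap_tolerance:
                pieces.append(" ")
            pieces.append(text)
            prev_end = x + width
        result.append("".join(pieces))
    return result


# "Hello" is stored as two separate strings that sit next to each other.
sample = [
    (45, 100, 25, "world"),
    (10, 100, 20, "Hel"),
    (30, 100, 10, "lo"),
    (10, 120, 15, "PDF"),
]
print(glue_fragments(sample))
```

Once the fragments are glued into lines, counting a word reduces to ordinary text processing on the result.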

It's not a bad question at all.
