How would you get the count of a given word in a given PDF?

Question

Interview question

I was asked this question in an interview, and the answer doesn't have to be specific to a programming language, platform, or tool.

The question was worded as follows:

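How would you get the count of instances of a given word in a PDF? The answer doesn't necessarily have to be programming-, platform-, or tool-specific. Please let me know how you would do this in a memory- and speed-efficient way.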

I am posting this question for the following reasons:

  1. To better understand the context - I still fail to understand the context of this question. What might the interviewer be looking for by asking it?
  2. To get diverse opinions - I tend to answer such questions based on my skills in one programming language (C#), but there might be other valid ways to get this done.

Thanks for your attention.

Recommended answer

If I had to write a program to do it, I'd find a PDF rendering library capable of extracting text from PDF files, such as Xpdf, and then count the words. If this were a one-off task, or something that needed to be automated for a non-production-quality job, I'd just feed the file into the pdftotext program and then parse the output file with Python: split it into words, put them in a dictionary, and count the number of occurrences.
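
The one-off approach described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the `pdftotext` command-line tool is installed and on the PATH, and that `\w+` with case folding is an acceptable definition of a "word". The function names `count_words` and `count_word_in_pdf` are made up for illustration. Reading the tool's output line by line keeps memory use proportional to the number of distinct words rather than to the size of the document.

```python
import re
import subprocess
from collections import Counter
from typing import Iterable

# Assumption: a "word" is a run of letters, digits, or underscores.
WORD_RE = re.compile(r"\w+")


def count_words(lines: Iterable[str]) -> Counter:
    """Count case-insensitive word occurrences, one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


def count_word_in_pdf(pdf_path: str, word: str) -> int:
    """Run pdftotext on pdf_path and count occurrences of word.

    Assumes pdftotext is on the PATH; passing "-" as the output file
    makes it write the extracted text to standard output.
    """
    with subprocess.Popen(
        ["pdftotext", pdf_path, "-"],
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",
    ) as proc:
        # Stream the output instead of loading the whole text at once.
        counts = count_words(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError(f"pdftotext exited with status {proc.returncode}")
    return counts[word.lower()]
```

A call such as `count_word_in_pdf("report.pdf", "invoice")` (the file name is hypothetical) would return the count for that single word. Words hyphenated across line breaks are not rejoined here, so they would be counted as two separate tokens.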

If I were asking this interview question, I'd be looking for a couple of things:

  1. understanding the difference between the possible settings for this task: a one-off script vs. production code
  2. not attempting to implement a PDF renderer yourself, and trying to find a library instead.

Now, I wouldn't expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what a PDF is and what a "word" is. You see, PDF stores text as a bunch of strings with coordinates. Each string is not necessarily a word. Oftentimes, a word will be split into a couple of completely separate strings, each absolutely positioned in the document, that together make up a single word. This is why you sometimes get strange-looking results when searching for words in a PDF document. So to implement word search in a document, you'd have to glue these strings back together (pdftotext takes care of that for you).
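
To make the "gluing" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `Fragment` record, the `glue_fragments` function, and the two tolerance values are hypothetical, and it assumes each fragment's left edge, baseline, and width are already known, with y growing toward the top of the page. Real extractors such as pdftotext rely on font metrics and far more careful heuristics.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Fragment:
    """Hypothetical positioned string, as a PDF might store it."""
    x: float      # left edge of the string
    y: float      # baseline; larger values are higher on the page
    width: float  # rendered width of the string
    text: str


def glue_fragments(
    fragments: Iterable[Fragment],
    gap_tolerance: float = 1.0,
    line_tolerance: float = 2.0,
) -> List[str]:
    """Rebuild lines of text from positioned fragments.

    Fragments on roughly the same baseline form a line. Within a line,
    neighbours separated by at most gap_tolerance are joined directly;
    a larger gap is treated as a word boundary and becomes a space.
    """
    lines: List[List[Fragment]] = []
    # Walk from the top of the page down, grouping by baseline.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result: List[str] = []
    for line in lines:
        line.sort(key=lambda f: f.x)  # left-to-right reading order
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_tolerance:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result
```

With this model, a word stored as the two strings "Hel" and "lo" comes back as "Hello" as long as the second string starts where the first one ends. The rebuilt lines can then be fed into whatever word counter you use.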

It's not a bad question at all.
