Getting the word count in all formats


Question

How can I get the number of words in files of any format (.doc, .docx, .pdf, and images) using PHP and JavaScript?

Recommended answer

The simple answer is that you cannot do so. Getting the word count from .doc/.docx files will be tough without Word installed on the server. It gets even tougher when you try to get a word count out of an image, since you will need to perform OCR on the image first.

Using PHP and JavaScript to perform the word count on even one of these formats will be difficult. You will need to create a separate mechanism for each format you want to support.
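
To make "a separate mechanism for each format" concrete, here is a minimal Node.js sketch of how such a dispatch could be organised. The names `extractors`, `countWords`, and `wordCountFor` are hypothetical, and only a plain-text extractor is filled in. Every other format would still need its own real extractor.

```javascript
const path = require('path');

// Count whitespace-separated tokens in text that has already been extracted.
function countWords(text) {
  const tokens = text.trim().split(/\s+/);
  return tokens[0] === '' ? 0 : tokens.length;
}

// Hypothetical registry: one text extractor per supported file extension.
// Only plain text is implemented here; .pdf, .docx, .doc and image formats
// would each need a dedicated extractor of their own.
const extractors = {
  '.txt': (buffer) => buffer.toString('utf8'),
};

// Pick the extractor by extension and count the words in its output.
function wordCountFor(filename, buffer) {
  const ext = path.extname(filename).toLowerCase();
  const extract = extractors[ext];
  if (!extract) {
    throw new Error(`No text extractor registered for "${ext}"`);
  }
  return countWords(extract(buffer));
}
```

Unsupported formats fail loudly instead of returning a misleading count of zero.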


You can't.
For starters, image formats do not contain any words. Lots and lots of pixels, but no words.

Each format is different. Some are text based, others are XML based, others are binary.
There is nothing you can use to read all formats and get a word count, in PHP, JavaScript, VB, C# or Martian.


The only format I can give you any help with here is PDF, and for that you can extract text with XPdf. However, getting an accurate word count from some PDFs may be impossible, depending on how the program that created the file decides to format the output. Just because something appears as a word in a PDF viewer does not mean it was stored as a word in the document; PDF is a very complex format.
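
As a rough sketch of the XPdf route, the following Node.js snippet shells out to the `pdftotext` command-line tool that ships with Xpdf and counts whitespace-separated tokens in its output. It assumes `pdftotext` is installed and on the PATH. The function names `countWords` and `pdfWordCount` are made up for illustration, and the result is only as accurate as the extracted text.

```javascript
const { execFileSync } = require('child_process');

// Count whitespace-separated tokens in plain text.
function countWords(text) {
  const tokens = text.trim().split(/\s+/);
  return tokens[0] === '' ? 0 : tokens.length;
}

// Assumption: the `pdftotext` binary from Xpdf is available on the PATH.
function pdfWordCount(pdfPath) {
  // Passing "-" as the output file makes pdftotext write the text to stdout.
  const text = execFileSync('pdftotext', [pdfPath, '-'], {
    encoding: 'utf8',
    maxBuffer: 64 * 1024 * 1024, // allow large documents
  });
  return countWords(text);
}
```

Calling `pdfWordCount('report.pdf')` with a hypothetical file name would return the token count of whatever text `pdftotext` manages to recover, which for some PDFs will differ from what a viewer displays.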

As has been mentioned here, getting a word count from an image would require OCR, but I don't know enough about it to give you a recommendation. I do know, however, that once again you may be unable to get an accurate word count with OCR.

.docx documents are essentially a zipped collection of XML files, and shouldn't be too difficult to work with. But I don't know enough about the format to help beyond that.
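
For illustration only, here is a small sketch of counting words once the main XML part has been pulled out of the archive. It assumes the string passed in is the content of `word/document.xml`, which some zip library has already extracted. It relies on the WordprocessingML convention that visible text lives in `<w:t>` elements inside `<w:p>` paragraphs. The function name `docxXmlWordCount` is hypothetical. The sketch ignores tabs, line breaks, headers, footnotes, and XML entities, so treat the result as an approximation.

```javascript
// Assumption: `xml` is the already-unzipped content of word/document.xml.
function docxXmlWordCount(xml) {
  let count = 0;
  // Handle one paragraph at a time so words in different paragraphs never merge.
  for (const paragraph of xml.split(/<\/w:p>/)) {
    // Collect the text runs; a single word may be split across several runs,
    // so the runs are joined before counting.
    const runs = [...paragraph.matchAll(/<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g)]
      .map((match) => match[1]);
    const text = runs.join('').trim();
    if (text !== '') {
      count += text.split(/\s+/).length;
    }
  }
  return count;
}
```

Joining runs within a paragraph matters because Word often splits a single word into several runs when formatting or spell-check state changes partway through it.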

.doc documents are a different matter: they are not zip archives but a binary compound-file (OLE) container, and I don't know anything about the format of the data stored inside.

I think your best bet is to pick a single file type and stick with it.

