Number of words in a PDF document


Question

I'm using AcroAXPDF to view my PDF documents. How do I find the number of words in a PDF document?

Recommended answer

The following uses iTextSharp, available on NuGet:

https://www.nuget.org/packages/iTextSharp/

I placed all the code in a form, but if this works for you, consider moving it into a class and calling it from there. Note: if the PDF is large and contains a great deal of text, consider wrapping the call with Async/Await.

Imports System.IO
Imports System.Text
Imports iTextSharp.text.pdf.parser

Public Class Form1
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        ' Here the PDF is in the same folder as the executable
        Dim fileName = IO.Path.Combine(
            AppDomain.CurrentDomain.BaseDirectory, "MicrosoftTheSecurityDevelopmentLifecycle.pdf")
        Dim pdfContents As String = ExtractAllTextFromPdf(fileName)
        Dim wordCount As Integer = GetWordCountFromString(pdfContents)
        MessageBox.Show($"Word count {wordCount}")
    End Sub

    Public Function ExtractAllTextFromPdf(ByVal inputFile As String) As String
        ' Sanity checks
        If String.IsNullOrEmpty(inputFile) Then
            Throw New ArgumentNullException("inputFile")
        End If
        If Not File.Exists(inputFile) Then
            Throw New FileNotFoundException("Cannot find inputFile", inputFile)
        End If

        ' Create a stream reader (not necessary but I like to control locks and permissions)
        Using SR As New FileStream(inputFile, FileMode.Open, FileAccess.Read, FileShare.Read)
            ' Create a reader to read the PDF
            Dim reader As New iTextSharp.text.pdf.PdfReader(SR)

            ' Create a buffer to store text
            Dim sb As New StringBuilder()

            ' Use the PdfTextExtractor to get all of the text on a page-by-page basis
            For i As Integer = 1 To reader.NumberOfPages
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i))
            Next i

            Return sb.ToString()
        End Using
    End Function

    Public Function GetWordCountFromString(ByVal text As String) As Integer
        ' Sanity check
        If String.IsNullOrEmpty(text) Then
            Return 0
        End If

        ' Count the words: every run of non-whitespace characters counts as one word
        Return RegularExpressions.Regex.Matches(text, "\S+").Count
    End Function
End Class
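The Async/Await suggestion above could look roughly like the following. This is only a minimal sketch, not part of the original answer: `AsyncWordCountSketch`, `CountWords`, and `CountWordsAsync` are hypothetical names, and the extraction step is passed in as a delegate so the example runs without iTextSharp or a PDF file. `CountWords` applies the same `\S+` rule as `GetWordCountFromString`.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports System.Threading.Tasks

Module AsyncWordCountSketch
    ' Same counting rule as GetWordCountFromString: each run of
    ' non-whitespace characters is counted as one word.
    Function CountWords(text As String) As Integer
        If String.IsNullOrEmpty(text) Then Return 0
        Return Regex.Matches(text, "\S+").Count
    End Function

    ' Hypothetical wrapper: runs the (possibly slow) extraction delegate
    ' on a thread-pool thread, then counts the words in its result.
    Async Function CountWordsAsync(extract As Func(Of String)) As Task(Of Integer)
        Dim text As String = Await Task.Run(Of String)(extract)
        Return CountWords(text)
    End Function

    Sub Main()
        ' Stand-in for ExtractAllTextFromPdf(fileName); no PDF is read here.
        Dim count As Integer =
            CountWordsAsync(Function() "stand-in text for a PDF").GetAwaiter().GetResult()
        Console.WriteLine(count)
    End Sub
End Module
```

In the form, the click handler would be declared `Private Async Sub Button1_Click(...)` and would call `Await CountWordsAsync(Function() ExtractAllTextFromPdf(fileName))`, so the UI thread stays responsive while the text is extracted.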

