How to extract text using pdftotext.exe with bold and italics identification?


Problem description

Dear friends,

I have been using pdftotext.exe to extract text from PDF files. The text accuracy is good, but I am unable to identify bold and italic text.
How can I tell whether a piece of extracted text was bold or italic?

I tried some other tools, such as CSWTestingReflow and PDF parser, but I went with pdftotext.exe for its better text accuracy.

Any ideas would be appreciated.


Sample code:

' Convert the PDF to text; pdftotext writes a .txt file next to the input PDF
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"
If fso.FileExists(sReadPDF & "_Text.txt") Then
    ' Read the text file
    Set adoStreamOut = New ADODB.Stream
    'adoStreamOut.Charset = "utf-8"
    adoStreamOut.Charset = "us-ascii"
    adoStreamOut.Open
    ' Load the same file whose existence was just checked
    adoStreamOut.LoadFromFile sReadPDF & "_Text.txt"
    sText = adoStreamOut.ReadText
    adoStreamOut.Close
End If

DoEvents

sText = Trim(sText)
' Drop form feeds (page breaks)
sText = Trim(Replace(sText, Chr(12), ""))
' Mark line breaks that follow ".", "?", "--" or "-" with placeholders
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
' Join all remaining (soft-wrapped) lines with a space
sText = Trim(Replace(sText, vbCrLf, " "))
' Restore the line breaks after sentence ends
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
' Rejoin words hyphenated across a line break; keep double dashes
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))
sText = Trim(Replace(sText, "--", "—"))
' Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0



Thanks
jai

Recommended answer
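
The plain-text output of pdftotext does not carry font style, so bold and italics cannot be recovered from the .txt file alone. One possible route, assuming Poppler's pdftohtml is available alongside pdftotext, is to run it with the -xml option and read the <b> and <i> tags it places inside each <text> element. The sketch below is in Python rather than VB6 and is only an illustration under those assumptions: styled_runs, _walk and SAMPLE are hypothetical names, and the sample XML is hand-written to resemble that output, not taken from a real file.

```python
import xml.etree.ElementTree as ET


def _walk(el, bold, italic, runs):
    """Collect (text, is_bold, is_italic) tuples from el and its children."""
    # A <b> or <i> tag switches the style on for everything nested inside it.
    bold = bold or el.tag == "b"
    italic = italic or el.tag == "i"
    if el.text and el.text.strip():
        runs.append((el.text.strip(), bold, italic))
    for child in el:
        _walk(child, bold, italic, runs)
        # Text after a child's closing tag belongs to the parent's style.
        if child.tail and child.tail.strip():
            runs.append((child.tail.strip(), bold, italic))


def styled_runs(xml_text):
    """Return the styled runs found in every <text> element of the XML."""
    root = ET.fromstring(xml_text)
    runs = []
    for text_el in root.iter("text"):
        _walk(text_el, False, False, runs)
    return runs


# Hand-written sample that mimics the assumed structure of the XML output.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="90" width="200" height="14" font="0">Plain <b>bold</b> and <i>italic</i> and <i><b>both</b></i></text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for text, is_bold, is_italic in styled_runs(SAMPLE):
        print(f"{text!r}: bold={is_bold}, italic={is_italic}")
```

For a real file, the XML written by pdftohtml could be read from disk and passed to styled_runs in the same way. Whether the tags appear depends on how the PDF's fonts are named, so the result is worth checking against a known document.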
