Pdf.js (for node) not rendering entire contents of pdf
Problem description
I am trying to search through the text of a PDF using https://www.npmjs.com/package/pdfjs-dist-for-node.
My code looks like this:
gettext: function() {
    var data = '../static/example.pdf';
    return pdfjs.getDocument(data).then(function(pdf) {
        // Collect zero-based page indices; getPage() expects one-based numbers.
        var pages = [];
        for (var i = 0; i < pdf.numPages; i++) {
            pages.push(i);
        }
        return Promise.all(pages.map(function(pageNumber) {
            return pdf.getPage(pageNumber + 1).then(function(page) {
                return page.getTextContent().then(function(textContent) {
                    // Join the text items of one page into a single string.
                    return textContent.items.map(function(item) {
                        return item.str;
                    }).join(' ');
                });
            });
        })).then(function(pages) {
            // One line of output per page.
            return pages.join("\r\n");
        });
    }).then(function(pages) {
        console.log(pages);
    });
}
This seems to work, but it skips parts of the text. Specifically, it skips whatever I can't highlight with the mouse in the original PDF document. Is there a way to get pdf.js to pick up this data?
Recommended answer
If the text is not selectable when you view the PDF, it is actually stored as an image, which means there is no text for pdf.js to extract and you won't be able to search for it.
So unfortunately, this is not possible unless you first set up something else to run OCR on the PDF and convert the images to text.
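To decide which pages might need OCR, one option is to check which pages produced no text at all. The sketch below is only an illustration: `findPagesWithoutText` is a hypothetical helper, not part of pdf.js, and it assumes you already have an array with one extracted string per page (such as the array the question's code builds before joining). It only catches pages that are entirely images; a page that mixes real text with image text would not be flagged.

```javascript
// Hypothetical helper (not a pdf.js API): given the text extracted from each
// page (index 0 corresponds to page 1), return the one-based numbers of the
// pages that yielded no text. Such pages are likely scanned images.
function findPagesWithoutText(pageTexts) {
    var missing = [];
    pageTexts.forEach(function(text, index) {
        // Treat whitespace-only output the same as no output.
        if (text.trim().length === 0) {
            missing.push(index + 1);
        }
    });
    return missing;
}

// Example with made-up page texts: the second page yielded nothing.
console.log(findPagesWithoutText(['Intro text', '   ', 'More text'])); // [ 2 ]
```

In the question's code, the per-page array is available inside the `.then(function(pages) { ... })` callback that follows `Promise.all`, before `pages.join("\r\n")` is called, so the helper could be applied there.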