How to extract text from PDF in JavaScript

Question

Is it possible to get the text inside a PDF file using only JavaScript? If so, can anyone show me how?

I know there are server-side libraries for Java, C#, and so on, but I would prefer not to use a server. Thanks.

Recommended answer

This is an old question, but pdf.js has developed considerably over the years, so I would like to give a new answer. It can be done locally, without involving any server or external service. Newer versions of pdf.js have a function, page.getTextContent(), from which you can get the text content. I have done it successfully with the code at the end of this answer.

What you get at each step is a promise, so you chain .then(function(){...}) to proceed to the next step:

1) PDFJS.getDocument( data ).then( function(pdf) {

2) pdf.getPage(i).then(function(page){

3) page.getTextContent().then( function(textContent){
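The three steps above can be chained as in the following minimal sketch. The helper name getPageTextContent and its pdfjs parameter are hypothetical, introduced only for illustration. The sketch assumes that newer pdf.js releases return a loading task with a .promise property from getDocument, while the older version used in this answer returns a thenable directly; check this against the version you have installed.

```javascript
// Hypothetical helper: resolves to the raw textContent object of one page.
// `pdfjs` is whatever object exposes getDocument (PDFJS in this answer).
function getPageTextContent(pdfjs, data, pageNumber) {
  var task = pdfjs.getDocument(data);
  // Assumption: newer releases expose the promise as task.promise,
  // older ones return something that is itself thenable.
  var docPromise = task.promise || task;
  return Promise.resolve(docPromise)
    .then(function (pdf) {
      // Step 2: pages are numbered from 1.
      return pdf.getPage(pageNumber);
    })
    .then(function (page) {
      // Step 3: resolves to the text content of that page.
      return page.getTextContent();
    });
}
```

Returning each promise from the .then callback keeps the chain flat instead of nesting one callback inside another, and lets a single .catch at the end handle a failure at any step.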

What you finally get is an array, textContent.bidiTexts[], of text blocks, each carrying a string (str) and its coordinates. You concatenate the strings to get the text of one page. The blocks' coordinates are used to judge whether a newline or a space needs to be inserted. (This may not be totally robust, but in my tests it seems OK.)
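To illustrate how coordinates can drive the joining, here is a simplified sketch. It is not the exact heuristic used in this answer's code: it starts a new line whenever x moves back to the left and otherwise inserts a single space when neither neighbouring block supplies whitespace. The helper name joinBlocks is hypothetical, and each block is assumed to have the shape { str, x, y }.

```javascript
// Hypothetical helper: join positioned text blocks into one string.
// Each block is assumed to look like { str: "text", x: 10, y: 700 }.
function joinBlocks(blocks) {
  var text = "";
  var last = null;
  for (var k = 0; k < blocks.length; k++) {
    var block = blocks[k];
    if (last !== null) {
      if (block.x < last.x) {
        // x moved back to the left: treat it as the start of a new line.
        text += "\n";
      } else if (!/\s$/.test(last.str) && !/^\s/.test(block.str)) {
        // Same line and no whitespace on either side: add one space.
        text += " ";
      }
    }
    text += block.str;
    last = block;
  }
  return text;
}

console.log(joinBlocks([
  { str: "Hello", x: 10, y: 700 },
  { str: "world", x: 60, y: 700 },
  { str: "Next line", x: 10, y: 680 }
]));
```

Like the original heuristic, this can misjudge layouts such as multi-column pages or words split across blocks, so treat the output as approximate.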

The input parameter data needs to be either a URL or an ArrayBuffer. I used the readAsArrayBuffer(file) method of the FileReader API to get the data.
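Reading a user-selected file into an ArrayBuffer might look like the sketch below. The helper name readFileAsArrayBuffer is hypothetical. The sketch assumes a browser environment, where FileReader is available and the file comes from something like an input element of type file.

```javascript
// Hypothetical helper: read a File or Blob into an ArrayBuffer.
function readFileAsArrayBuffer(file) {
  return new Promise(function (resolve, reject) {
    var reader = new FileReader();
    // On success, reader.result holds the ArrayBuffer.
    reader.onload = function () { resolve(reader.result); };
    // On failure, reader.error describes what went wrong.
    reader.onerror = function () { reject(reader.error); };
    reader.readAsArrayBuffer(file);
  });
}
```

The resolved buffer can then be passed as the data argument, for example readFileAsArrayBuffer(input.files[0]).then(function (buffer) { ... }), where input is assumed to be a file input element on your page.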

Hope this helps.

Note: According to other users, the library has since been updated and the code below no longer works as written. According to a comment by async5, you need to replace textContent.bidiTexts with textContent.items.
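One way to cope with both versions is to normalise the entries before joining them, as in this sketch. The helper name toBlocks is hypothetical. The sketch assumes that entries in textContent.items carry their position in a transform array whose fifth and sixth elements are the x and y offsets, while the older bidiTexts entries expose x and y directly; verify both against the pdf.js version you use.

```javascript
// Hypothetical helper: return text blocks as { str, x, y } whichever
// property the installed pdf.js version provides.
function toBlocks(textContent) {
  var entries = textContent.items || textContent.bidiTexts || [];
  return entries
    // Skip anything that does not carry a text string.
    .filter(function (entry) { return typeof entry.str === "string"; })
    .map(function (entry) {
      // Assumption: transform is [a, b, c, d, x, y] in newer versions.
      var t = entry.transform;
      return {
        str: entry.str,
        x: t ? t[4] : entry.x,
        y: t ? t[5] : entry.y
      };
    });
}
```

With the blocks in this shape, the coordinate checks on block.x and block.y can stay the same for either version of the library.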

    function Pdf2TextClass() {
      var self = this;
      this.complete = 0;

      /**
       * @param data ArrayBuffer of the PDF file content, or a URL string
       * @param callbackPageDone Reports progress each time a page is finished.
       *        The callback's input parameters are:
       *        1) number of pages done;
       *        2) total number of pages in the file.
       * @param callbackAllDone Called once with the text extracted
       *        from the PDF file.
       */
      this.pdfToText = function (data, callbackPageDone, callbackAllDone) {
        console.assert(data instanceof ArrayBuffer || typeof data == 'string');
        self.complete = 0; // reset so the same instance can be reused
        PDFJS.getDocument(data).then(function (pdf) {
          var total = pdf.numPages;
          callbackPageDone(0, total);
          var layers = {}; // page number -> extracted text
          for (var i = 1; i <= total; i++) {
            pdf.getPage(i).then(function (page) {
              var n = page.pageNumber;
              page.getTextContent().then(function (textContent) {
                // See the note above: newer versions use textContent.items.
                if (null != textContent.bidiTexts) {
                  var page_text = "";
                  var last_block = null;
                  for (var k = 0; k < textContent.bidiTexts.length; k++) {
                    var block = textContent.bidiTexts[k];
                    if (last_block != null && last_block.str[last_block.str.length - 1] != ' ') {
                      if (block.x < last_block.x)
                        page_text += "\r\n"; // x moved back: new line
                      else if (last_block.y != block.y && (last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null))
                        page_text += ' ';
                    }
                    page_text += block.str;
                    last_block = block;
                  }

                  console.log("page " + n + " finished.");
                  layers[n] = page_text + "\n\n";
                }
                ++self.complete;
                callbackPageDone(self.complete, total);
                if (self.complete == total) {
                  window.setTimeout(function () {
                    // Pages finish in any order, so assemble by page number.
                    var full_text = "";
                    for (var j = 1; j <= total; j++)
                      full_text += layers[j] || ""; // skip pages without text
                    callbackAllDone(full_text);
                  }, 1000);
                }
              }); // end of page.getTextContent().then
            }); // end of pdf.getPage().then
          } // end of for
        });
      }; // end of pdfToText()
    } // end of class
