Convert PDFs to Google Docs and get the extracted text, along with a link to the original PDF file, into a spreadsheet


Question

I'm running a script that converts PDF files to Google Docs with OCR and then populates the results into a Google Sheet. I have a small problem getting the link to each PDF file.

So far I get the name of the created file (the Google Doc) and the extracted text.

What I'm trying to achieve: in the third column, I'd like the link to the PDF file that corresponds to each Google Doc created.

I've tried including the variable pdf in the array I push:

    var pdf = document.getId();
...
    output.push([name, body, pdf]);

But I get the same ID for all files. Ideally I'd like the full link to the PDF file and not only its ID, to avoid having to create a column with:

=CONCATENATE("https://drive.google.com/file/d/",C2)

where C2 contains the ID of the PDF file.

Here is the code:

function extractTextOnOpen() {
  var folderName = "OCR TESTS";
  var sheetId = "SHEET'S ID HERE";

  // Define project folder
  var folder = DriveApp.getFoldersByName(folderName).next();
  var folderId = folder.getId();

  // Find all PDFs in folder
  var documents = folder.getFilesByType("application/pdf");
  while (documents.hasNext()) {
    // Convert each PDF to a Google Doc with OCR
    var document = documents.next();

    // Get the PDF link to insert in the sheet for reference
    var pdf = document.getId();

    var imageName = document.getName();
    var docName = imageName.split("\.")[0];
    var file = {
      title: docName,
      mimeType: "application/pdf"
      // for images, use: "image/png"
    };
    Drive.Files.insert(file, document, { ocr: true });

    // Store newly-created Google Doc in the same project folder
    var newFile = DriveApp.getFilesByName(docName).next();
    folder.addFile(newFile);
    var rootFolder = DriveApp.getRootFolder();
    rootFolder.removeFile(newFile);
  }

  // Find all Google Docs in the project folder
  var docs = folder.getFilesByType("application/vnd.google-apps.document");

  // Set up spreadsheet
  var ss = SpreadsheetApp.openById(sheetId);
  SpreadsheetApp.setActiveSpreadsheet(ss);
  Logger.log('File name: ' + ss.getName());

  // Specify the sheet to insert the results
  var sheet = ss.getSheetByName('Sheet1');

  // Set up the spreadsheet to display the results
  var headers = [["File Name", "Test Paper Scanned", "PDF Link"]];
  sheet.getRange("A1:C").clear();
  sheet.getRange("A1:C1").setValues(headers);

  var output = [];

  // Populate spreadsheet with OCR text
  while (docs.hasNext()) {
    var file = docs.next();
    var docId = file.getId();
    var doc = DocumentApp.openById(docId);
    var name = doc.getName();
    var body = doc.getBody().getText();

    output.push([name, body, pdf]);

    // Write data to the sheet
    sheet.getRange(2, 1, output.length, 3).setValues(output);
  }
}

Recommended answer

You have an unnecessary second loop, and by the time it runs you have lost the chance to save each PDF's URL: the variable pdf only holds the ID of the last PDF processed, which is why every row shows the same ID. I've reordered your code to show how this works. Essentially, all of the work happens within the first loop, where you iterate through the PDFs.*

function extractTextOnOpen() {
  var folderName = "OCR TESTS";
  var sheetId = "SHEET'S ID HERE";

  // Set up spreadsheet
  var ss = SpreadsheetApp.openById(sheetId);

  // Specify the sheet to insert the results
  var sheet = ss.getSheetByName("Sheet1");

  // Set up the spreadsheet to display the results
  var headers = ["File Name", "Test Paper Scanned", "PDF Link"];
  sheet.getRange("A1:C").clear();
  var output = [headers];

  // Define project folder
  var folder = DriveApp.getFoldersByName(folderName).next();

  // Find all PDFs in folder
  var pdfs = folder.getFilesByType("application/pdf");
  while (pdfs.hasNext()) {
    // Convert each PDF to a Google Doc with OCR.
    // Drive.Files.insert requires the Advanced Drive Service (Drive API v2) to be enabled.
    var pdf = pdfs.next();
    var imageName = pdf.getName();
    // Keep the text before the first dot as the Doc title
    var docName = imageName.split(".")[0];
    var file = {
      title: docName,
      mimeType: "application/pdf"
    };
    var driveFile = Drive.Files.insert(file, pdf, { ocr: true });

    // Store newly-created Google Doc in the same project folder
    var newFile = DriveApp.getFileById(driveFile.id);
    folder.addFile(newFile);
    var rootFolder = DriveApp.getRootFolder();
    rootFolder.removeFile(newFile);

    // Get the Google Doc data and pair it with the URL of the source PDF
    var doc = DocumentApp.openById(driveFile.id);
    var name = doc.getName();
    var body = doc.getBody().getText();
    output.push([name, body, pdf.getUrl()]);
  }

  // Write everything to the sheet in a single call
  sheet.getRange(1, 1, output.length, output[0].length).setValues(output);
}
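
A side note on the title logic in the script above: split(".")[0] keeps only the text before the first dot, so a name such as "report.v2.pdf" would become "report". If that matters for your files, a small helper along these lines could be used instead. The name stripExtension is hypothetical and not part of the original script; this is a sketch, not a required change.

```javascript
// Hypothetical helper (not in the original script):
// remove only the final extension from a file name.
function stripExtension(fileName) {
  var lastDot = fileName.lastIndexOf(".");
  // No dot at all, or only a leading dot (e.g. ".hidden"): nothing to strip.
  if (lastDot <= 0) {
    return fileName;
  }
  return fileName.slice(0, lastDot);
}
```

In the loop, var docName = stripExtension(imageName); would then replace the split call.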

In the code above, note that the Drive API call returns a file resource, so its id is used in the subsequent .getFileById() and .openById() calls. For the link to the PDF you can then use either .getUrl() or .getDownloadUrl() on the PDF file object.

var driveFile = Drive.Files.insert(file, pdf, { ocr: true });

//Store newly-created Google Doc in the same project folder
var newFile = DriveApp.getFileById(driveFile.id);
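
If you only have a file ID at hand, as in the CONCATENATE formula from the question, the same link can be built in the script rather than in the sheet. The sketch below mirrors that formula exactly; buildDriveFileUrl is a hypothetical name, and the URL prefix is the one given in the question.

```javascript
// Hypothetical helper: script-side equivalent of
// =CONCATENATE("https://drive.google.com/file/d/", C2)
function buildDriveFileUrl(fileId) {
  return "https://drive.google.com/file/d/" + fileId;
}
```

When the file object itself is available, calling .getUrl() on it is simpler and avoids hard-coding the prefix.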

Also, you're using the batch .setValues(), which is faster, but you call it inside a loop, so the growing output array is rewritten on every iteration. I updated the script to write to the sheet only once, at the very end.
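
The write-once pattern can be isolated as a small function. This is a minimal sketch: writeRowsOnce is a hypothetical name, and the sheet argument is assumed to be an object exposing getRange(row, column, numRows, numColumns) with a setValues method on the result, as an Apps Script Sheet does.

```javascript
// Hypothetical helper: write all collected rows with a single setValues call.
// `rows` is a rectangular 2D array whose first row holds the headers.
function writeRowsOnce(sheet, rows) {
  if (rows.length === 0) {
    return 0; // nothing to write; setValues rejects an empty range
  }
  sheet.getRange(1, 1, rows.length, rows[0].length).setValues(rows);
  return rows.length;
}
```

The loop then only pushes rows into the array, and writeRowsOnce(sheet, output) runs after the loop has finished.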

* If you really do want to keep two loops, first over the PDFs and then over the Google Docs, you will need to map each PDF ID to its Google Doc ID in the first loop.
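
A rough sketch of that two-loop variant, using plain objects in place of the Drive and Docs services so the lookup logic is visible on its own. The function names and the shapes of the conversions and docs arrays are assumptions made for illustration; in the real script the first array would be filled inside the PDF loop (with driveFile.id and pdf.getUrl()) and the second inside the Google Docs loop.

```javascript
// First pass: remember which PDF produced which Google Doc.
// `conversions` is assumed to be an array of { docId, pdfUrl } objects.
function mapDocsToPdfUrls(conversions) {
  var pdfUrlByDocId = {};
  conversions.forEach(function (conversion) {
    pdfUrlByDocId[conversion.docId] = conversion.pdfUrl;
  });
  return pdfUrlByDocId;
}

// Second pass: build one sheet row per Google Doc, looking up its PDF link.
// `docs` is assumed to be an array of { id, name, text } objects.
function buildRows(docs, pdfUrlByDocId) {
  return docs.map(function (doc) {
    // Docs that did not come from a converted PDF get an empty link cell.
    return [doc.name, doc.text, pdfUrlByDocId[doc.id] || ""];
  });
}
```

Keying the map on the Doc ID rather than the file name avoids mix-ups when two PDFs share a title.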
