Using Drive API / DriveApp to convert from PDFs to Google Documents

Problem

This problem has been successfully resolved. I am editing my post to document my experience for posterity and future reference.

I have 117 PDF files (average size ~238 KB) uploaded to Google Drive. I want to convert them all to Google Docs and keep them in a different Drive folder.

I attempted to convert the files using Drive.Files.insert. However, under most circumstances, only 5 files could be converted this way before the function terminated prematurely with this error:

Limit Exceeded: DriveApp. (line #, file "Code")

where the referenced line is the one that calls the insert function. After this function was called for the first time, subsequent calls typically failed immediately, and no additional Google Doc was created.

I used 3 main ways to achieve my goal. One was using the Drive.Files.insert, as mentioned above. The other two involved using Drive.Files.copy and sending a batch of HTTP requests. These last two methods were suggested by Tanaike, and I recommend reading his answer below for more information. The insert and copy functions are from Google Drive REST v2 API, while batching multiple HTTP requests is from Drive REST v3.

With Drive.Files.insert, I experienced issues dealing with execution limitations (explained in the Problem section above). One solution was to run the functions multiple times. And for that, I needed a way to keep track of which files were converted. I had two options for this: using a spreadsheet and a continuation token. Therefore, I had 4 different methods to test: the two mentioned in this paragraph, batching HTTP requests, and calling Drive.Files.copy.
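The "run it several times and remember what is done" idea can be sketched independently of where the record is kept (spreadsheet column or script property). This is only an illustration: convertWithBudget, its parameters, and the injectable clock are hypothetical names, and convert stands in for whichever Drive call does the real work.

```javascript
// Hypothetical helper: convert pending files until a time budget is spent.
// fileIds  - every file ID in the source folder
// doneIds  - IDs recorded as converted by earlier runs
// convert  - stand-in for the real conversion call
// budgetMs - how long this run may keep starting new conversions
// now      - optional clock function (defaults to Date.now), handy for testing
function convertWithBudget(fileIds, doneIds, convert, budgetMs, now) {
  now = now || Date.now;
  var start = now();
  var done = {};
  doneIds.forEach(function (id) { done[id] = true; });

  var converted = [];
  for (var i = 0; i < fileIds.length; i++) {
    var id = fileIds[i];
    if (done[id]) continue;                 // converted in an earlier run
    if (now() - start >= budgetMs) break;   // stop before the execution limit
    convert(id);
    done[id] = true;
    converted.push(id);
  }
  return {
    converted: converted,                   // record these before exiting
    remaining: fileIds.filter(function (id) { return !done[id]; })
  };
}
```

The caller would persist result.converted at the end of each run and pass the accumulated list back in as doneIds on the next run, so progress is recorded even when a run stops early.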

Because team drives behave differently from regular drives, I felt it necessary to try each of those methods twice, one in which the folder containing the PDFs is a regular non-Team Drive folder and one in which that folder is under a Team Drive. In total, this means I had 8 different methods to test.

These are the exact functions I used. Each of these was used twice, with the only variations being the ID of the source and destination folders (for reasons stated above):

Method A: Using Drive.Files.insert and a spreadsheet

function toDocs() {
  var sheet = SpreadsheetApp.openById(/* spreadsheet id*/).getSheets()[0];
  var range = sheet.getRange("A2:E118");
  var table = range.getValues();
  var len = table.length;
  var resources = {
    title: null,
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [{id: "/* destination folder id */"}]
  };
  var count = 0;
  var files = DriveApp.getFolderById(/* source folder id */).getFiles();
  while (files.hasNext()) {
    var blob = files.next().getBlob();
    var blobName = blob.getName();
    for (var i=0; i<len; i++) {
      if (table[i][0] === blobName.slice(5, 18)) {
        if (table[i][4])
          break;
        resources.title = blobName;
        Drive.Files.insert(resources, blob);  // Limit Exceeded: DriveApp. (line 51, file "Code")
        table[i][4] = "yes";
      }
    }

    if (++count === 10) {
      range.setValues(table);
      Logger.log("time's up");
    }
  }
}

Method B: Using Drive.Files.insert and a continuation token

function toDocs() {
  var folder = DriveApp.getFolderById(/* source folder id */);
  var sprop = PropertiesService.getScriptProperties();
  var contToken = sprop.getProperty("contToken");
  var files = contToken ? DriveApp.continueFileIterator(contToken) : folder.getFiles();
  var options = {
    ocr: true
  };
  var resource = {
    title: null,
    mimeType: null,
    parents: [{id: "/* destination folder id */"}]
  };

  while (files.hasNext()) {
    var blob = files.next().getBlob();
    resource.title = blob.getName();
    resource.mimeType = blob.getContentType();
    Drive.Files.insert(resource, blob, options);  // Limit Exceeded: DriveApp. (line 113, file "Code")
    sprop.setProperty("contToken", files.getContinuationToken());
  }
}

Method C: Using Drive.Files.copy

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* destination folder id */";
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
    Logger.log(res);
  }
}

Method D: Sending batches of HTTP requests

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* destination folder id */";

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 20; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
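    // The global /batch endpoint has since been retired; the Drive-specific batch path is used here.
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch/drive/v3", options).getContentText();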
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

What Worked and What Didn't

  • None of the functions using Drive.Files.insert worked. Every function using insert for conversion failed with this error:

    Limit Exceeded: DriveApp. (line #, file "Code")

    (line number replaced with generic symbol). No further details or description of the error could be found. A notable variation was the one in which I used a spreadsheet and the PDFs were in a Team Drive folder; while all other methods failed instantly without converting a single file, this one converted 5 before failing. However, when considering why this variation did better than the others, I think it was more of a fluke than any reason related to the use of particular resources (spreadsheet, Team Drive, etc.).

  • Using Drive.Files.copy and batch HTTP requests worked only when the source folder was a personal (non-Team Drive) folder.

  • Attempting to use the copy function while reading from a Team Drive folder fails with this error:

    File not found: 1RAGxe9a_-euRpWm3ePrbaGaX5brpmGXu (line #, file "Code")

    (line number replaced with generic symbol). The line being referenced is

    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});

  • Using batch HTTP requests while reading from a Team Drive folder does nothing: no doc files are created and no errors are thrown. The function silently terminates without having accomplished anything.
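Because the batch function sets muteHttpExceptions to true and leaves its Logger.log call commented out, any per-request failure inside the batch response would go unseen. A minimal sketch for surfacing those failures follows, assuming the response body uses the usual multipart layout in which each part carries its own HTTP status line; batchStatuses is a hypothetical helper name.

```javascript
// Hypothetical helper: list the HTTP status of every part in a batch response body.
function batchStatuses(responseText) {
  var statuses = [];
  // Matches status lines such as "HTTP/1.1 404 Not Found" at the start of a line.
  var re = /^HTTP\/\d(?:\.\d)?\s+(\d{3})\s*(.*)$/gm;
  var m;
  while ((m = re.exec(responseText)) !== null) {
    statuses.push({ code: Number(m[1]), reason: m[2].trim() });
  }
  return statuses;
}

// Keep only the parts that report a client or server error.
function failedParts(responseText) {
  return batchStatuses(responseText).filter(function (s) { return s.code >= 400; });
}
```

Passing the text returned by getContentText() to failedParts and logging the result would show whether individual copy requests were rejected even though the batch call itself raised no exception.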

If you wish to convert a large number of PDFs to Google Docs or text files, then use Drive.Files.copy or send batches of HTTP requests, and make sure that the PDFs are stored in a personal drive rather than a Team Drive.
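The Team Drive failures were not investigated further in this post. One thing that may be worth trying, purely as an untested assumption, is the Drive API's supportsAllDrives request parameter (named supportsTeamDrives in older API versions), which the API reference describes as declaring that the caller can handle shared-drive items. The sketch below only builds one request entry in the shape used by the batch method; buildCopyRequest and its inSharedDrive flag are hypothetical names, not part of the original code.

```javascript
// Hypothetical helper: build one copy-and-convert request entry for Drive API v3.
// inSharedDrive is an assumed flag; when true, supportsAllDrives=true is appended.
function buildCopyRequest(fileId, destinationFolderId, inSharedDrive) {
  var endpoint = "https://www.googleapis.com/drive/v3/files/" +
      encodeURIComponent(fileId) + "/copy";
  if (inSharedDrive) {
    endpoint += "?supportsAllDrives=true";
  }
  return {
    method: "POST",
    endpoint: endpoint,
    requestBody: {
      mimeType: "application/vnd.google-apps.document", // convert the copy to a Google Doc
      parents: [destinationFolderId]
    }
  };
}
```

Whether this resolves the "File not found" error or the silent batch failure was not verified here; keeping the PDFs in a personal drive remains the approach that was actually tested.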

Special thanks to @tehhowch for taking such a strong interest in my question and for repeatedly coming back to provide feedback, and to @Tanaike for providing code along with explanations that successfully solved my problem (with a caveat; read above for details).

Recommended answer

You want to convert the PDF files in a folder to Google Documents. The PDF files are in a Team Drive folder. You want to import the converted files into a folder of your Google Drive. If my understanding is correct, how about this method?

For the conversion from PDF to Google Document, you can use not only Drive.Files.insert() but also Drive.Files.copy(). The advantages of using Drive.Files.copy() are:

  • Although Drive.Files.insert() has a size limitation of 5 MB, Drive.Files.copy() can handle files larger than 5 MB.
  • In my environment, the process speed was faster than Drive.Files.insert().

For this method, I would like to propose the following 2 patterns.

Pattern 1 : Using Drive API v2

In this case, Drive API v2 of Advanced Google Services is used for converting files.

      function myFunction() {
        var sourceFolderId = "/* source folder id */";
        var destinationFolderId = "/* dest folder id */";
        var files = DriveApp.getFolderById(sourceFolderId).getFiles();
        while (files.hasNext()) {
          var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
      //    Logger.log(res) // If you use this, please remove the comment.
        }
      }
      

Pattern 2 : Using Drive API v3

In this case, Drive API v3 is used for converting files. Here I used batch requests, because one batch request can bundle up to 100 API calls into a single call. This can avoid the API quota issue.

      function myFunction() {
        var sourceFolderId = "/* source folder id */";
        var destinationFolderId = "/* dest folder id */";
      
        var files = DriveApp.getFolderById(sourceFolderId).getFiles();
        var rBody = [];
        while (files.hasNext()) {
          rBody.push({
            method: "POST",
            endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
            requestBody: {
              mimeType: "application/vnd.google-apps.document",
              parents: [destinationFolderId]
            }
          });
        }
        var cycle = 100; // Number of API calls at 1 batch request.
        for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
          var offset = i * cycle;
          var body = rBody.slice(offset, offset + cycle);
          var boundary = "xxxxxxxxxx";
          var contentId = 0;
          var data = "--" + boundary + "\r\n";
          body.forEach(function(e){
            data += "Content-Type: application/http\r\n";
            data += "Content-ID: " + ++contentId + "\r\n\r\n";
            data += e.method + " " + e.endpoint + "\r\n";
            data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
            data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
            data += "--" + boundary + "\r\n";
          });
          var options = {
            method: "post",
            contentType: "multipart/mixed; boundary=" + boundary,
            payload: Utilities.newBlob(data).getBytes(),
            headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
            muteHttpExceptions: true,
          };
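          // The global /batch endpoint has since been retired; the Drive-specific batch path is used here.
          var res = UrlFetchApp.fetch("https://www.googleapis.com/batch/drive/v3", options).getContentText();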
      //    Logger.log(res); // If you use this, please remove the comment.
        }
      }
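The body-building part of the batch approach can be pulled out into pure functions so it can be checked without calling Drive. This is a sketch rather than the answer's exact code: chunk and buildBatchBody are hypothetical names, and unlike the loop above it ends the payload with a closing "--boundary--" delimiter, as multipart bodies are specified to do.

```javascript
// Hypothetical helper: split the request list into groups of at most `size` entries.
function chunk(items, size) {
  var out = [];
  for (var i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Hypothetical helper: build a multipart/mixed batch body from request entries
// shaped like {method, endpoint, requestBody}.
function buildBatchBody(requests, boundary) {
  var parts = requests.map(function (req, i) {
    var lines = [
      "--" + boundary,
      "Content-Type: application/http",
      "Content-ID: " + (i + 1),
      "",
      req.method + " " + req.endpoint
    ];
    if (req.requestBody) {
      lines.push("Content-Type: application/json; charset=utf-8", "",
                 JSON.stringify(req.requestBody));
    } else {
      lines.push("");  // blank line ends the inner request headers
    }
    return lines.join("\r\n");
  });
  // Closing delimiter marks the end of the multipart body.
  return parts.join("\r\n") + "\r\n--" + boundary + "--\r\n";
}
```

With 117 files and a group size of 100, chunk yields two groups (100 and 17 requests), and each group's body would be sent in its own UrlFetchApp.fetch call.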
      

Note :

  • If the number of API calls in 1 batch request is too large (the current value is 100), please modify var cycle = 100.
  • If Drive API v3 cannot be used for Team Drive, please tell me. I can convert it for Drive API v2.
  • If the Team Drive is the cause of the issue in your situation, can you try this after copying the PDF files to your Google Drive?

If these are not useful for you, I'm sorry.
