Get PDF attachments from Gmail as text


Question

I searched around the web & Stack Overflow but didn't find a solution. What I try to do is the following: I get certain attachments via mail that I would like to have as (Plain) text for further processing. My script looks like this:

function MyFunction() {

  var threads = GmailApp.search ('label:templabel'); 
  var messages = GmailApp.getMessagesForThreads(threads); 

   for (i = 0; i < messages.length; ++i)
   {
     j = messages[i].length; 
   var messageBody = messages[i][0].getBody(); 
   var messageSubject = messages [i][0].getSubject();
     var attach = messages [i][0].getAttachments();
     var attachcontent = attach.getContentAsString();
    GmailApp.sendEmail("mail", messageSubject, "", {htmlBody: attachcontent});
    }
}

Unfortunately this doesn't work. Does anybody here have an idea how I can do this? Is it even possible?

Thank you very much.

Best, Phil

Recommended answer

Updated for DriveApp, since DocsList is deprecated.

I suggest breaking this down into two problems. The first is how to get a pdf attachment from an email, the second is how to convert that pdf to text.

As you've found out, getContentAsString() does not magically change a pdf attachment to plain text or html. We need to do something a little more complicated.

First, we'll get the attachment as a Blob, a utility class used by several Services to exchange data.

var blob = attachments[0].getAs(MimeType.PDF);
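
The line above assumes the first attachment is a PDF. If a message can carry other attachments too, a small filter can pick out the PDFs first. This is a minimal sketch, not part of the original answer: filterPdfAttachments is a hypothetical helper name, and it relies only on getContentType() and getName(), which GmailAttachment objects provide. The file-extension fallback is an assumption for senders that label a PDF with a generic content type.

```javascript
/**
 * Return only the attachments that look like PDFs.
 * Hypothetical helper: works on any objects exposing getContentType()
 * and getName(), as GmailAttachment objects do.
 */
function filterPdfAttachments(attachments) {
  return attachments.filter(function (attachment) {
    var type = String(attachment.getContentType() || '').toLowerCase();
    var name = String(attachment.getName() || '').toLowerCase();
    // Accept an explicit PDF content type, or fall back to the file
    // extension when the content type is not specific (assumption).
    return type.indexOf('application/pdf') === 0 || /\.pdf$/.test(name);
  });
}
```

In a script, the result could replace the direct index, for example `var pdfs = filterPdfAttachments(message.getAttachments());` followed by a check that `pdfs.length > 0` before calling `pdfs[0].getAs(MimeType.PDF)`.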

So with the second problem separated out, and maintaining the assumption that we're interested in only the first attachment of the first message of each thread labeled templabel, here is how myFunction() looks:

/**
 * Get messages labeled 'templabel', and send myself the text contents of
 * pdf attachments in new emails.
 */
function myFunction() {

  var threads = GmailApp.search('label:templabel');
  var threadsMessages = GmailApp.getMessagesForThreads(threads);

  for (var thread = 0; thread < threadsMessages.length; ++thread) {
    var message = threadsMessages[thread][0];
    var messageBody = message.getBody();
    var messageSubject = message.getSubject();
    var attachments = message.getAttachments();

    // Skip threads whose first message has no attachments.
    if (attachments.length === 0) continue;

    var blob = attachments[0].getAs(MimeType.PDF);
    var filetext = pdfToText( blob, {keepTextfile: false} );

    GmailApp.sendEmail(Session.getActiveUser().getEmail(), messageSubject, filetext);
  }
}

We're relying on a helper function, pdfToText(), to convert our pdf blob into text, which we'll then send to ourselves as a plain text email. This helper function has a variety of options; by setting keepTextfile: false, we've elected to just have it return the text content of the PDF file to us, and leave no residual files in our Drive.
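
How keepTextfile and textResult interact is easy to misread, so here is a small sketch of the rule. It mirrors the default handling and the final return condition in the pdfToText() listing further down, but describePdfToTextResult is a hypothetical name introduced only for illustration. It does no conversion and just reports which kind of value pdfToText() would return.

```javascript
/**
 * Report what pdfToText() would return for a given options object:
 * 'text' for the extracted text, 'fileId' for the id of the saved text file.
 * Hypothetical illustration; it performs no conversion itself.
 */
function describePdfToTextResult(options) {
  options = options || {};
  // keepTextfile defaults to true when it is not supplied.
  var keepTextfile = Object.prototype.hasOwnProperty.call(options, 'keepTextfile')
    ? options.keepTextfile
    : true;
  // Text comes back when no text file is kept, or when textResult asks for it.
  return (!keepTextfile || options.textResult) ? 'text' : 'fileId';
}
```

So `{keepTextfile: false}` yields the text with nothing left in Drive, `{textResult: true}` yields the text and also keeps the text file, and calling with no options yields the id of the saved text file.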

This utility is available as a gist (the URL is in the header comment of the code below). Several examples are provided there.

A previous answer indicated that it was possible to use the Drive API's insert method to perform OCR, but it didn't provide code details. With the introduction of Advanced Google Services, the Drive API is easily accessible from Google Apps Script. You do need to switch on and enable the Drive API from the editor, under Resources > Advanced Google Services.

pdfToText() uses the Drive service to generate a Google Doc from the content of the PDF file. Unfortunately, this contains the "pictures" of each page in the document - not much we can do about that. It then uses the regular DocumentService to extract the document body as plain text.
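
Since the goal is plain text for further processing, it may help to tidy the extracted string before using it. The following is only a sketch under assumptions: tidyOcrText is a hypothetical helper that is not part of the gist, and whether the OCR output needs this cleanup at all depends on the PDF.

```javascript
/**
 * Normalize whitespace in text extracted from an OCR'd document.
 * Hypothetical post-processing step, not part of pdfToText().
 */
function tidyOcrText(text) {
  return String(text)
    .replace(/\r\n?/g, '\n')      // normalize line endings to \n
    .replace(/[ \t]+$/gm, '')     // strip trailing spaces and tabs on each line
    .replace(/\n{3,}/g, '\n\n')   // collapse runs of blank lines to one
    .trim();                      // drop leading and trailing whitespace
}
```

It could be applied to the return value, for example `var filetext = tidyOcrText(pdfToText(blob, {keepTextfile: false}));`.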

/**
 * See gist: https://gist.github.com/mogsdad/e6795e438615d252584f
 *
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *   keepPdf (boolean, default false)     Keep a copy of the original PDF file.
 *   keepGdoc (boolean, default false)    Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true) Keep a copy of the text file.
 *   path (string, default blank)         Folder path to store file(s) in.
 *   ocrLanguage (ISO 639-1 code)         Default 'en'.
 *   textResult (boolean, default false)  If true and keepTextfile true, return
 *                                        string of text content. If keepTextfile
 *                                        is false, text content is returned without
 *                                        regard to this option. Otherwise, return
 *                                        id of textfile.
 *
 * @param {blob}   pdfFile    Blob containing pdf file
 * @param {object} options    (Optional) Object specifying handling details
 *
 * @returns {string}          id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
  var parents = [];
  if (options.path) {
    parents.push( getDriveFolderFromPath (options.path) );
  }
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC  
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/pdf$/, 'txt');
    resource.mimeType = MimeType.PLAIN_TEXT;

    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}

The conversion to DriveApp is helped with this utility from Bruce McPherson:

// From: http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else { 
      return current ? null : prev; 
    }
  },DriveApp.getRootFolder()); 
}
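
To see how the reduce walk behaves without touching Drive, the same logic can be run against an in-memory stand-in. This is a sketch for illustration only: resolveFolderPath takes the root folder as a parameter instead of calling DriveApp.getRootFolder(), and fakeFolder is a hypothetical mock that imitates just getName() and getFoldersByName().

```javascript
// Same reduce-based walk as getDriveFolderFromPath(), but with the root
// folder passed in so it can be exercised without DriveApp.
function resolveFolderPath(root, path) {
  return (path || '/').split('/').reduce(function (prev, current) {
    if (prev && current) {
      var folders = prev.getFoldersByName(current);
      return folders.hasNext() ? folders.next() : null;
    }
    // Empty segments (from leading or trailing slashes) keep the current
    // value; a named segment after a failed lookup keeps the result null.
    return current ? null : prev;
  }, root);
}

// Hypothetical in-memory stand-in for a Drive folder.
function fakeFolder(name, children) {
  return {
    getName: function () { return name; },
    getFoldersByName: function (wanted) {
      var matches = (children || []).filter(function (child) {
        return child.getName() === wanted;
      });
      var index = 0;
      return {
        hasNext: function () { return index < matches.length; },
        next: function () { return matches[index++]; }
      };
    }
  };
}

var root = fakeFolder('root', [fakeFolder('Invoices', [fakeFolder('2015')])]);
var found = resolveFolderPath(root, '/Invoices/2015');   // the '2015' folder
var missing = resolveFolderPath(root, '/Invoices/nope'); // null
var same = resolveFolderPath(root, '');                  // the root folder itself
```

Note that a path with a missing segment resolves to null rather than throwing, so a caller that passes options.path may want to check the result before using it.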
