Get pdf-attachments from Gmail as text


Problem description

I searched around the web and Stack Overflow but didn't find a solution. What I'm trying to do is the following: I receive certain attachments by mail that I would like to have as (plain) text for further processing. My script looks like this:

function MyFunction() {

  var threads = GmailApp.search('label:templabel');
  var messages = GmailApp.getMessagesForThreads(threads);

  for (i = 0; i < messages.length; ++i)
  {
    j = messages[i].length;
    var messageBody = messages[i][0].getBody();
    var messageSubject = messages[i][0].getSubject();
    var attach = messages[i][0].getAttachments();
    var attachcontent = attach.getContentAsString();
    GmailApp.sendEmail("mail", messageSubject, "", {htmlBody: attachcontent});
  }
}

Unfortunately this doesn't work. Does anybody here have an idea how I can do this? Is it even possible?

Thank you very much in advance.

Best, Phil

Solution

Edit: Updated to use DriveApp, since DocsList has been deprecated.


I suggest breaking this down into two problems. The first is how to get a pdf attachment from an email, the second is how to convert that pdf to text.

As you've found out, getContentAsString() does not magically change a pdf attachment to plain text or html. We need to do something a little more complicated.
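
To see why, it can help to look at what the string actually holds. The snippet below is a minimal sketch, not part of the original script: `looksLikeRawPdf` is a hypothetical helper name, and it relies only on the fact that PDF files begin with the marker `%PDF-`. The commented usage assumes `attachment` is a single Gmail attachment and reads it with the Blob method `getDataAsString()`.

```javascript
/**
 * Hypothetical helper: returns true when a string looks like raw PDF data
 * (the file format's own syntax) rather than extracted, readable text.
 * PDF files begin with the marker "%PDF-" followed by a version number.
 */
function looksLikeRawPdf(content) {
  return typeof content === 'string' && content.slice(0, 5) === '%PDF-';
}

// Possible use in Apps Script (assumes `attachment` is one GmailAttachment):
//   var raw = attachment.getDataAsString();
//   if (looksLikeRawPdf(raw)) {
//     // Still a PDF container; it needs a real conversion step.
//   }
```

If the check returns true, the string is still the PDF container itself, so mailing it as `htmlBody` would not produce readable text.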

First, we'll get the attachment as a Blob, a utility class used by several Services to exchange data.

var blob = attachments[0].getAs(MimeType.PDF);

So with the second problem separated out, and maintaining the assumption that we're interested in only the first attachment of the first message of each thread labeled templabel, here is how myFunction() looks:

/**
 * Get messages labeled 'templabel', and send myself the text contents of
 * pdf attachments in new emails.
 */
function myFunction() {

  var threads = GmailApp.search('label:templabel');
  var threadsMessages = GmailApp.getMessagesForThreads(threads);

  for (var thread = 0; thread < threadsMessages.length; ++thread) {
    var message = threadsMessages[thread][0];
    var messageBody = message.getBody();
    var messageSubject = message.getSubject();
    var attachments = message.getAttachments();

    var blob = attachments[0].getAs(MimeType.PDF);
    var filetext = pdfToText( blob, {keepTextfile: false} );

    GmailApp.sendEmail(Session.getActiveUser().getEmail(), messageSubject, filetext);
  }
}

We're relying on a helper function, pdfToText(), to convert our pdf blob into text, which we'll then send to ourselves as a plain text email. This helper function has a variety of options; by setting keepTextfile: false, we've elected to just have it return the text content of the PDF file to us, and leave no residual files in our Drive.
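
Note that myFunction() assumes every matching message has at least one attachment and that the first one is a PDF. A minimal sketch of a guard follows; `firstPdfAttachment` is a hypothetical helper, not part of the original answer, and it assumes the sender labelled the file with the content type `application/pdf` (a PDF sent with a generic content type would be missed).

```javascript
/**
 * Hypothetical helper: returns the first attachment whose content type is
 * PDF, or null when the message has none. `message` only needs to offer
 * getAttachments(), and each attachment getContentType(), as GmailMessage
 * and GmailAttachment do.
 */
function firstPdfAttachment(message) {
  var attachments = message.getAttachments();
  for (var i = 0; i < attachments.length; ++i) {
    // 'application/pdf' is the value of MimeType.PDF in Apps Script.
    if (attachments[i].getContentType() === 'application/pdf') {
      return attachments[i];
    }
  }
  return null;
}

// Possible use inside the loop of myFunction():
//   var pdf = firstPdfAttachment(message);
//   if (!pdf) continue;   // skip messages without a PDF
//   var filetext = pdfToText(pdf.getAs(MimeType.PDF), {keepTextfile: false});
```

Skipping messages this way avoids an error on `attachments[0]` when a labelled thread starts with a message that has no attachment.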

pdfToText()

This utility is available as a gist (the URL is in the header comment of the code below). Several examples are provided there.

A previous answer indicated that it was possible to use the Drive API's insert method to perform OCR, but it didn't provide code details. With the introduction of Advanced Google Services, the Drive API is easily accessible from Google Apps Script. You do need to switch on and enable the Drive API from the editor, under Resources > Advanced Google Services.

pdfToText() uses the Drive service to generate a Google Doc from the content of the PDF file. Unfortunately, the resulting Doc also contains a "picture" of each page of the PDF, and there is not much we can do about that. The function then uses the regular Document service (DocumentApp) to extract the document body as plain text.

/**
 * See gist: https://gist.github.com/mogsdad/e6795e438615d252584f
 *
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *   keepPdf (boolean, default false)     Keep a copy of the original PDF file.
 *   keepGdoc (boolean, default false)    Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true) Keep a copy of the text file.
 *   path (string, default blank)         Folder path to store file(s) in.
 *   ocrLanguage (ISO 639-1 code)         Default 'en'.
 *   textResult (boolean, default false)  If true and keepTextfile true, return
 *                                        string of text content. If keepTextfile
 *                                        is false, text content is returned without
 *                                        regard to this option. Otherwise, return
 *                                        id of textfile.
 *
 * @param {blob}   pdfFile    Blob containing pdf file
 * @param {object} options    (Optional) Object specifying handling details
 *
 * @returns {string}          id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
  var parents = [];
  if (options.path) {
    parents.push( getDriveFolderFromPath (options.path) );
  }
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC  
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/pdf$/, 'txt');
    resource.mimeType = MimeType.PLAIN_TEXT;

    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}
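
The interaction between keepTextfile and textResult is easy to misread, so here is a small sketch that isolates just that decision. `pdfToTextReturnKind` is a hypothetical helper written for illustration; it copies the default handling and the final condition from pdfToText() above and does nothing else.

```javascript
/**
 * Hypothetical helper mirroring the option handling in pdfToText():
 * reports whether a call with the given options would return the extracted
 * text ('text') or the id of the saved text file ('fileId').
 */
function pdfToTextReturnKind(options) {
  options = options || {};
  // keepTextfile defaults to true when the caller does not set it.
  var keepTextfile = Object.prototype.hasOwnProperty.call(options, 'keepTextfile')
      ? options.keepTextfile
      : true;
  // Same condition as the final if/else in pdfToText().
  return (!keepTextfile || options.textResult) ? 'text' : 'fileId';
}
```

With no options the result kind is 'fileId'; with `{keepTextfile: false}` or `{textResult: true}` it is 'text'. That is why myFunction() gets the text back directly by passing only `{keepTextfile: false}`.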

This utility from Bruce McPherson helps with the conversion to DriveApp:

// From: http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else { 
      return current ? null : prev; 
    }
  },DriveApp.getRootFolder()); 
}
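
To check how the path walk behaves with leading slashes, trailing slashes, or missing folders, the same reduce logic can be run against an in-memory stand-in. This is only a sketch: `resolveFolderPath` takes the starting folder as a parameter instead of calling DriveApp, and `fakeFolder` is a hypothetical mock that imitates just the two folder methods the walk needs.

```javascript
/**
 * Same reduce-based walk as getDriveFolderFromPath(), but with the starting
 * folder passed in so it can be exercised without DriveApp.
 */
function resolveFolderPath(root, path) {
  return (path || '/').split('/').reduce(function (prev, current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    // Empty segments (leading, trailing, or doubled slashes) keep the
    // current folder; a named segment after a failed lookup stays null.
    return current ? null : prev;
  }, root);
}

/** Hypothetical in-memory stand-in for a Drive folder, for illustration only. */
function fakeFolder(name, children) {
  return {
    getName: function () { return name; },
    getFoldersByName: function (wanted) {
      var matches = (children || []).filter(function (child) {
        return child.getName() === wanted;
      });
      var i = 0;
      return {
        hasNext: function () { return i < matches.length; },
        next: function () { return matches[i++]; }
      };
    }
  };
}

var root = fakeFolder('root', [fakeFolder('reports', [fakeFolder('2014')])]);
var found = resolveFolderPath(root, '/reports/2014');    // the '2014' folder
var missing = resolveFolderPath(root, 'reports/nope/x'); // null
```

A path with a folder that does not exist resolves to null, so a caller of pdfToText() that passes `path` may want to confirm the folder exists first.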
