Google Apps scraping script to run regularly till all site's inner pages are extracted?


Question


I've written a scraping script that crawls any site (URL to be entered) page by page: it fetches a page, extracts its plain text (stripped HTML), collects the inner URLs it finds, and keeps going until all pages have been processed. The script works well, but Google Apps Script has a 6-minute execution limit, so for large sites it fails: it stops after 6 minutes and leaves no output in the Google Doc.

function onOpen() { 
    DocumentApp.getUi() // Or DocumentApp or FormApp.
      .createMenu('New scrape web docs')
      .addItem('Enter Url', 'showPrompt')
      .addToUi(); 
}

function showPrompt() { 
  var ui = DocumentApp.getUi();   
  var result = ui.prompt(
      'Scrape whole website into text!',
      'Please enter website url (with http(s)://):',
      ui.ButtonSet.OK_CANCEL); 

// Process the user's response.
  var button = result.getSelectedButton();
  var url = result.getResponseText();  
  var links=[];  
  var base_url = url; 

  if (button == ui.Button.OK) 
  {     
      // gather initial links 
      var inner_links_arr = scrapeAndPaste(url, 1); // first run and clear the document
      links = links.concat(inner_links_arr); // append an array to all the links
      var new_links=[]; // array for new links  
      var processed_urls =[url]; // processed links
      var link, current;

      while (links.length) 
      {  
         link = links.shift(); // get the most left link (inner url)
         processed_urls.push(link);
         current = base_url + link;  
         new_links = scrapeAndPaste(current, 0); // second and consecutive runs we do not clear up the document
         //ui.alert('Processed... ' + current                  + '\nReturned links: ' + new_links.join('\n') );
         // add new links into links array (stack) if appropriate
         for (var i in new_links){
           var item = new_links[i];
           if (links.indexOf(item) === -1 && processed_urls.indexOf(item) === -1)
               links.push(item);
         }    
     }
  } 
}

function scrapeAndPaste(url, clear) { 
  var text; 
  try {
    var html = UrlFetchApp.fetch(url).getContentText();
    // some html pre-processing 
    if (html.indexOf('</head>') !== -1 ){ 
       html = html.split('</head>')[1];
    }
    if (html.indexOf('</body>') !== -1 ){ // thus we split the body only
       html = html.split('</body>')[0] + '</body>';
    }       
   // fetch inner links
    var inner_links_arr= [];
    var linkRegExp = /href="(.*?)"/gi; // regex expression object 
    var match = linkRegExp.exec(html);
    while (match != null) {
      // matched text: match[0]
      if (match[1].indexOf('#') !== 0 
       && match[1].indexOf('http') !== 0 
       //&& match[1].indexOf('https://') !== 0  
       && match[1].indexOf('mailto:') !== 0 
       && match[1].indexOf('.pdf') === -1 ) {
         inner_links_arr.push(match[1]);
      }    
      // match start: match.index
      // capturing group n: match[n]
      match = linkRegExp.exec(html);
    }

    text = getTextFromHtml(html);
    outputText(url, text, clear); // output text into the current document with given url
    return inner_links_arr; //we return all inner links of this doc as array  

  } catch (e) { 
    MailApp.sendEmail(Session.getActiveUser().getEmail(), "Scrape error report at " 
      + Utilities.formatDate(new Date(), "GMT", "yyyy-MM-dd  HH:mm:ss"), 
      "\r\nMessage: " + e.message
      + "\r\nFile: " +  e.fileName+ '.gs' 
      + "\r\nWeb page under scrape: " + url
      + "\r\nLine: " +  e.lineNumber); 
    outputText(url, 'Scrape error for this page because of malformed HTML!', clear);
    return []; // return an empty array so the caller can keep processing the link queue
  } 
}

// Parse the cleaned-up HTML and return the concatenated text of all its nodes.
function getTextFromHtml(html) {
  return getTextFromNode(Xml.parse(html, true).getElement());
}
// Recursively walk the parsed tree: text nodes return their content,
// elements recurse into their children, everything else is ignored.
function getTextFromNode(x) {
  switch(x.toString()) {
    case 'XmlText': return x.toXmlString();
    case 'XmlElement': return x.getNodes().map(getTextFromNode).join(' ');
    default: return '';
  }
}

function outputText(url, text, clear){
  var body = DocumentApp.getActiveDocument().getBody();
  if (clear){ 
    body.clear(); 
  }
  else {
    body.appendHorizontalRule();       
  }
  var section = body.appendParagraph(' * ' + url);
  section.setHeading(DocumentApp.ParagraphHeading.HEADING2);
  body.appendParagraph(text); 
} 


My thought is to use an additional spreadsheet to save the scraped links and to restart the script automatically on a regular basis (using ScriptApp.newTrigger). But a few hindrances came up:

  1. When invoked by a trigger, the script only gets 30 seconds of run time.
  2. If it runs from a trigger, the user can't interact with the script! Should I use a spreadsheet cell to enter the initial base URL instead?
  3. How can I flush the scraped content into the Google Doc before the script is stopped by the run-time limit (30 seconds or 6 minutes)?
  4. How do I stop the trigger from invoking the script once all the site's links have been processed?


You might answer each question separately for convenience.

Answer


  1. AFAIK, you need to leave at least 6 minutes between triggers; each run will then get another 6 minutes.
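
For illustration, a minimal sketch of setting up such a time-driven trigger. The handler name resumeScrape and the 7-minute delay are assumptions for this sketch, not part of the original script:

function scheduleNextRun() {
  // Delete any earlier triggers for the same handler so they don't pile up.
  var triggers = ScriptApp.getProjectTriggers();
  for (var i = 0; i < triggers.length; i++) {
    if (triggers[i].getHandlerFunction() === 'resumeScrape') {
      ScriptApp.deleteTrigger(triggers[i]);
    }
  }
  // Run the continuation once, about 7 minutes from now
  // (a bit more than the 6-minute gap mentioned above).
  ScriptApp.newTrigger('resumeScrape')
    .timeBased()
    .after(7 * 60 * 1000)
    .create();
}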


You can gather all the URLs at once and save them in properties, then read them back from the properties when the trigger runs.
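
A minimal sketch of what saving and restoring the link queue could look like, assuming the crawl state is kept in script properties under the hypothetical keys pending_links and processed_urls:

function saveState(links, processed_urls) {
  var props = PropertiesService.getScriptProperties();
  props.setProperty('pending_links', JSON.stringify(links));
  props.setProperty('processed_urls', JSON.stringify(processed_urls));
}

function loadState() {
  var props = PropertiesService.getScriptProperties();
  return {
    links: JSON.parse(props.getProperty('pending_links') || '[]'),
    processed_urls: JSON.parse(props.getProperty('processed_urls') || '[]')
  };
}

showPrompt() would call saveState() just before it runs out of time, and the trigger handler would start from loadState().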


You can check the elapsed time regularly; knowing the script will only run for 6 minutes, once it reaches about 5 minutes, flush everything collected so far into the document and then set the trigger.
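
For example, a simple time-budget check, assuming the 5-minute cutoff suggested above:

var START_TIME = Date.now();        // global statements run at the start of every execution
var MAX_RUNTIME_MS = 5 * 60 * 1000; // stop around the 5-minute mark

function timeIsUp() {
  return Date.now() - START_TIME > MAX_RUNTIME_MS;
}

Inside the while loop in showPrompt() you would then check if (timeIsUp()), and if so save the remaining links, schedule the trigger and return.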


Save an object with the links that still need to be processed into properties; then, when the script is invoked by the trigger, it retrieves only the URLs that still need processing.
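
Putting the pieces together, a hedged sketch of what the trigger handler could look like. It reuses scrapeAndPaste() from the question and the hypothetical helpers above, and assumes the base URL was stored under a base_url property when the user first entered it:

function resumeScrape() {
  var props = PropertiesService.getScriptProperties();
  var base_url = props.getProperty('base_url');
  var state = loadState();
  var links = state.links;
  var processed_urls = state.processed_urls;

  while (links.length) {
    if (timeIsUp()) {               // out of time: save the queue and reschedule
      saveState(links, processed_urls);
      scheduleNextRun();
      return;
    }
    var link = links.shift();
    processed_urls.push(link);
    var new_links = scrapeAndPaste(base_url + link, 0) || [];
    for (var i = 0; i < new_links.length; i++) {
      if (links.indexOf(new_links[i]) === -1 &&
          processed_urls.indexOf(new_links[i]) === -1) {
        links.push(new_links[i]);
      }
    }
  }

  // Everything processed: clear the saved state and remove the trigger,
  // which also covers question 4 above.
  props.deleteProperty('pending_links');
  props.deleteProperty('processed_urls');
  var triggers = ScriptApp.getProjectTriggers();
  for (var j = 0; j < triggers.length; j++) {
    if (triggers[j].getHandlerFunction() === 'resumeScrape') {
      ScriptApp.deleteTrigger(triggers[j]);
    }
  }
}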


You probably won't be able to save the whole website in properties, since there is a 100 KB limit, but you could split every page into a different property; I don't know whether that would hit a limit as well.
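
If a single value turns out to be too large, one hypothetical way to split a page's text across several properties (the key scheme and chunk size are assumptions, not documented quotas):

function savePageText(pageKey, text) {
  var props = PropertiesService.getScriptProperties();
  var CHUNK_SIZE = 8000;            // characters per property value (assumed)
  var count = Math.ceil(text.length / CHUNK_SIZE);
  for (var i = 0; i < count; i++) {
    props.setProperty(pageKey + '_' + i, text.substr(i * CHUNK_SIZE, CHUNK_SIZE));
  }
  props.setProperty(pageKey + '_chunks', String(count));
}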


Another alternative is to make the fetch calls run asynchronously, with HtmlService or setTimeout. I haven't used setTimeout in GAS scripting, but it works great in HTML JavaScript.
