Google Apps scraping script to run regularly till all site's inner pages are extracted?


Question


I've written a scraping script that crawls any site (URL to be entered) page by page: it fetches a page, extracts its plain text (stripped HTML), collects the inner URLs it finds, and keeps going until all pages have been processed. The script works well, but Google Apps Script has a 6-minute execution limit, so for large sites it fails: it stops after 6 minutes and leaves no output in the Google Doc.

function onOpen() { 
    DocumentApp.getUi() // Or DocumentApp or FormApp.
      .createMenu('New scrape web docs')
      .addItem('Enter Url', 'showPrompt')
      .addToUi(); 
}

function showPrompt() { 
  var ui = DocumentApp.getUi();   
  var result = ui.prompt(
      'Scrape whole website into text!',
      'Please enter website url (with http(s)://):',
      ui.ButtonSet.OK_CANCEL); 

// Process the user's response.
  var button = result.getSelectedButton();
  var url = result.getResponseText();  
  var links=[];  
  var base_url = url; 

  if (button == ui.Button.OK) 
  {     
      // gather initial links 
      var inner_links_arr = scrapeAndPaste(url, 1); // first run and clear the document
      links = links.concat(inner_links_arr); // append an array to all the links
      var new_links=[]; // array for new links  
      var processed_urls =[url]; // processed links
      var link, current;

      while (links.length) 
      {  
         link = links.shift(); // get the most left link (inner url)
         processed_urls.push(link);
         current = base_url + link;  
         new_links = scrapeAndPaste(current, 0); // second and consecutive runs we do not clear up the document
         //ui.alert('Processed... ' + current                  + '\nReturned links: ' + new_links.join('\n') );
         // add new links into links array (stack) if appropriate
         for (var i in new_links){
           var item = new_links[i];
           if (links.indexOf(item) === -1 && processed_urls.indexOf(item) === -1)
               links.push(item);
         }    
     }
  } 
}

function scrapeAndPaste(url, clear) { 
  var text; 
  try {
    var html = UrlFetchApp.fetch(url).getContentText();
    // some html pre-processing 
    if (html.indexOf('</head>') !== -1 ){ 
       html = html.split('</head>')[1];
    }
    if (html.indexOf('</body>') !== -1 ){ // thus we split the body only
       html = html.split('</body>')[0] + '</body>';
    }       
   // fetch inner links
    var inner_links_arr= [];
    var linkRegExp = /href="(.*?)"/gi; // regex expression object 
    var match = linkRegExp.exec(html);
    while (match != null) {
      // matched text: match[0]
      if (match[1].indexOf('#') !== 0 
       && match[1].indexOf('http') !== 0 
       //&& match[1].indexOf('https://') !== 0  
       && match[1].indexOf('mailto:') !== 0 
       && match[1].indexOf('.pdf') === -1 ) {
         inner_links_arr.push(match[1]);
      }    
      // match start: match.index
      // capturing group n: match[n]
      match = linkRegExp.exec(html);
    }

    text = getTextFromHtml(html);
    outputText(url, text, clear); // output text into the current document with given url
    return inner_links_arr; //we return all inner links of this doc as array  

  } catch (e) { 
    MailApp.sendEmail(Session.getActiveUser().getEmail(), "Scrape error report at " 
      + Utilities.formatDate(new Date(), "GMT", "yyyy-MM-dd  HH:mm:ss"), 
      "\r\nMessage: " + e.message
      + "\r\nFile: " +  e.fileName+ '.gs' 
      + "\r\nWeb page under scrape: " + url
      + "\r\nLine: " +  e.lineNumber); 
    outputText(url, 'Scrape error for this page because of malformed HTML!', clear);
    return []; // return an empty array so the caller can keep processing the link queue
  } 
}

// Parse the cleaned-up HTML and return the concatenated text of all its nodes.
function getTextFromHtml(html) {
  return getTextFromNode(Xml.parse(html, true).getElement());
}
// Recursively walk the parsed tree: text nodes return their content,
// elements recurse into their children, everything else is ignored.
function getTextFromNode(x) {
  switch(x.toString()) {
    case 'XmlText': return x.toXmlString();
    case 'XmlElement': return x.getNodes().map(getTextFromNode).join(' ');
    default: return '';
  }
}

function outputText(url, text, clear){
  var body = DocumentApp.getActiveDocument().getBody();
  if (clear){ 
    body.clear(); 
  }
  else {
    body.appendHorizontalRule();       
  }
  var section = body.appendParagraph(' * ' + url);
  section.setHeading(DocumentApp.ParagraphHeading.HEADING2);
  body.appendParagraph(text); 
} 


My thought is to use an additional spreadsheet to save the scraped links and to restart the script automatically on a regular basis (using ScriptApp.newTrigger). But a few hindrances came up:

  1. When invoked by a trigger, the script only gets 30 seconds of run time.
  2. If it runs from a trigger, the user can't interact with the script! Should I use a spreadsheet cell to enter the initial base URL instead?
  3. How can I flush the scraped content into the Google Doc before the script is stopped by the run-time limit (30 seconds or 6 minutes)?
  4. How do I stop the trigger from invoking the script once all the site's links have been processed?


You might answer each question separately for convenience.

Answer


  1. AFAIK, you need to leave at least 6 minutes between triggers; each run will then get another 6 minutes.
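
For illustration, a minimal sketch of setting up such a time-driven trigger. The handler name resumeScrape and the 7-minute delay are assumptions for this sketch, not part of the original script:

function scheduleNextRun() {
  // Delete any earlier triggers for the same handler so they don't pile up.
  var triggers = ScriptApp.getProjectTriggers();
  for (var i = 0; i < triggers.length; i++) {
    if (triggers[i].getHandlerFunction() === 'resumeScrape') {
      ScriptApp.deleteTrigger(triggers[i]);
    }
  }
  // Run the continuation once, about 7 minutes from now
  // (a bit more than the 6-minute gap mentioned above).
  ScriptApp.newTrigger('resumeScrape')
    .timeBased()
    .after(7 * 60 * 1000)
    .create();
}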


You can gather all the URLs at once and save them in properties, then read them back from the properties when the trigger runs.
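
A minimal sketch of what saving and restoring the link queue could look like, assuming the crawl state is kept in script properties under the hypothetical keys pending_links and processed_urls:

function saveState(links, processed_urls) {
  var props = PropertiesService.getScriptProperties();
  props.setProperty('pending_links', JSON.stringify(links));
  props.setProperty('processed_urls', JSON.stringify(processed_urls));
}

function loadState() {
  var props = PropertiesService.getScriptProperties();
  return {
    links: JSON.parse(props.getProperty('pending_links') || '[]'),
    processed_urls: JSON.parse(props.getProperty('processed_urls') || '[]')
  };
}

showPrompt() would call saveState() just before it runs out of time, and the trigger handler would start from loadState().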


You can check the elapsed time regularly; knowing the script will only run for 6 minutes, once it reaches about 5 minutes, flush everything collected so far into the document and then set the trigger.
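
For example, a simple time-budget check, assuming the 5-minute cutoff suggested above:

var START_TIME = Date.now();        // global statements run at the start of every execution
var MAX_RUNTIME_MS = 5 * 60 * 1000; // stop around the 5-minute mark

function timeIsUp() {
  return Date.now() - START_TIME > MAX_RUNTIME_MS;
}

Inside the while loop in showPrompt() you would then check if (timeIsUp()), and if so save the remaining links, schedule the trigger and return.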


Save an object with the links that still need to be processed into properties; then, when the script is invoked by the trigger, it retrieves only the URLs that still need processing.
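
Putting the pieces together, a hedged sketch of what the trigger handler could look like. It reuses scrapeAndPaste() from the question and the hypothetical helpers above, and assumes the base URL was stored under a base_url property when the user first entered it:

function resumeScrape() {
  var props = PropertiesService.getScriptProperties();
  var base_url = props.getProperty('base_url');
  var state = loadState();
  var links = state.links;
  var processed_urls = state.processed_urls;

  while (links.length) {
    if (timeIsUp()) {               // out of time: save the queue and reschedule
      saveState(links, processed_urls);
      scheduleNextRun();
      return;
    }
    var link = links.shift();
    processed_urls.push(link);
    var new_links = scrapeAndPaste(base_url + link, 0) || [];
    for (var i = 0; i < new_links.length; i++) {
      if (links.indexOf(new_links[i]) === -1 &&
          processed_urls.indexOf(new_links[i]) === -1) {
        links.push(new_links[i]);
      }
    }
  }

  // Everything processed: clear the saved state and remove the trigger,
  // which also covers question 4 above.
  props.deleteProperty('pending_links');
  props.deleteProperty('processed_urls');
  var triggers = ScriptApp.getProjectTriggers();
  for (var j = 0; j < triggers.length; j++) {
    if (triggers[j].getHandlerFunction() === 'resumeScrape') {
      ScriptApp.deleteTrigger(triggers[j]);
    }
  }
}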


You probably won't be able to save the whole website in properties, since there is a 100 KB limit, but you could split every page into a different property; I don't know whether that would hit a limit as well.
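
If a single value turns out to be too large, one hypothetical way to split a page's text across several properties (the key scheme and chunk size are assumptions, not documented quotas):

function savePageText(pageKey, text) {
  var props = PropertiesService.getScriptProperties();
  var CHUNK_SIZE = 8000;            // characters per property value (assumed)
  var count = Math.ceil(text.length / CHUNK_SIZE);
  for (var i = 0; i < count; i++) {
    props.setProperty(pageKey + '_' + i, text.substr(i * CHUNK_SIZE, CHUNK_SIZE));
  }
  props.setProperty(pageKey + '_chunks', String(count));
}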


Another alternative is to make the fetch calls run asynchronously, with HtmlService or setTimeout. I haven't used setTimeout in GAS scripting, but it works great in HTML JavaScript.
