How to parse an HTML string in Google Apps Script without using XmlService?

Question

I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible, and I have seen some tutorials and threads about it.

The main idea is to use:

  var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText();
  var doc = XmlService.parse(html);

And then get and work with the elements. However, the method XmlService.parse() does not work for some pages. For example, if I try:

function test(){
    var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
    var parse = XmlService.parse(html);
}

I get the following error:

Error on line 225: The entity name must immediately follow the '&' in the entity reference. (line 3, file "")

I've tried to use string.replace() to eliminate the characters that apparently cause the error, but it does not work; all sorts of other errors appear. The following code, for example:

function test(){
    var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
    var regExp = new RegExp("&", "gi");
    html = html.replace(regExp,"");

    var parse = XmlService.parse(html);
}

gives me the following error:

Error on line 358: The content of elements must consist of well-formed character data or markup. (line 6, file "")

I believe this is a problem with the XmlService.parse() method.
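For context on the two errors above: XmlService is an XML parser, so it rejects markup that browsers tolerate. Deleting every `&` also destroys valid entities such as `&amp;`. The sketch below, in plain JavaScript, escapes only the ampersands that do not already start an entity. The name `escapeBareAmpersands` is hypothetical, and this step alone is unlikely to make a real-world page parse: unclosed tags, inline scripts, and HTML-only entities such as `&nbsp;` (which XML does not predefine) would still be rejected.

```javascript
// Hypothetical helper: escape "&" only when it does not already begin
// a named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) entity.
function escapeBareAmpersands(html) {
  return html.replace(
    /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g,
    '&amp;'
  );
}
```

In Apps Script, the input would be the string returned by UrlFetchApp.fetch(url).getContentText().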

I have read in the threads "Google App Script parse table from messed html" and "What is the best way to parse html in google apps script" that one can use a deprecated method called xml.parse(), which accepts a second parameter that allows parsing HTML. However, as I've mentioned, it is deprecated and I cannot find any documentation on it anywhere. xml.parse() seems to parse the string, but I have trouble working with the elements due to the lack of documentation. It is also not the safest long-term solution, because it could be deactivated at any time.

So, how do I parse this HTML in Google Apps Script?

I have also tried:

function test(){

    var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
    var htmlOutput = HtmlService.createHtmlOutput(html).getContent();

    var parse = XmlService.parse(htmlOutput);
}

But it does not work; I get this error:

Malformed HTML content:

I thought about using an open source library to parse the HTML, but I could not find any.

My ultimate goal is to get some information from a set of pages, such as the price, link, and name of each product. I've managed to do this using a series of regular expressions:

var ss = SpreadsheetApp.getActiveSpreadsheet();
  var linksSheet = ss.getSheetByName("Links");
  var resultadosSheet = ss.getSheetByName("Resultados");

function scrapyLoco(){

  var links = linksSheet.getRange(1, 1, linksSheet.getLastRow(), 1).getValues();
  var arrayGrandao = [];
  for (var row =  0, len = links.length; row < len; row++){
   var link = links[row];


   var arrayDeResultados = pegarAsCoisas(link[0]);
   Logger.log(arrayDeResultados);
   arrayGrandao.push(arrayDeResultados);
  }   


  resultadosSheet.getRange(2, 1, arrayGrandao.length, arrayGrandao[0].length).setValues(arrayGrandao);

}


function pegarAsCoisas(linkDoProduto) {
  var resultadoArray = [];

  var html = UrlFetchApp.fetch(linkDoProduto).getContentText();
  var regExp = new RegExp("<h1([^]*)h1>", "gi");
  var h1Html = regExp.exec(html);
  var h1Parse = XmlService.parse(h1Html[0]);
  var h1Output = h1Parse.getRootElement().getText();
  h1Output = h1Output.replace(/(\r\n|\n|\r|(^( )*))/gm,"");

  regExp = new RegExp("Ref.: ([^(])*", "gi");
  var codeHtml = regExp.exec(html);
  var codeOutput = codeHtml[0].replace("Ref.: ","").replace(" ","");

  regExp = new RegExp("margin-top: 5px; margin-bottom: 5px; padding: 5px; background-color: #699D15; color: #fff; text-align: center;([^]*)/div>", "gi");
  var descriptionHtml = regExp.exec(html);
  var regExp = new RegExp("<p([^]*)p>", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var regExp = new RegExp("^[^.]*", "gi");
  var descriptionHtml = regExp.exec(descriptionHtml);
  var descriptionOutput = descriptionHtml[0].replace("<p>","");
  descriptionOutput = descriptionOutput+".";

  regExp = new RegExp("ecom(.+?)Main.png", "gi");
  var imageHtml = regExp.exec(html);
  var comecoDaURL = "https://www.nespresso.com/";
  var imageOutput = comecoDaURL+imageHtml[0];

  var regExp = new RegExp("nes_l-float nes_big-price nes_big-price-with-out([^]*)p>", "gi");
  var precoHtml = regExp.exec(html);
  var regExp = new RegExp("[0-9]*,", "gi");
  precoHtml = regExp.exec(precoHtml);
  var precoOutput = "BRL "+precoHtml[0].replace(",","");

  resultadoArray = [codeOutput,h1Output,descriptionOutput,"Home & Garden > Kitchen & Dining > Kitchen Appliances > Coffee Makers & Espresso Machines",
                    "Máquina",linkDoProduto,imageOutput,"new","in stock",precoOutput,"","","","Nespresso",codeOutput];

  return resultadoArray;
}

But this is very time-consuming to program, very hard to change dynamically, and not very reliable.

I need a way to parse this HTML and easily access its elements. It is actually not an add-on, but a simple Google Apps Script.

Recommended answer

I have done this in vanilla JS. It is not real HTML parsing; it just tries to get some content out of a string fetched from a URL:

function getLKKBTC() {
  var url = 'https://www.lykke.com/exchange';
  var html = UrlFetchApp.fetch(url).getContentText();
  var searchstring = '<td class="ask_BTCLKK">';
  // indexOf matches the marker literally; search() would treat it as a regex.
  var index = html.indexOf(searchstring);
  if (index >= 0) {
    // The rate is read from the 6 characters right after the marker.
    var pos = index + searchstring.length;
    var rate = parseFloat(html.substring(pos, pos + 6));
    // Return the inverse of the quoted rate.
    return 1 / rate;
  }
  throw new Error("Failed to fetch/parse data from " + url);
}
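The same idea can be generalized so that the length of the value (the `pos + 6` above) need not be hard-coded. This is a minimal sketch, assuming the value sits between a known opening marker and the next closing marker. The helper `extractBetween` and the sample string are hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: return the trimmed text between the first occurrence
// of startMarker and the next occurrence of endMarker, or null if either
// marker is missing.
function extractBetween(html, startMarker, endMarker) {
  var start = html.indexOf(startMarker);
  if (start < 0) return null;
  start += startMarker.length;
  var end = html.indexOf(endMarker, start);
  if (end < 0) return null;
  return html.substring(start, end).trim();
}

// Usage with an inline sample; in Apps Script the string would come from
// UrlFetchApp.fetch(url).getContentText().
var sample = '<tr><td class="ask_BTCLKK"> 0.0125 </td></tr>';
var rate = parseFloat(extractBetween(sample, '<td class="ask_BTCLKK">', '</td>'));
```

Returning null instead of throwing lets the caller decide whether a missing field is an error or just an empty cell in the results sheet.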
