Google Spreadsheets ImportXML / XPath - Image broken output

Problem description

I am trying to scrape an image of my website via the ImportXML function in Google Sheets using XPath. The function seems to work, however the output seems to be broken. Instead of getting the URL of the image, I am receiving the following broken output:

data:image/svg+xml,%3Csvg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 520 390"%3E%3C/svg%3E

I am using the following XPath:

//div[contains(@class,'MuiGrid-root-59 jss358 MuiGrid-item-61')]//img/@src

I tried several things, but I am a bit stuck now. Does anyone know how to deal with (and solve) this encoding error?

Edit I: Here is also the link to the Google Sheet: example image broken in Google Sheets

Edit II: I've noticed that the image changes over time, so the above XPath no longer works.

Solution

When I looked at the HTML of https://next.ibood.com/nl/nl/electronics, it appears that the value you want to retrieve is created by JavaScript. So, unfortunately, the value cannot be retrieved directly by IMPORTXML with an XPath. I noticed, however, that the value you want is included in a JSON object embedded in the HTML. Unfortunately, that data is longer than 50,000 characters, so in this case it cannot be retrieved by IMPORTXML either.
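To see what IMPORTXML actually returned, the string from the question can be decoded. The following is a minimal sketch in plain JavaScript; the helper name `decodeSvgDataUri` is made up for illustration. It suggests that the output is not a corrupted URL but a percent-encoded, empty SVG, which is consistent with the static HTML holding only a placeholder until JavaScript fills in the real image.

```javascript
// The output from the question, joined onto one line.
const dataUri =
  'data:image/svg+xml,%3Csvg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 520 390"%3E%3C/svg%3E';

// Hypothetical helper: drop the "data:image/svg+xml," header
// and percent-decode the payload that follows the first comma.
function decodeSvgDataUri(uri) {
  const comma = uri.indexOf(",");
  return decodeURIComponent(uri.slice(comma + 1));
}

const svg = decodeSvgDataUri(dataUri);
console.log(svg);
// <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 520 390"></svg>
```

The decoded SVG has no child elements, so it carries only dimensions and no image content.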

So, in this answer, I would like to propose to achieve your goal using a custom function created by Google Apps Script.

Sample script:

Please copy and paste the following script into the script editor of the Spreadsheet. Then put =SAMPLE1("https://next.ibood.com/nl/nl/electronics") and =SAMPLE2("https://next.ibood.com/nl/nl/electronics") into cells. The values are then retrieved.

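// Returns the image URL of the main offer.
function SAMPLE1(url) {
  const html = UrlFetchApp.fetch(url).getContentText();
  // The page data is embedded as JSON in the __NEXT_DATA__ script tag.
  const str = html.match(/<script id="__NEXT_DATA__" type="application\/json">(.+?)<\/script>/);
  // match() returns null when the tag is not found, so check that first.
  if (str && str.length == 2) {
    const obj = JSON.parse(str[1].trim());
    const value = obj.props.pageProps.initialReduxState.slotItems.mainOffer.imageSrc;
    if (value) {
      // imageSrc is protocol-relative, so the scheme is prepended.
      return `https:${value}`;
    }
  }
  return "No value";
}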

// Returns the image URLs of all offers as an array.
function SAMPLE2(url) {
  const html = UrlFetchApp.fetch(url).getContentText();
  // The page data is embedded as JSON in the __NEXT_DATA__ script tag.
  const str = html.match(/<script id="__NEXT_DATA__" type="application\/json">(.+?)<\/script>/);
  // match() returns null when the tag is not found, so check that first.
  if (str && str.length == 2) {
    const obj = JSON.parse(str[1].trim());
    const value = obj.props.pageProps.initialReduxState.slotItems.offers.map(({imageSrc}) => `https:${imageSrc}`);
    if (value.length > 0) {
      return value;
    }
  }
  return "No value";
}

  • In these sample scripts, the HTML is retrieved first, then the JSON object is extracted from it, and finally the values are returned from the parsed JSON object.
  • When I looked at the HTML data, I noticed that there are 2 kinds of image sources. One is mainOffer (this value is obtained by SAMPLE1()). The other is offers (these values are obtained by SAMPLE2()). From your question, I thought that you might want the URL of mainOffer. But when I checked it in the browser, I noticed that the image shown on the site is sometimes different from that of mainOffer. I'm not sure of the reason for this, so I proposed 2 patterns.
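The extraction step described above can also be tried outside a Spreadsheet. The following is a rough sketch, not part of the original answer: the helper names `extractNextData` and `extractImageUrls`, the sample HTML, and the `img.example.com` URLs are all made up for illustration, and the JSON path is assumed to match the one used in SAMPLE1 and SAMPLE2. It takes an HTML string instead of calling UrlFetchApp, so the parsing can be checked on its own.

```javascript
// Hypothetical helper: pull the embedded JSON payload out of an HTML string.
// Returns null when the __NEXT_DATA__ script tag is not present.
function extractNextData(html) {
  const m = html.match(
    /<script id="__NEXT_DATA__" type="application\/json">([\s\S]+?)<\/script>/
  );
  return m ? JSON.parse(m[1].trim()) : null;
}

// Hypothetical helper: collect the main image URL and the per-offer image URLs,
// tolerating missing keys instead of throwing.
function extractImageUrls(html) {
  const data = extractNextData(html);
  const slot = data?.props?.pageProps?.initialReduxState?.slotItems;
  if (!slot) return { main: null, offers: [] };
  const toUrl = (src) => (src ? `https:${src}` : null);
  return {
    main: toUrl(slot.mainOffer?.imageSrc),
    offers: (slot.offers || []).map(({ imageSrc }) => toUrl(imageSrc)).filter(Boolean),
  };
}

// Made-up sample page that mimics the assumed structure.
const sampleHtml =
  '<html><body><script id="__NEXT_DATA__" type="application/json">' +
  JSON.stringify({
    props: { pageProps: { initialReduxState: { slotItems: {
      mainOffer: { imageSrc: "//img.example.com/main.jpg" },
      offers: [
        { imageSrc: "//img.example.com/a.jpg" },
        { imageSrc: "//img.example.com/b.jpg" },
      ],
    } } } },
  }) +
  "</script></body></html>";

const result = extractImageUrls(sampleHtml);
console.log(result.main);   // https://img.example.com/main.jpg
console.log(result.offers); // [ 'https://img.example.com/a.jpg', 'https://img.example.com/b.jpg' ]
```

Keeping the fetch and the parsing separate makes it easier to notice when the site changes its JSON layout, since the parser then returns empty values rather than raising an error inside the cell.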
