Google Spreadsheets ImportXML / XPath - Image broken output
Problem description
I am trying to scrape an image of my website via the ImportXML function in Google Sheets using XPath. The function seems to work, however the output seems to be broken. Instead of getting the URL of the image, I am receiving the following broken output:
data:image/svg+xml,%3Csvg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 520 390"%3E%3C/svg%3E
I am using the following XPath:
//div[contains(@class,'MuiGrid-root-59 jss358 MuiGrid-item-61')]//img/@src
I have tried several things, but I am a bit stuck now. Does anyone know how to deal with (and solve) this encoding error?
Edit I: Here is also the link to the Google Sheet: example image broken in Google Sheets
Edit II: I've noticed that the image changes, so the above XPath no longer works.
When I looked at the HTML of https://next.ibood.com/nl/nl/electronics, it seems that the value you want to retrieve is created by JavaScript. So, unfortunately, the value cannot be retrieved directly by IMPORTXML with an XPath. I noticed that the value you want is included in a JSON object in the HTML, but that data is larger than 50,000 characters, so in this case it cannot be retrieved by IMPORTXML either.
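To see what the "broken" output in the question actually contains, the data URI can be percent-decoded. The sketch below is only an illustration: `decodeDataUri` is a hypothetical helper name, and the input is the string from the question. The decoded body is an empty SVG element, which suggests (this is an assumption, not something confirmed from the site's source) that the `src` attribute in the static HTML is a placeholder that JavaScript later replaces with the real image URL.

```javascript
// The value returned by IMPORTXML in the question, copied verbatim.
const src = 'data:image/svg+xml,%3Csvg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 520 390"%3E%3C/svg%3E';

// Hypothetical helper: split off the "data:<mime>," prefix and
// percent-decode the remainder of the URI.
function decodeDataUri(uri) {
  const comma = uri.indexOf(",");
  return {
    mimeType: uri.slice("data:".length, comma),
    body: decodeURIComponent(uri.slice(comma + 1)),
  };
}

const decoded = decodeDataUri(src);
console.log(decoded.mimeType); // image/svg+xml
console.log(decoded.body);     // <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 520 390"></svg>
```

In other words, the output is not an encoding error: it is a valid, empty inline SVG rather than a link to an image file.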
So, in this answer, I would like to propose achieving your goal with a custom function created with Google Apps Script.
Sample script:
Please copy and paste the following scripts into the script editor of the Spreadsheet. Then put =SAMPLE1("https://next.ibood.com/nl/nl/electronics") and =SAMPLE2("https://next.ibood.com/nl/nl/electronics") into cells. The values are then retrieved.
// Returns the image URL of the main offer.
function SAMPLE1(url) {
  // Fetch the raw HTML of the page.
  const html = UrlFetchApp.fetch(url).getContentText();
  // Extract the JSON embedded in the __NEXT_DATA__ script tag.
  const str = html.match(/<script id="__NEXT_DATA__" type="application\/json">(.+?)<\/script>/);
  // match() returns null when nothing is found, so check that first.
  if (str && str.length == 2) {
    const obj = JSON.parse(str[1].trim());
    const value = obj.props.pageProps.initialReduxState.slotItems.mainOffer.imageSrc;
    if (value) {
      // The stored value is protocol-relative, so prepend the scheme.
      return `https:${value}`;
    }
  }
  return "No value";
}

// Returns the image URLs of all offers as an array.
function SAMPLE2(url) {
  const html = UrlFetchApp.fetch(url).getContentText();
  const str = html.match(/<script id="__NEXT_DATA__" type="application\/json">(.+?)<\/script>/);
  if (str && str.length == 2) {
    const obj = JSON.parse(str[1].trim());
    const value = obj.props.pageProps.initialReduxState.slotItems.offers.map(({imageSrc}) => `https:${imageSrc}`);
    if (value.length > 0) {
      return value;
    }
  }
  return "No value";
}
- In these sample scripts, the HTML is retrieved first, then the JSON object is extracted from it, and finally the values are returned from the parsed JSON object.
- When I looked at the HTML data, I noticed that there are 2 kinds of image sources. One is mainOffer (this value is obtained by SAMPLE1()). The other is offers (these values are obtained by SAMPLE2()). From your question, I thought that you might want the URL of mainOffer. But when I checked in the browser, I noticed that in some cases the image shown on the site differs from that of mainOffer. I'm not sure about the reason for this, so I proposed 2 patterns.
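The extraction steps described above can also be tried outside of Apps Script. The following is a minimal sketch that runs on a plain HTML string, so it needs no UrlFetchApp. The helper names (`parseNextData`, `dig`, `extractImageUrls`) and the sample HTML with its example.com paths are made up for illustration; only the JSON key path is taken from the scripts above. Unlike the scripts above, it returns empty results instead of throwing when a key is missing.

```javascript
// Hypothetical helper: pull the __NEXT_DATA__ JSON out of an HTML string.
// [\s\S] is used instead of . so that JSON spanning several lines still matches.
function parseNextData(html) {
  const m = html.match(/<script id="__NEXT_DATA__" type="application\/json">([\s\S]+?)<\/script>/);
  return m ? JSON.parse(m[1].trim()) : null;
}

// Hypothetical helper: walk a key path, yielding undefined if any step is missing.
function dig(obj, path) {
  return path.reduce((o, key) => (o != null ? o[key] : undefined), obj);
}

// Collect both kinds of image source (mainOffer and offers) in one pass.
function extractImageUrls(html) {
  const data = parseNextData(html);
  const slotItems = dig(data, ["props", "pageProps", "initialReduxState", "slotItems"]);
  if (!slotItems) return { main: null, offers: [] };
  // The stored values are protocol-relative, so prepend the scheme.
  const toUrl = (src) => (src ? `https:${src}` : null);
  return {
    main: toUrl(dig(slotItems, ["mainOffer", "imageSrc"])),
    offers: (slotItems.offers || []).map((o) => toUrl(o.imageSrc)).filter(Boolean),
  };
}

// Made-up sample page; the image paths are placeholders, not real ibood URLs.
const sampleHtml =
  '<html><body><script id="__NEXT_DATA__" type="application/json">' +
  JSON.stringify({props: {pageProps: {initialReduxState: {slotItems: {
    mainOffer: {imageSrc: "//example.com/main.jpg"},
    offers: [{imageSrc: "//example.com/a.jpg"}, {imageSrc: "//example.com/b.jpg"}],
  }}}}}) +
  "</script></body></html>";

console.log(extractImageUrls(sampleHtml));
```

Because the site's JSON layout may change at any time, returning an empty result on a missing key keeps a spreadsheet cell from showing a script error; whether that is preferable to a visible failure depends on how the sheet is used.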