Error 429 on scraping from Google search with Google Apps Script
Question
I want to get the number of indexed pages for certain domains. Therefore I want to use the "site:" parameter and extract the number of results from the search result page.
I tried it with Google Apps Script in a Google Sheet:
function sampleFormula_4() {
  // "site:" query for the domain whose indexed page count we want
  const url = "https://www.google.com/search?q=site%3Abenedikt-sahlmueller.de";
  try {
    const html = UrlFetchApp.fetch(url).getContentText();
    // Extract the text of the "About n results" counter
    return html.match(/<div id="result-stats">(.+?)nobr>/)[1].trim();
  } catch (e) {
    // Wait 5 seconds and retry once
    Utilities.sleep(5000);
    const html = UrlFetchApp.fetch(url).getContentText();
    return html.match(/<div id="result-stats">(.+?)nobr>/)[1].trim();
  }
}
Google Spreadsheets gives me error 429 - too many requests. I added a sleep of 5000 ms, but Google Search still returns error 429.
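As an aside (not part of the original question), a single fixed sleep rarely clears a 429. A common pattern is to retry with exponentially growing delays and to inspect the response code via muteHttpExceptions instead of relying on a thrown exception. A minimal sketch, with the delay schedule factored out as plain JavaScript (UrlFetchApp and Utilities are the Apps Script globals already used above; fetchWithBackoff is a hypothetical helper name):

```javascript
// Delay schedule for exponential backoff: baseMs, 2*baseMs, 4*baseMs, ...
function backoffDelays(maxRetries, baseMs) {
  const delays = [];
  for (let i = 0; i < maxRetries; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}

// Sketch of a fetch that retries on non-200 responses (runs only in Apps Script)
function fetchWithBackoff(url) {
  for (const delay of backoffDelays(4, 5000)) {
    // muteHttpExceptions lets us read the status code instead of catching a throw
    const resp = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
    if (resp.getResponseCode() === 200) {
      return resp.getContentText();
    }
    Utilities.sleep(delay);
  }
  throw new Error("Still rate-limited after retries");
}
```

This still won't make scraping Google Search reliable - it only degrades more gracefully when rate-limited.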
All I need is the number of pages for certain URLs in Google's search results. Maybe there is a better way - I can't use the search API for this, as those pages are not part of my GSC (Google Search Console).
Recommended answer
Most likely Google Search is considering requests coming from UrlFetch as automated traffic and hence blocking them. From the official docs:
Google considers automated traffic to include:
- Sending searches from a robot, computer program, automated service, or search scraper
The same behaviour happens when using tools like wget or curl, for example.
Using the Search API is recommended instead.
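If the goal is just a result count, the Custom Search JSON API exposes one without scraping. A hedged sketch, assuming you have created an API key and a Programmable Search Engine ID (cx) - both are placeholders here; the documented response field for the count is searchInformation.totalResults:

```javascript
// Build the Custom Search JSON API request URL for a "site:" query.
// apiKey and cx are credentials you create yourself (placeholders here).
function buildCseUrl(apiKey, cx, domain) {
  return "https://www.googleapis.com/customsearch/v1" +
    "?key=" + encodeURIComponent(apiKey) +
    "&cx=" + encodeURIComponent(cx) +
    "&q=" + encodeURIComponent("site:" + domain);
}

// Sketch of the Apps Script call (runs only in Apps Script, not shown tested)
function indexedPageCount(apiKey, cx, domain) {
  const resp = UrlFetchApp.fetch(buildCseUrl(apiKey, cx, domain));
  const data = JSON.parse(resp.getContentText());
  return Number(data.searchInformation.totalResults);
}
```

Note that the API applies its own daily quotas, and its totalResults estimate can differ from the count shown in the web UI.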