IMPORTHTML/IMPORTXML 突然返回“未找到 URL 处的资源" [英] IMPORTHTML/IMPORTXML suddenly returning 'resource at URL not found'

查看:23
本文介绍了IMPORTHTML/IMPORTXML 突然返回“未找到 URL 处的资源"的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 googlesheets 中的 importhtml 和 importxml 函数突然停止工作.我猜它与 URL 有关系,因为它已经工作了大约一个月(因为我设置了它)并且最近才停止(上周).不幸的是,现在有一半的工作表显示 #N/A.只是想知道是否有在这方面更先进的人可以分享他们对这个问题的见解.

My importhtml and importxml functions in googlesheets suddenly stopped working. I'm guessing it has something to do with the URL as it has been working for about a month (since I set it up) and has only recently stopped (last week). Now half the sheets is displaying #N/A unfortunately. Just wondering if someone more advanced in this area could share their insight into the problem.

IMPORTXML 示例:

网址:https://au.finance.yahoo.com/quote/AZJ.AX/profile?p=AZJ.AX

XML://*[@id='Col1-0-Profile-Proxy']/section/div[1]/div/div/p[2]/span[4]

XML: //*[@id='Col1-0-Profile-Proxy']/section/div[1]/div/div/p[2]/span[4]

函数返回未找到 URL 上的资源".

Function returning "Resource at URL not found".

这是一个示例文件的链接:https://docs.google.com/spreadsheets/d/1ml3TCXOMGYPCYP/1ml3TCXOMGYPCYpC编辑?usp=共享

Here is a link to an example file: https://docs.google.com/spreadsheets/d/1ml3cCHIEk4vTCn3GyymXGxyCpVBgMprMzGP7OYhKPC0/edit?usp=sharing

感谢任何花时间调查问题的人.谢谢!

Appreciate anyone taking the time to look into the problem. Thank you!

推荐答案

这看起来实际上可能是 Google 方面的问题:

This looks like it may in fact be a issue on Google's side:

https://issuetracker.google.com/175144626

还没有任何解决方案,但如果它影响到您或您想要更新,请在该问题上加星标.

There has not been any resolution, but go and star that issue if it affects you or you want updates.

我没有经常使用 Yahoo 站点,但也许它更改了格式,使其对 IMPORTXML 不可读.我还用 IMPORTHTML 对其进行了测试,结果相同.不过,该功能本身仍可用于其他网站.

I have not used the Yahoo site a lot, but maybe it has changed its format in a way that makes it unreadable for IMPORTXML. I also tested it with IMPORTHTML, with the same result. The function itself is still working with other sites though.

测试您提供的站点,禁用 JavaScript 以确保数据不是动态生成的(这使得 IMPORTHTML 无法读取数据),实际上有一个表:

Testing the site you gave, disabling JavaScript to ensure the data is not dynamically generated (which makes it unreadable for IMPORTHTML), there is in fact a table:

<table class="W(100%)" data-reactid="36">
   <thead data-reactid="37">
      <tr class="C($tertiaryColor) Fz(xs) BdB Bdc($seperatorColor)" data-reactid="38">
         <th class="Ta(start) Py(6px) Fw(n)" data-reactid="39"><span data-reactid="40">Name</span></th>
         <th class="Ta(start) Py(6px) Fw(n)" data-reactid="41"><span data-reactid="42">Title</span></th>
         <th class="Ta(end) Py(6px) Fw(n)" data-reactid="43"><span data-reactid="44">Pay</span></th>
         <th class="Ta(end) Py(6px) Fw(n)" data-reactid="45"><span data-reactid="46">Exercised</span></th>
         <th class="Ta(end) Py(6px) Fw(n)" data-reactid="47"><span data-reactid="48">Year born</span></th>
      </tr>
   </thead>
   <tbody data-reactid="49">
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="50">
         <td class="Ta(start)" data-reactid="51">
            <span class="" data-reactid="52">
               <!-- react-text: 53 -->Mr. Andrew T. Harding<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="54">
            <span class="" data-reactid="55">
               <!-- react-text: 56 -->MD, CEO &amp; Director<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="57">
            <span class="" data-reactid="58">
               <!-- react-text: 59 -->2.78M<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="60"><span class="" data-reactid="61"><span data-reactid="62">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="63">
            <span class="" data-reactid="64">
               <!-- react-text: 65 -->1967<!-- /react-text -->
            </span>
         </td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="66">
         <td class="Ta(start)" data-reactid="67">
            <span class="" data-reactid="68">
               <!-- react-text: 69 -->Mr. George  Lippiatt<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="70">
            <span class="" data-reactid="71">
               <!-- react-text: 72 -->CFO &amp; Group Exec. Strategy<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="73">
            <span class="" data-reactid="74">
               <!-- react-text: 75 -->481.25k<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="76"><span class="" data-reactid="77"><span data-reactid="78">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="79"><span class="" data-reactid="80"><span data-reactid="81">N/A</span></span></td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="82">
         <td class="Ta(start)" data-reactid="83">
            <span class="" data-reactid="84">
               <!-- react-text: 85 -->Mr. Ed  McKeiver<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="86">
            <span class="" data-reactid="87">
               <!-- react-text: 88 -->Group Exec. of Coal<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="89">
            <span class="" data-reactid="90">
               <!-- react-text: 91 -->861k<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="92"><span class="" data-reactid="93"><span data-reactid="94">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="95"><span class="" data-reactid="96"><span data-reactid="97">N/A</span></span></td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="98">
         <td class="Ta(start)" data-reactid="99">
            <span class="" data-reactid="100">
               <!-- react-text: 101 -->Mr. Michael G. Carter BEng, BBus, MAICD, AFAIM<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="102">
            <span class="" data-reactid="103">
               <!-- react-text: 104 -->Group Exec. of Technical Services &amp; Planning<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="105"><span class="" data-reactid="106"><span data-reactid="107">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="108"><span class="" data-reactid="109"><span data-reactid="110">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="111"><span class="" data-reactid="112"><span data-reactid="113">N/A</span></span></td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="114">
         <td class="Ta(start)" data-reactid="115">
            <span class="" data-reactid="116">
               <!-- react-text: 117 -->Mr. Christopher L. Vagg<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="118">
            <span class="" data-reactid="119">
               <!-- react-text: 120 -->Head of Investor Relations &amp; Group Treasurer<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="121"><span class="" data-reactid="122"><span data-reactid="123">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="124"><span class="" data-reactid="125"><span data-reactid="126">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="127"><span class="" data-reactid="128"><span data-reactid="129">N/A</span></span></td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="130">
         <td class="Ta(start)" data-reactid="131">
            <span class="" data-reactid="132">
               <!-- react-text: 133 -->Mark  Hairsine<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="134">
            <span class="" data-reactid="135">
               <!-- react-text: 136 -->Mang. of Media &amp; Communications<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="137"><span class="" data-reactid="138"><span data-reactid="139">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="140"><span class="" data-reactid="141"><span data-reactid="142">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="143"><span class="" data-reactid="144"><span data-reactid="145">N/A</span></span></td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="146">
         <td class="Ta(start)" data-reactid="147">
            <span class="" data-reactid="148">
               <!-- react-text: 149 -->Ms. Tina  Thomas<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="150">
            <span class="" data-reactid="151">
               <!-- react-text: 152 -->Group Exec. of Corp.<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="153"><span class="" data-reactid="154"><span data-reactid="155">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="156"><span class="" data-reactid="157"><span data-reactid="158">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="159"><span class="" data-reactid="160"><span data-reactid="161">N/A</span></span></td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="162">
         <td class="Ta(start)" data-reactid="163">
            <span class="" data-reactid="164">
               <!-- react-text: 165 -->James  Coe<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="166">
            <span class="" data-reactid="167">
               <!-- react-text: 168 -->Mang. of Market Intelligence &amp; Sustainability<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="169"><span class="" data-reactid="170"><span data-reactid="171">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="172"><span class="" data-reactid="173"><span data-reactid="174">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="175"><span class="" data-reactid="176"><span data-reactid="177">N/A</span></span></td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="178">
         <td class="Ta(start)" data-reactid="179">
            <span class="" data-reactid="180">
               <!-- react-text: 181 -->Mr. Dominic Dupont Smith<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="182">
            <span class="" data-reactid="183">
               <!-- react-text: 184 -->Company Sec.<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="185"><span class="" data-reactid="186"><span data-reactid="187">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="188"><span class="" data-reactid="189"><span data-reactid="190">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="191">
            <span class="" data-reactid="192">
               <!-- react-text: 193 -->1964<!-- /react-text -->
            </span>
         </td>
      </tr>
      <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="194">
         <td class="Ta(start)" data-reactid="195">
            <span class="" data-reactid="196">
               <!-- react-text: 197 -->Lowana  Riddiford<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(start) W(45%)" data-reactid="198">
            <span class="" data-reactid="199">
               <!-- react-text: 200 -->Warehouse Logistics Coordinator<!-- /react-text -->
            </span>
         </td>
         <td class="Ta(end)" data-reactid="201"><span class="" data-reactid="202"><span data-reactid="203">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="204"><span class="" data-reactid="205"><span data-reactid="206">N/A</span></span></td>
         <td class="Ta(end)" data-reactid="207"><span class="" data-reactid="208"><span data-reactid="209">N/A</span></span></td>
      </tr>
   </tbody>
</table>

每个 中都有一个 元素的事实不应该影响 IMPORTXML/IMPORTHTML,因为在其他站点中它可以拾取文本 标签内的内容没问题.

The fact that there is a <span> element within each <td> should not affect IMPORTXML / IMPORTHTML as in other sites it can pick up the text content within <a> tags no problem.

然而,对于 react 1964 的评论可能会导致电子表格函数得到迷茫.

However the comments for react <!-- react-text: 193 -->1964<!-- /react-text --> might be causing the Spreadsheet functions to get confused.

他们的 robots.txt 文件似乎也不包含任何线索:

Their robots.txt file doesn't seem to contain any clues either:

User-agent: *
Sitemap: https://au.finance.yahoo.com/sitemaps/finance-sitemap_index_AU_en-AU.xml.gz
Disallow: /r/
Disallow: /__rapidworker-1.2.js
Disallow: /__blank
Disallow: /_td_api
Disallow: /_remote

Google 的用户代理是 Mozilla/5.0(兼容;GoogleDocs;apps-spreadsheets;http://docs.google.com)

Google's user agent is Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)

所以我不能确定是什么网站导致它出错,但这很可能是雅虎方面的变化.也许他们对 Google 的 IP 地址进行了速率限制

So I can't say for sure what about the site makes it go wrong, but it is likely a change on the Yahoo side of things. Maybe they have rate limited Google's IP addresses

我相信您正在突破简单的电子表格功能的极限.这很好,不要误会我的意思,但是,使用真正不适合大量抓取的工具进行抓取并不是很可靠.考虑到网络抓取也不是 100% 可靠的信息来源,因为它不断受到抓取网站更改其 HTML 结构或让内容由 JavaScript 动态生成的影响.事情最终会破裂.

I believe you are pushing the limits of what is possible with humble Spreadsheet functions. Which is great, don't get me wrong, however, scraping with a tool that is really not meant for heavy scraping is not very reliable. Considering that web scraping is also not 100% reliable as a source of information as it is constantly subject to the scraped sites changing their HTML structure or leaving things to be dynamically generated by JavaScript. Things will eventually break.

使用 UrlFetchApp 在 Apps Script 中,您可以更灵活地控制如何抓取数据.

With UrlFetchApp within Apps Script you have a lot more control and flexibility over how to scrape your data.

尽管这对您来说是一项需要可靠性的严肃活动,但使用 API 是值得的.APIS 是设计供网站、工作表和数据库从中获取数据的.他们欢迎它.还有一些免费产品.

Though if this is a serious activity for you which requires dependability, it will pay off to use an API. APISs are designed for websites and sheets and databases to fetch data from them. They welcome it. There are some free offerings about too.

这篇关于IMPORTHTML/IMPORTXML 突然返回“未找到 URL 处的资源"的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆