Performing image scraping using YQL with the lowest possible resource usage, i.e. the fewest queries


Question

I am trying to build an image-scraping tool that lets the user scrape all the images contained in a given page using XPath, processes the scraped images to find which have an alt attribute and which don't, and returns the result as two separate JSON lists,

i.e. {alted:["<img ......>","<img ......>"],nonAlted:["<img ......>","<img ......>"]}

Now comes my problem: although I am able to scrape the page, retrieve all the images, and separate them into the alted and nonAlted categories, I can't put them in the response object!

To clarify the issue, here is the code I use in the execute block of my YQL table:

query = "select * from html where url='http://www.example.com/page-path' and xpath='//li'";
var result = y.query(query);

y.log(result.results..img.(@alt));

var querieselement = <urls/>; 
querieselement.query = result.results..img.(@alt);

response.object = querieselement;

So my question is: how can I set the response object to contain the processed list of images? Note that after running the query the result doesn't show any data, although the log shows the list. I hope someone can point me to the cause of that problem.

P.S. The reason I mentioned "resource usage" in the title is that I am aware I could perform two separate calls, one for each image category, but that means scraping the same page twice, which I think is inefficient.

P.S. I would also be glad if someone could help me understand the meaning of these two lines:

querieselement = <urls/>;
querieselement.query = result.results..img.(@alt);

Why "<urls/>" and why "querieselement.query"? I don't know what they are supposed to do, yet they seem to be doing a critical job, since changing them breaks the code.

Thanks.

Recommended answer

So my question is: how can I set the response object to contain the processed list of images?

Use a stylesheet rather than an XPath selector:

 select * from xslt where url="http://www.mysite.com/page-path" and stylesheet="http://www.mysite.com/page-path.xsl"

Define the stylesheet as follows:

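  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Images that carry an alt attribute: emit one push per alt value.
         The value is quoted so the emitted script is valid JavaScript;
         this assumes the value itself contains no double quotes. -->
    <xsl:template match="img[@alt]">
      <xsl:for-each select="@alt">
        <script>
          alt.push("<xsl:value-of select="."/>");
        </script>
      </xsl:for-each>
    </xsl:template>

    <!-- Images without an alt attribute: emit one push per src value -->
    <xsl:template match="img[not(@alt)]">
      <xsl:for-each select="@src">
        <script>
          noalt.push("<xsl:value-of select="."/>");
        </script>
      </xsl:for-each>
    </xsl:template>

  </xsl:stylesheet>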
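
The two templates amount to a single pass over the page's images that sorts each one into one of two lists. As a rough illustration of that idea outside YQL, here is a minimal sketch in plain JavaScript. The record shape ({src, alt}) and the function name partitionImages are assumptions made for this example and are not part of YQL or of the original code; the output keys alted and nonAlted follow the shape given in the question.

```javascript
// Hypothetical input: one record per <img> collected by a single scrape.
// A missing alt attribute is modelled as the `alt` key being absent.
function partitionImages(images) {
  const result = { alted: [], nonAlted: [] };
  for (const img of images) {
    // hasOwnProperty separates alt="" (present but empty) from no alt at all.
    const hasAlt = Object.prototype.hasOwnProperty.call(img, "alt");
    (hasAlt ? result.alted : result.nonAlted).push(img.src);
  }
  return result;
}

// Illustrative data only, not taken from a real page.
const sample = [
  { src: "/a.png", alt: "Logo" },
  { src: "/b.png" },
  { src: "/c.png", alt: "" },
];
const split = partitionImages(sample);
console.log(JSON.stringify(split));
```

Because both lists are filled in the same loop, the page only needs to be fetched and walked once, which is the point of avoiding one query per category.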
