Performing image scraping using YQL with the lowest possible resource usage, i.e. the fewest queries


Question

I am trying to build an image scraping tool that lets the user scrape all the images on a given page using XPath, process the scraped images to find which have an alt attribute and which don't, and return the result as JSON with two separate lists:

{alted:[","],nonAlted:[","]}

Now comes my problem: although I am able to scrape the page, retrieve all the images, and separate them into the alted and nonAlted categories, I can't put them in the response object!

To clarify my issue, here is the code I use in the execute block of my YQL table:

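// Fetch every <li> on the page in a single query.
var query = "select * from html where url='http://www.mysite.com/page-path' and xpath='//li'";
var result = y.query(query);

// The log shows the filtered list of images as expected.
y.log(result.results..img.(@alt));

var querieselement = <urls/>;
querieselement.query = result.results..img.(@alt);

// The response comes back without the image data.
response.object = querieselement;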

So my question is: how can I set the response object to contain the processed list of images? Note that after running the query the result doesn't show any data, although the log shows the list. I hope someone can point me to the cause of the problem.

P.S. The reason I mentioned "resource usage" in the title is that I am aware I could perform two separate calls, one for each image category, but that means scraping the same page twice, which I think is inefficient.

P.S. I would also be glad if someone could help me understand the meaning of these two lines:

querieselement = <urls/>;
querieselement.query = result.results..img.(@alt);

Why "<urls/>" and why "querieselement.query"? I don't know what they are supposed to do, yet they seem to be doing a critical job, since changing them breaks the code.

Thanks.

Recommended answer

"So my question is how can I set the response object to contain the processed list of the images"

Use a stylesheet rather than an XPath selector:

 select * from xslt where url="http://www.mysite.com/page-path" and stylesheet="http://www.mysite.com/page-path.xsl"

Define the stylesheet like this:

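  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Images with an alt attribute: emit one push per alt value. -->
    <xsl:template match="img[@alt]">
      <xsl:for-each select="@alt">
        <script>
          alt.push("<xsl:value-of select="."/>");
        </script>
      </xsl:for-each>
    </xsl:template>

    <!-- Images without an alt attribute: emit one push per src value. -->
    <xsl:template match="img[not(@alt)]">
      <xsl:for-each select="@src">
        <script>
          noalt.push("<xsl:value-of select="."/>");
        </script>
      </xsl:for-each>
    </xsl:template>

  </xsl:stylesheet>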
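
The underlying idea of the question (fetch the page once, then split the images into two lists) can also be sketched in plain JavaScript. This is only an illustration and not YQL or E4X code: the function name partitionImages and the record shape { src, alt } are assumptions made for the example, and it presumes the images have already been extracted from the page.

```javascript
// Hypothetical helper: splits image records into the two lists from the question.
// Each record is assumed to look like { src: "...", alt: "..." }, with `alt`
// left undefined when the attribute is missing from the <img> tag.
function partitionImages(images) {
  const result = { alted: [], nonAlted: [] };
  for (const img of images) {
    // An attribute that is present but empty (alt="") still counts as present.
    if (img.alt !== undefined && img.alt !== null) {
      result.alted.push(img.src);
    } else {
      result.nonAlted.push(img.src);
    }
  }
  return result;
}

// Made-up sample data, for illustration only.
const sample = [
  { src: "/a.png", alt: "Logo" },
  { src: "/b.png" },
  { src: "/c.png", alt: "" },
];

console.log(JSON.stringify(partitionImages(sample)));
// prints {"alted":["/a.png","/c.png"],"nonAlted":["/b.png"]}
```

Because both lists are filled in a single loop, the page only needs to be fetched and walked once, which is the resource saving the question asks about.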
