Nokogiri parsing multiple XML feeds at once and sorting by date


Problem Description

I am using Rails and Nokogiri to parse some XML feeds.

I have parsed one XML feed, and I want to parse multiple feeds and sort the items by date. They are Wordpress feeds so they have the same structure.

In my controller I have:

def index
  doc = Nokogiri::XML(open('http://somewordpressfeed'))
  @content = doc.xpath('//item').map do |i|
    {'title' => i.xpath('title').text, 'url' => i.xpath('link').text, 'date' => i.xpath('pubDate').text.to_datetime}
  end
end

In my view I have:

<ul>
  <% @content.each do |l| %>
    <li><a href="<%= l['url'] %>"><%= l['title'] %></a> ( <%= time_ago_in_words(l['date']) %> )</li>
  <% end %>
</ul> 

The code above works as it should. I tried to parse multiple feeds and got a 404 error:

  feeds = %w(wordpressfeed1, wordpressfeed2)
  docs = feeds.each { |d| Nokogiri::XML(open(d)) }

How do I parse multiple feeds and add them to a Hash like I do with one XML feed? I need to parse about fifty XML feeds at once on page load.

Recommended Answer

I'd write it all differently.

Try changing index to accept an array of URLs, then loop over them using map, concatenating the results to an array, which you return:

def index(*urls)
  urls.map do |u|
    doc = Nokogiri::XML(open(u))
    doc.xpath('//item').map do |i|
      {
        'title' => i.xpath('title').text,
        'url'   => i.xpath('link').text,
        'date'  => i.xpath('pubDate').text.to_datetime
      }
    end
  end.flatten # merge the per-feed arrays into a single array of item hashes
end

@content = index('url1', 'url2')
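Since each item hash carries a DateTime under 'date', sorting the merged array by date is a one-liner. A minimal sketch with made-up titles and dates standing in for parsed feed items:

```ruby
require 'date'

# Once index(*urls) returns one flat array of item hashes, sorting
# newest-first is a sort_by on the parsed 'date' values.
items = [
  { 'title' => 'Older post', 'date' => DateTime.parse('2012-01-01T10:00:00Z') },
  { 'title' => 'Newer post', 'date' => DateTime.parse('2012-03-15T10:00:00Z') }
]

sorted = items.sort_by { |item| item['date'] }.reverse
sorted.first['title'] # => "Newer post"
```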

It'd be more Ruby-like to use symbols instead of strings for your hash keys:

{
  :title => i.xpath('title').text,
  :url   => i.xpath('link').text,
  :date  => i.xpath('pubDate').text.to_datetime
} 

Also:

feeds = %w(wordpressfeed1, wordpressfeed2)
docs = feeds.each { |d| Nokogiri::XML(open(d)) }

each is the wrong iterator. You want map instead, which will return all the parsed DOMs, assigning them to docs.
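The difference is easy to demonstrate with plain strings standing in for the parsed documents:

```ruby
# each returns the receiver unchanged, while map returns the block's results.
feeds = %w(feed-a feed-b)

from_each = feeds.each { |f| f.upcase }
from_map  = feeds.map  { |f| f.upcase }

from_each # => ["feed-a", "feed-b"]
from_map  # => ["FEED-A", "FEED-B"]
```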

This won't fix the 404 error, which is a bad URL, and is a different problem. You're not defining your array correctly:

%w(wordpressfeed1, wordpressfeed2)

应该是:

%w(wordpressfeed1 wordpressfeed2)

或:

['wordpressfeed1', 'wordpressfeed2']
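The stray comma matters because %w splits on whitespace only, so the comma ends up inside the first string, producing a URL that doesn't exist:

```ruby
# %w splits on whitespace only; a comma becomes part of the word,
# yielding a URL with a trailing comma (hence the 404).
with_commas = %w(wordpressfeed1, wordpressfeed2)
without     = %w(wordpressfeed1 wordpressfeed2)

with_commas # => ["wordpressfeed1,", "wordpressfeed2"]
without     # => ["wordpressfeed1", "wordpressfeed2"]
```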

---

I was revisiting this page and noticed:

I need to parse about fifty XML feeds at once on page load.

This is absolutely the wrong way to handle grabbing data from other sites, especially fifty of them.

WordPress sites typically have a news (RSS or Atom) feed. There should be a parameter in the feed stating how often it's OK to refresh the page. Honor that interval and don't hit their page more often than that, especially when you are tying your load to an HTML page load or refresh.

There are many reasons why, but it breaks down to "just don't do it", lest you get banned. If nothing else, it would be trivial to mount a DoS attack against their sites through nothing more than page refreshes on yours, which is not being a good web developer on your part. Protect your own site first, and their sites benefit from that.

So, what do you do when you want to get fifty sites and have fast response and not beat up other sites? You cache the data in a database, and then read from that when your page is loaded or refreshed. And, in the background you have another task that fires off periodically to scan the other sites, while honoring their refresh rates.
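The cache-and-refresh idea can be sketched in memory. FeedCache here is a hypothetical class; in a real Rails app the cached items would live in the database and the fetcher would run from a scheduled background task rather than inside a request:

```ruby
# Serves cached items on every call; only re-fetches once the cache is stale.
class FeedCache
  def initialize(refresh_interval, &fetcher)
    @refresh_interval = refresh_interval # seconds between allowed fetches
    @fetcher = fetcher                   # block that actually hits the feed
    @items = nil
    @fetched_at = nil
  end

  # Page loads call this; the remote feed is hit only when stale.
  def items(now = Time.now)
    if @fetched_at.nil? || now - @fetched_at >= @refresh_interval
      @items = @fetcher.call
      @fetched_at = now
    end
    @items
  end
end

fetch_count = 0
cache = FeedCache.new(3600) { fetch_count += 1; [{ 'title' => 'cached item' }] }

cache.items # first call fetches the feed
cache.items # second call within the hour is served from the cache
fetch_count # => 1
```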

