How to detect if a page is an RSS or Atom feed


Question

I'm currently building a new online feed reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that it's not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.

The problem is that the way I'm currently detecting whether the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the cURL response and running it through simplexml_load_string; if it can't be parsed, I treat it as a website. Here is the code.

$xml = @simplexml_load_string( $site_found['content'] );

if( !$xml ) // this is a website, not a feed
{
    // handle website
}
else
{
    // parse feed
}

Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks it's a feed.

Any suggestions on a good way of telling a feed from a non-feed in PHP?

Recommended answer

I would sniff for the various unique identifiers those formats have:

Atom: Source

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

RSS 0.90: Source

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">

Netscape RSS 0.91:

<rss version="0.91">

And so on. (See the second source link for a full overview.)

As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively (RDF-based versions such as RSS 0.90 use <rdf:RDF> as the root element instead, as in the snippet above). Plus, you won't find those in a valid HTML document.
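
As a rough sketch of that root-element check: the helper name detect_feed_type and its return values are made up for illustration and are not from the original answer. It assumes the content parses as XML at all, and it only recognizes the Atom 1.0 namespace shown above.

```php
<?php
// Hypothetical helper: returns 'atom', 'rss', 'rdf', or null if no feed root is found.
function detect_feed_type(string $content): ?string
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return null; // not well-formed XML
    }

    $root = $xml->getName();             // root element name without prefix
    $namespaces = $xml->getNamespaces(); // namespaces used by the root element

    if ($root === 'feed' && in_array('http://www.w3.org/2005/Atom', $namespaces, true)) {
        return 'atom';
    }
    if ($root === 'rss') {
        return 'rss';
    }
    if ($root === 'RDF' && in_array('http://www.w3.org/1999/02/22-rdf-syntax-ns#', $namespaces, true)) {
        return 'rdf';
    }
    return null; // well-formed XML (for example XHTML), but not a known feed root
}
```

Comparing against false explicitly matters here, because a SimpleXMLElement can itself evaluate to false in a boolean context.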

You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (rather than a parser) is finally justified for once :)
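
A minimal sketch of such a regex pre-check might look like this. The function name looks_like_html and the 4096-byte window are arbitrary assumptions, and a feed that embeds unescaped HTML near its start could still trigger a false positive.

```php
<?php
// Hypothetical pre-check: does the start of the content look like an HTML page?
function looks_like_html(string $content): bool
{
    // Only inspect the beginning; the tags of interest normally appear early.
    $head = substr($content, 0, 4096);

    // Case-insensitive match for an HTML doctype or an opening <html>/<body> tag.
    return preg_match('/<!DOCTYPE\s+html\b|<html[\s>]|<body[\s>]/i', $head) === 1;
}
```

Because it never parses the document, this check also works on tag soup that would make an XML parser fail.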

If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.
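
Put together, the whole flow could be sketched roughly as follows. The name classify_content and the returned labels are hypothetical, and the root-name mapping is deliberately simplified (it does not check namespaces).

```php
<?php
// Hypothetical pipeline: returns 'html', 'atom', 'rss' or 'rdf'.
function classify_content(string $content): string
{
    // Step 1: cheap HTML test on the start of the document.
    if (preg_match('/<html[\s>]|<body[\s>]/i', substr($content, 0, 4096)) === 1) {
        return 'html';
    }

    // Step 2: try the XML parser, keeping libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return 'html'; // parser choked on the input: fall back to HTML handling
    }

    // Step 3: map the root element to a feed type; anything else is treated as HTML.
    $types = ['feed' => 'atom', 'rss' => 'rss', 'RDF' => 'rdf'];
    return $types[$xml->getName()] ?? 'html';
}
```

A caller would then branch on the result, running link auto-discovery for 'html' and the feed parser for the other values.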

What that looks like in the wild, that is, whether feed providers always conform to those rules, is a different question, but you should already be able to recognize a lot this way.
