How to detect if a page is an RSS or Atom feed
Question
I'm currently building a new online feed reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that it's not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.
The problem is that the way I'm currently detecting whether the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the cURL response and running it through simplexml_load_string; if it can't be parsed, I treat it as a website. Here is the code.
$xml = @simplexml_load_string( $site_found['content'] );
if( !$xml ) // this is a website, not a feed
{
// handle website
}
else
{
// parse feed
}
Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks it's a feed.
Any suggestions on a good way of detecting the difference between a feed and a non-feed in PHP?
Recommended answer
I would sniff for the various unique identifiers those formats have:
Atom (source):
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
RSS 0.90 (source):
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">
Netscape RSS 0.91:
<rss version="0.91">
etc. etc. (See the second source link for a full overview.)
As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus, you won't find those in a valid HTML document.
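One way to sketch that root-tag check is with a regular expression that looks at the first element in the response. The function name sniff_feed_type and its return values are made up for this illustration, and treating any rdf:RDF root as an RSS-style feed is an assumption (RDF documents are not always feeds).

```php
<?php
// Hypothetical helper (not from the original answer): guess the feed
// format from the name of the first element in the document.
function sniff_feed_type(string $content): ?string
{
    // Match the first "<" that starts an element name, so the XML
    // declaration (<?xml ...), comments and DOCTYPE are skipped.
    // Group 1 is an optional namespace prefix, group 2 the local name.
    if (!preg_match('/<([A-Za-z_][\w.\-]*:)?([A-Za-z_][\w.\-]*)/', $content, $m)) {
        return null;
    }

    switch (strtolower($m[2])) {
        case 'feed':
            return 'atom'; // <feed xmlns="http://www.w3.org/2005/Atom">
        case 'rss':
            return 'rss';  // <rss version="0.91">, <rss version="2.0">, ...
        case 'rdf':
            return 'rdf';  // <rdf:RDF ...>; assumed to be RSS 0.90 / 1.0
        default:
            return null;   // anything else, including <html>
    }
}
```

Because it never runs a parser, this check does not care whether the rest of the document is well-formed. The trade-off is that a comment placed before the root element and containing something that looks like a tag would fool it.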
You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)
If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.
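Put together, the flow described above might look roughly like the following. This is only a sketch: classify_response and its 'html' / 'feed' return values are invented for the example, and the set of accepted root names (feed, rss, RDF) is an assumption based on the formats listed earlier.

```php
<?php
// Hypothetical helper (not from the original answer): decide whether a
// response body should be handled as a feed or as an HTML page.
function classify_response(string $content): string
{
    // Step 1: cheap regex test for HTML markers; works on invalid markup.
    if (preg_match('/<(html|body)[\s>]/i', $content)) {
        return 'html';
    }

    // Step 2: try the XML parser, collecting errors internally instead
    // of silencing them with "@".
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Parser choked on the input: fall back to treating it as HTML.
    if ($xml === false) {
        return 'html';
    }

    // Step 3: only a known feed root element counts as a feed.
    // getName() returns the local name, so <rdf:RDF> yields "RDF".
    $root = strtolower($xml->getName());

    return in_array($root, ['feed', 'rss', 'rdf'], true) ? 'feed' : 'html';
}
```

With the question's variables, this would be called as classify_response($site_found['content']). One limitation to keep in mind: a feed that embeds an unescaped <html> or <body> tag (inside a CDATA section, for example) would be misclassified by the first step.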
What that looks like in the wild (whether feed providers always conform to those rules) is a different question, but you should already be able to recognize a lot this way.