PHP XMLReader, get the version and encoding

Question

I'm currently rewriting a PHP class that splits an XML file into smaller chunks, so that it uses XMLReader and XMLWriter instead of its current basic filesystem-and-regex approach.

However, I can't figure out how to get the version, encoding and standalone flag from the XML preamble (the XML declaration).

The start of my test XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE fakedoctype SYSTEM "fake_doc_type.dtd">

 <!--
 This is a comment, it's here to try and get the parser to break in some way
 --> 

<root attribute="value" otherattribute="othervalue">

I can open it fine with the reader and move through the document with read(), next() etc., but I can't seem to get at whatever is in <?xml ... ?>. The first thing I'm able to access is the fake DOCTYPE.

My test code is as follows:

$a = new XMLReader ();
var_dump ($a -> open ('/path/to/test/file.xml')); // true
var_dump ($a -> nodeType); // 0
var_dump ($a -> name); // ""
var_dump ($a -> readOuterXML ()); // ''
var_dump ($a -> read ()); // true
var_dump ($a -> nodeType); // 10
var_dump ($a -> readOuterXML ()); // <!DOCTYPE fakedoctype SYSTEM "fake_doc_type.dtd">

Of course I could always just assume XML 1.0, UTF-8 encoding and standalone="yes", but for the sake of correctness I'd much rather grab the values from my source feed and use them when generating the split files.

The documentation on XMLReader and XMLWriter seems to be very poor, so there's every chance I've just missed something in the docs. Does anyone know what to do in this case?

Recommended answer

As far as I know, even though XMLReader has the XMLReader::XML_DECLARATION constant, I have never seen that value in the XMLReader::$nodeType property while traversing a document with XMLReader::read().

It looks like the declaration gets skipped. I have also wondered why that is, and I have not yet found any flag or option to change this behavior.

For output, XMLReader always returns UTF-8 encoded strings, the same as the other libxml-based parts of PHP. So on that side everything is clear. But I assume that is not the part you're interested in; you want the concrete encoding string given in the file you open with XMLReader::open().
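A minimal sketch of the UTF-8 point, assuming the xmlreader extension is loaded. The sample document and variable names are made up for illustration: the input is declared as ISO-8859-1 and contains a single byte 0xE9 ("é"), and the text node read back is compared against the two-byte UTF-8 form.

```php
<?php
// Illustrative input: an ISO-8859-1 document with one non-ASCII byte (0xE9).
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<root>caf\xE9</root>";

$reader = new XMLReader();
$reader->XML($xml); // parse from a string instead of open() on a file

$text = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        $text = $reader->value; // value as handed back by the reader
        break;
    }
}
$reader->close();

// Compare against the UTF-8 encoding of the same text (0xC3 0xA9 for the last character).
var_dump($text === "caf\xC3\xA9");
```

The declared encoding only tells the parser how to decode the input; it is not exposed by the reader while traversing, which is why the declaration has to be inspected separately.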

Not specifically for XMLReader, I once created a utility class named XMLRecoder that can detect the encoding of an XML string based on the XML declaration and also based on the BOM. I think you should do both. This is one part where I think you still need regular expressions, but because the XML declaration must be the first thing in the document and has the form of a processing instruction (PI) with a strictly defined syntax, you should be able to peek at it.

Here are some related parts of the XMLRecoder code:

### excerpt from https://gist.github.com/hakre/5194634 

/**
 * pcre pattern to access EncodingDecl, see <http://www.w3.org/TR/REC-xml/#sec-prolog-dtd>
 */
const DECL_PATTERN = '(^<\?xml\s+version\s*=\s*(["\'])(1\.\d+)\1\s+encoding\s*=\s*(["\'])(((?!\3).)*)\3)';
const DECL_ENC_GROUP = 4;
const ENC_PATTERN = '(^[A-Za-z][A-Za-z0-9._-]*$)';

...

($result = preg_match(self::DECL_PATTERN, $buffer, $matches, PREG_OFFSET_CAPTURE))
    && $result = $matches[self::DECL_ENC_GROUP];

As this shows, the pattern only goes as far as the encoding, so it is not complete. However, for extracting the encoding (and, for your needs, the version), it should do the job. I have tested it against a large number (thousands) of random XML documents.
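Since the question also asks for the standalone flag, the peeking idea could be extended along these lines. This is only a sketch: XML_DECL_PATTERN and peekXmlDeclaration() are hypothetical names that are not part of XMLRecoder or XMLReader, and the pattern assumes the bytes handed in are ASCII-compatible (UTF-8, ISO-8859-x and similar), so it will not match UTF-16 or UTF-32 input without converting it first.

```php
<?php
// Hypothetical pattern (not taken from XMLRecoder): version is required,
// encoding and standalone are optional, in the order the XML spec prescribes.
const XML_DECL_PATTERN = '~^<\?xml'
    . '\s+version\s*=\s*(["\'])(?<version>1\.\d+)\1'
    . '(?:\s+encoding\s*=\s*(["\'])(?<encoding>[A-Za-z][A-Za-z0-9._-]*)\3)?'
    . '(?:\s+standalone\s*=\s*(["\'])(?<standalone>yes|no)\5)?'
    . '\s*\?>~';

/**
 * Hypothetical helper: peek at the XML declaration in the first bytes of a document.
 *
 * @param string $head the first bytes of the document (a few hundred are plenty)
 * @return array keys 'version', 'encoding', 'standalone'; null where not declared
 */
function peekXmlDeclaration(string $head): array
{
    $result = ['version' => null, 'encoding' => null, 'standalone' => null];

    // A UTF-8 BOM may precede the declaration; skip it before matching.
    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) {
        $head = substr($head, 3);
    }

    if (preg_match(XML_DECL_PATTERN, $head, $m, PREG_UNMATCHED_AS_NULL) === 1) {
        foreach (array_keys($result) as $key) {
            $result[$key] = $m[$key] ?? null;
        }
    }

    return $result;
}

// Usage with an in-memory string; for a file, the head could be read with
// file_get_contents($path, false, null, 0, 256) before opening it in XMLReader.
$head = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . "\n<root/>";
print_r(peekXmlDeclaration($head));
```

The values found this way can then be passed on to XMLWriter::startDocument($version, $encoding, $standalone) when writing each split file.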

Another part is the BOM detection:

### excerpt from https://gist.github.com/hakre/5194634 

const BOM_UTF_8 = "\xEF\xBB\xBF";
const BOM_UTF_32LE = "\xFF\xFE\x00\x00";
const BOM_UTF_16LE = "\xFF\xFE";
const BOM_UTF_32BE = "\x00\x00\xFE\xFF";
const BOM_UTF_16BE = "\xFE\xFF";

...

/**
 * @param string $string string (recommended length 4 characters/octets)
 * @param string $default (optional) if none detected what to return
 * @return string Encoding, if it can not be detected defaults $default (NULL)
 * @throws InvalidArgumentException
 */
public function detectEncodingViaBom($string, $default = NULL)
{
    $len = strlen($string);

    if ($len > 4) {
        $string = substr($string, 0, 4);
    } elseif ($len < 4) {
        throw new InvalidArgumentException(sprintf("Need at least four characters, %d given.", $len));
    }

    switch (true) {
        case $string === self::BOM_UTF_16BE . $string[2] . $string[3]:
            return "UTF-16BE";

        case $string === self::BOM_UTF_8 . $string[3]:
            return "UTF-8";

        case $string === self::BOM_UTF_32LE:
            return "UTF-32LE";

        case $string === self::BOM_UTF_16LE . $string[2] . $string[3]:
            return "UTF-16LE";

        case $string === self::BOM_UTF_32BE:
            return "UTF-32BE";
    }

    return $default;
}

I ran the BOM detection against the same set of XML documents, although not many of them had BOMs. As you can see, the detection order is optimized for the more common scenarios while taking care of the overlapping byte patterns between the different BOMs. Most documents I encountered have no BOM; you mainly need it to find out whether the document is UTF-32 encoded.
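The ordering concern can also be illustrated with a standalone function. The following is a sketch rather than the XMLRecoder code: detectBomEncoding() is a hypothetical name, it uses a lookup table instead of the switch, and unlike the method above it returns the default for inputs shorter than four bytes instead of throwing.

```php
<?php
/**
 * Hypothetical standalone variant of the BOM check.
 *
 * @param string      $head    the first bytes of the document
 * @param string|null $default returned when no BOM is found
 */
function detectBomEncoding(string $head, ?string $default = null): ?string
{
    // Four-byte BOMs come first: UTF-32LE starts with the same two bytes
    // (FF FE) as UTF-16LE, so testing UTF-16LE first would mask it.
    $boms = [
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
    ];

    foreach ($boms as $bom => $encoding) {
        if (strncmp($head, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    return $default;
}

// Usage: the first four bytes of a UTF-16LE document that starts with "<?"
$head = "\xFF\xFE<\x00";
var_dump(detectBomEncoding($head)); // string(8) "UTF-16LE"
```

A table ordered by BOM length trades the "most common first" ordering of the original for a rule that is easier to check by eye; for a four-byte comparison the difference in speed should be negligible.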

Hope this at least gives some insight.
