Speeding up XML schema validations of a batch of XML files against the same XML schema (XSD)


Problem description

I would like to speed up the process of validating a batch of XML files against the same single XML schema (XSD). The only restriction is that I am in a PHP environment.

My current problem is that the schema I want to validate against imports the fairly complex XHTML schema of 2755 lines (http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd). Even for very simple data this takes a long time (around 30 seconds per validation). As I have thousands of XML files in my batch, this doesn't scale well.

To validate the XML files I use both of these methods from the standard php-xml libraries:

  • DOMDocument::schemaValidate
  • DOMDocument::schemaValidateSource

My guess is that the PHP implementation fetches the XHTML schema via HTTP, builds some internal representation (possibly a DOMDocument), and throws it away once the validation is completed. I was hoping that some option for the XML libraries could change this behaviour so that something in this process is cached for reuse.

I've built a simple test setup that illustrates my problem:

test-schema.xsd

<xs:schema attributeFormDefault="unqualified"
    elementFormDefault="qualified"
    targetNamespace="http://myschema.example.com/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myschema="http://myschema.example.com/"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xs:import
        schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
        namespace="http://www.w3.org/1999/xhtml">
    </xs:import>
    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="MyHTMLElement">
                    <xs:complexType>
                        <xs:complexContent>
                            <xs:extension base="xhtml:Flow"></xs:extension>
                        </xs:complexContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

test-data.xml

<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://myschema.example.com/" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://myschema.example.com/ test-schema.xsd ">
  <MyHTMLElement>
    <xhtml:p>This is an XHTML paragraph!</xhtml:p>
  </MyHTMLElement>
</Root>

schematest.php

<?php
$data_dom = new DOMDocument();
$data_dom->load('test-data.xml');

// Multiple validations using the schemaValidate method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidate: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidate('test-schema.xsd')) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

// Loading schema into a string.
$schema_source = file_get_contents('test-schema.xsd');

// Multiple validations using the schemaValidateSource method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidateSource: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidateSource($schema_source)) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

Running this schematest.php file produces the following output:

schemaValidate: Attempt #1 returns Valid! in 30 seconds.
schemaValidate: Attempt #2 returns Valid! in 30 seconds.
schemaValidate: Attempt #3 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 32 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 30 seconds.

Any help and suggestions on how to solve this issue are very welcome!

Recommended answer

You can safely subtract 30 seconds from each timing value: that is overhead, not validation time.

Remote requests to the W3C servers are being delayed because most libraries do not cache the documents (even though the HTTP headers suggest doing so). But read for yourself:

The W3C servers are slow to return DTDs. Is the delay intentional?

Yes. Due to various software systems downloading DTDs from our site millions of times a day (despite the caching directives of our servers), we have started to serve DTDs and schema (DTD, XSD, ENT, MOD, etc.) from our site with an artificial delay. Our goals in doing so are to bring more attention to our ongoing issues with excessive DTD traffic, and to protect the stability and response time of the rest of our site. We recommend HTTP caching or catalog files to improve performance.

W3.org tries to keep the number of requests low, which is understandable. PHP's DOMDocument is based on libxml, and libxml allows setting an external entity loader. The whole Catalog support section is relevant in this case.

To solve the issue in question, set up a catalog.xml file:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
            uri="xhtml1-transitional.xsd"/>
    <system systemId="http://www.w3.org/2001/xml.xsd"
            uri="xml.xsd"/>
</catalog>

Save a copy of the two .xsd files next to the catalog, under the names given in that catalog file (relative paths as well as absolute file:///... URIs work if you prefer a different directory).
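If the list of schemas grows, the catalog can also be generated rather than written by hand. The following is only a sketch: buildCatalog() is a hypothetical helper (not part of PHP or libxml), and it simply emits one system entry per URL => local file pair, mirroring the catalog.xml shown above.

```php
<?php
// Sketch: build the contents of catalog.xml from a URL => local file mapping.
// buildCatalog() is a hypothetical helper, not part of PHP or libxml.
function buildCatalog(array $mapping): string
{
    $ns  = 'urn:oasis:names:tc:entity:xmlns:xml:catalog';
    $dom = new DOMDocument('1.0');
    $dom->formatOutput = true;

    // Root <catalog> element in the OASIS catalog namespace.
    $catalog = $dom->appendChild($dom->createElementNS($ns, 'catalog'));

    foreach ($mapping as $systemId => $uri) {
        // One <system> entry per remote URL, pointing at the local copy.
        $system = $dom->createElementNS($ns, 'system');
        $system->setAttribute('systemId', $systemId);
        $system->setAttribute('uri', $uri);
        $catalog->appendChild($system);
    }

    return $dom->saveXML();
}

$catalogXml = buildCatalog([
    'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd' => 'xhtml1-transitional.xsd',
    'http://www.w3.org/2001/xml.xsd'                          => 'xml.xsd',
]);

echo $catalogXml;
```

The returned string can be written next to the .xsd copies, for example with file_put_contents().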

Then ensure your system's environment variable XML_CATALOG_FILES is set to the filename of the catalog.xml file. Once everything is set up, the validation runs straight through:

schemaValidate: Attempt #1 returns Valid! in 0 seconds.
schemaValidate: Attempt #2 returns Valid! in 0 seconds.
schemaValidate: Attempt #3 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 0 seconds.
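Note that time() only has one-second resolution, so "0 seconds" hides whatever cost remains. A sketch of the same loop with a finer clock follows; the tiny inline schema and document are stand-ins chosen so the example runs without any files or network access, they are not the XHTML setup from the question.

```php
<?php
// Sketch: time repeated validations with sub-second resolution.
// The inline schema and document are illustrative stand-ins (assumptions).
$schema = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="Root" type="xs:string"/>
</xs:schema>
XSD;

$doc = new DOMDocument();
$doc->loadXML('<Root>hello</Root>');

$timings = [];
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = hrtime(true);                       // monotonic clock, nanoseconds (PHP >= 7.3)
    $valid = $doc->schemaValidateSource($schema);
    $elapsedMs = (hrtime(true) - $start) / 1e6;  // nanoseconds to milliseconds
    $timings[] = $elapsedMs;

    printf(
        "schemaValidateSource: Attempt #%d returns %s in %.3f ms.\n",
        $attempt,
        $valid ? 'Valid!' : 'Invalid!',
        $elapsedMs
    );
}
```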

If it still takes long, that is just a sign that the environment variable does not point to the right location. I have covered the variable, along with some edge cases such as filenames containing spaces, in a blog post.
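The space edge case can be sketched as follows. libxml2 reads XML_CATALOG_FILES as a space-separated list of catalogs, so a path containing a space has to be percent-encoded. catalogFilesValue() is a hypothetical helper that assumes POSIX-style absolute paths, and the example path is made up.

```php
<?php
// Sketch: turn local catalog paths into a value for XML_CATALOG_FILES.
// catalogFilesValue() is a hypothetical helper; it assumes POSIX-style absolute paths.
function catalogFilesValue(array $paths): string
{
    $uris = [];
    foreach ($paths as $path) {
        // Percent-encode every path segment so that a space cannot split the list.
        $segments = array_map('rawurlencode', explode('/', $path));
        $uris[]   = 'file://' . implode('/', $segments);
    }

    // Multiple catalogs are separated by a single space.
    return implode(' ', $uris);
}

$value = catalogFilesValue(['/srv/my project/catalog.xml']);
echo "XML_CATALOG_FILES=$value\n";
// prints: XML_CATALOG_FILES=file:///srv/my%20project/catalog.xml
```

Whether calling putenv() from inside the script is early enough depends on when libxml loads its catalogs, so exporting the variable before PHP starts is the safer assumption.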

Alternatively, it is possible to create a simple external entity loader callback function that uses a URL => file mapping for the local file system, in the form of an array:

$mapping = [
     'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd'
         => 'schema/xhtml1-transitional.xsd',

     'http://www.w3.org/2001/xml.xsd'                          
         => 'schema/xml.xsd',
];

As this shows, I've placed verbatim copies of these two XSD files in a subdirectory called schema. The next step is to use libxml_set_external_entity_loader to activate the callback function with the mapping. Files that already exist on disk are preferred and loaded directly. If the routine encounters a non-file that has no mapping, a RuntimeException is thrown with a detailed message:

libxml_set_external_entity_loader(
    function ($public, $system, $context) use ($mapping) {

        if (is_file($system)) {
            return $system;
        }

        if (isset($mapping[$system])) {
            return __DIR__ . '/' . $mapping[$system];
        }

        $message = sprintf(
            "Failed to load external entity: Public: %s; System: %s; Context: %s",
            var_export($public, 1), var_export($system, 1),
            strtr(var_export($context, 1), [" (\n  " => '(', "\n " => '', "\n" => ''])
        );

        throw new RuntimeException($message);
    }
);

With this external entity loader set, the delay for the remote requests is gone.
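To see the mechanism in isolation, here is a self-contained sketch that needs no network access. The URL http://schemas.example.invalid/imported.xsd, the namespace urn:example:imported and the temporary file names are all made up for illustration. The loader is a stripped-down variant of the one above: it records every requested system ID and returns null for anything unmapped, so unmapped resources fail to load instead of being fetched.

```php
<?php
// Sketch: resolve a remote schemaLocation to a local copy via an entity loader.
// The URL, namespace and file names are made up for illustration.
$dir = sys_get_temp_dir() . '/xsd-demo-' . getmypid();
@mkdir($dir);

// Local copy of the "remote" schema.
$importedXsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:imported">
    <xs:simpleType name="Word">
        <xs:restriction base="xs:string">
            <xs:maxLength value="10"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>
XSD;
file_put_contents("$dir/imported.xsd", $importedXsd);

// Main schema that imports the other one through an http:// URL.
$mainXsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:imp="urn:example:imported">
    <xs:import namespace="urn:example:imported"
               schemaLocation="http://schemas.example.invalid/imported.xsd"/>
    <xs:element name="Root" type="imp:Word"/>
</xs:schema>
XSD;

$mapping = [
    'http://schemas.example.invalid/imported.xsd' => "$dir/imported.xsd",
];

$requested = [];
libxml_set_external_entity_loader(
    function ($public, $system, $context) use ($mapping, &$requested) {
        $requested[] = $system;            // record what libxml asked for
        return $mapping[$system] ?? null;  // null: unmapped resources fail to load
    }
);

$doc = new DOMDocument();
$doc->loadXML('<Root>short</Root>');
$valid = $doc->schemaValidateSource($mainXsd);

echo $valid ? "Valid!\n" : "Invalid!\n";
echo "Requested: " . implode(', ', $requested) . "\n";

// Restore the default loader and clean up the temporary files.
libxml_set_external_entity_loader(null);
unlink("$dir/imported.xsd");
rmdir($dir);
```

Logging the requested system IDs like this is also a cheap way to find out which URLs a schema pulls in, and therefore which entries the mapping needs.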

And that's it. See Gist. Take care: this external entity loader was written for loading the XML file to validate from disk and for "resolving" the XSD URIs to local filenames. Other kinds of operations (e.g. DTD-based validation) might need some code changes or extensions. The XML catalog is preferable; it also works with other tools.
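Once the schemas resolve locally, the batch itself is a plain loop. The sketch below uses inline strings as stand-ins for the real files and schema (an assumption made so that it runs as-is) and collects libxml's errors per document instead of letting them surface as PHP warnings.

```php
<?php
// Sketch: validate a batch of documents against one schema and report errors.
// The inline schema and documents are stand-ins (assumptions) for real files.
$schema = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="Root" type="xs:integer"/>
</xs:schema>
XSD;

$batch = [
    'good.xml' => '<Root>42</Root>',
    'bad.xml'  => '<Root>not a number</Root>',
];

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$results = [];
foreach ($batch as $name => $xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $results[$name] = $doc->schemaValidateSource($schema);

    // Report and then reset the errors gathered for this document.
    foreach (libxml_get_errors() as $error) {
        printf("%s: line %d: %s\n", $name, $error->line, trim($error->message));
    }
    libxml_clear_errors();

    printf("%s: %s\n", $name, $results[$name] ? 'Valid!' : 'Invalid!');
}
```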
