查询非 XML 兼容的结构化数据 [英] Querying Non-XML compliant structured data

查看:34
本文介绍了查询非 XML 兼容的结构化数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

作为一名数据分析师,我经常处理包含某些专有格式的结构化数据的文件,并且无法进行正常的 XML 解析.

例如,我有一个大约一百个文档的档案,所有文档都以此开头:

我在下面包含了文档的一个节略示例,如果您对克隆感到冒犯,请不要阅读.

无论如何,有没有办法在没有 DTD、命名空间或 URI 或我需要的任何东西的情况下查询它?我可以使用 SQL Server 2012+ 或 xquery,或者,我不知道,php 或 vba.

<document synfileid="MCIESS0044"><galedata><project><项目名称><title>反对观点资源中心</title></项目名称></project></galedata><title>克隆</title></doc.head><doc.body><para>克隆是植物或动物的相同拷贝,由单一生物体的遗传物质产生.1996 年,英国科学家创造了一只名为多莉的绵羊,这是第一个成功克隆成年哺乳动物的动物.从那以后,科学家们成功地克隆了其他动物,如山羊、老鼠、猪和兔子.人们开始怀疑人类是否会成为下一个.是否应该允许人类克隆以及在什么条件下进行克隆的问题引发了许多具有挑战性的科学、法律和伦理问题&#x2014;包括作为人类意味着什么.</para><head n="1">科学背景</head><para>数千年来,人们一直在克隆植物.一些植物产生的后代没有来自其他生物体的任何遗传物质.在这些情况下,克隆只需切割植物的茎、根或叶的碎片,然后种植插条.剪裁将长成与原件相同的副本.许多常见的水果、蔬菜和观赏植物以这种方式由具有特别理想特性的亲本植物产生.<para>&#91;很多被排除的文本&#93;也许最令人困惑的问题是:克隆人如何看待自己的地位?作为复制品,他们是否会缺乏作为人类状况一部分的独特感?到目前为止,这些问题还没有答案&#x2014;也许永远不会.然而,关于克隆动物和人类的争论肯定会继续下去.存在创建克隆的技术.社会将如何使用这项技术?</para></doc.body></文档>

解决方案

您的 SGML 输入数据几乎 XML,事实上,与完整的 SGML 不同,XML 总是需要一个系统外部 DTD 子集的标识符(文件名或 URL),而不仅仅是公共标识符,例如 -//Gale Research//DTD Document V2.0//EN.所以 DOCTYPE 声明必须在 XML 中采用这种形式

我在其中添加了 "test.dtd" 作为外部子集的系统标识符/文件名.当然,现在必须存在 test.dtd 文件.在工作目录中创建一个空的 test.dtd 文件就足够了,或者 test.dtd 可以为您手头的标记包含一些有意义的声明,例如:

<!ELEMENT galedata ANY><!ELEMENT 项目 ANY><!ELEMENT projectname ANY><!ELEMENT 标题 ANY><!ELEMENT doc.head ANY><!ELEMENT doc.body ANY><!ELEMENT para ANY><!ELEMENT head ANY><!ATTLIST 文档 synfileid CDATA #IMPLIED><!ATTLIST head n NUMBER #IMPLIED>

但正如您所发现的,您也可以通过删除 DOCTYPE 行来使 XML 工具满意.

现在,如果您想在不进行手动编辑的情况下将文件处理为 XML,您可以使用 SGML 将文件预处理为兼容的 XML,然后针对生成的 XML 使用 XML 查询工具.

要使用 OpenSP/OpenJade SGML 包(可以通过例如 sudo apt-get install opensp 在 Ubuntu 上安装),您需要放置一个 catalog目录中的文件,包含以下行,告诉 SGML 将公共标识符 -//Gale Research//DTD Document V2.0//EN 解析为 test.dtd:

PUBLIC "-//Gale Research//DTD Document V2.0//EN" "test.dtd"

您可以从我上面给出的 XML 版本中编辑 test.dtd 以包含 标签遗漏指示符,默认情况下经典 SGML 需要这些指示符.但是,您的测试数据使用的另一个功能在 OpenSP 工具中默认未启用,即支持十六进制字符实体引用,例如示例数据中的 &#x2014;.因此,无论如何,我们将需要为您的数据使用自定义 SGML 声明(一种有点过时的纯文本,指定您的文档正在使用的 SGML 功能).因此,在此过程中,我们还将在 SGML 声明中声明 FEATURES MINIMIZE OMITTAG NO,这使得 SGML 接受没有标记遗漏指示符的元素声明,因此我们 必须从 XML 版本更改 test.dtd.

我们可以将 SGML 声明直接放入文档本身,但要避免手动编辑.我们将对 SGML 声明使用目录解析.如果您将以下行添加到您的 catalog 文件

SGMLDECL "xml10-sgmldecl.dcl"

然后 OpenSP 将使用存储在 xml10-sgmldecl.dcl 中的任何内容作为 SGML 声明.我们将要使用的实际 SGML 是 XML 1.0 的官方 SGML 声明(它已经具有我们想要的所有功能).您不需要详细了解 SGML 声明的含义.只需将下面附加的文本粘贴到名为 xml10-sgmldecl.dcl 的文件中;如果您对详细信息感兴趣,请参阅我在 http://sgmljs.net 上的描述/docs/sgmlrefman.html#sgml-declaration.

现在您可以调用

osx <你的文件>

从您的 SGML 生成 XML,没有错误.

<!SGML "ISO 8879:1986 (WWW)"-- XML 1.0 的 SGML 声明 ---  从:ISO 8879:1986 修订版 Web SGML 改编附件 (TC2) 的最终文本ISO/IEC JTC1/SC34 N0029:1998-12-06附件 L.2(资料性):用于 XML 的 SGML 声明为适应验证所做的更改用VALID:"注明——字符集BASESET "ISO 注册号 177//CHARSETISO/IEC 10646-1:1993 UCS-4 与实施3级//ESC 2/5 2/15 4/6"DESCSET0 9 未使用9 2 911 2 未使用13 1 1314 18 未使用32 95 32127 1 未使用128 32 未使用160 55136 16055296 2048 未使用——代理——57344 8190 5734465534 2 未使用 -- FFFE 和 FFFF --65536 1048576 65536CAPACITY NONE -- 容量在 XML 中不受限制 --范围文件句法顺查无BASESET "ISO 注册号 177//CHARSETISO/IEC 10646-1:1993 UCS-4 与实施3级//ESC 2/5 2/15 4/6"DESCSET0 1114112 0功能RE 13RS 10空间 32选项卡 SEPCHAR 9命名LCNMSTRT""UCNMSTRT""名称STRT58 95 192-214 216-246 248-305 308-318 321-328330-382 384-451 461-496 500-501 506-535 592-680699-705 902 904-906 908 910-929 931-974 976-982986 988 990 992 994-1011 1025-1036 1038-11031105-1116 1118-1153 1168-1220 1223-12241227-1228 1232-1259 1262-1269 1272-12731329-1366 1369 1377-1414 1488-1514 1520-15221569-1594 1601-1610 1649-1719 1722-17261728-1742 1744-1747 1749 1765-1766 2309-23612365 2392-2401 2437-2444 2447-2448 2451-24722474-2480 2482 2486-2489 2524-2525 2527-25292544-2545 2565-2570 2575-2576 2579-26002602-2608 2610-2611 2613-2614 2616-26172649-2652 2654 2674-2676 2693-2699 27012703-2705 2707-2728 2730-2736 2738-27392741-2745 2749 2784 2821-2828 2831-28322835-2856 2858-2864 2866-2867 2870-2873 28772908-2909 2911-2913 2949-2954 2958-29602962-2965 2969-2970 2972​​ 2974-2975 2979-29802984-2986 2990-2997 2999-3001 3077-30843086-3088 3090-3112 3114-3123 3125-31293168-3169 3205-3212 3214-3216 3218-32403242-3251 3253-3257 3294 3296-3297 3333-33403342-3344 3346-3368 3370-3385 3424-34253585-3630 3632 3634-3635 3648-3653 3713-37143716 3719-3720 3722 3725 3732-3735 3737-37433745-3747 3749 3751 3754-3755 3757-3758 37603762-3763 3773 3776-3780 3904-3911 3913-39454256-4293 4304-4342 4352 4354-4355 4357-43594361 4363-4364 4366-4370 4412 4414 4416 44284430 4432 4436-4437 4441 4447-4449 4451 44534455 4457 4461-4462 4466-4467 4469 4510 45204523 4526-4527 4535-4536 4538 4540-4546 45874592 4601 7680-7835 7840-7929 7936-79577960-7965 7968-8005 8008-8013 8016-8023 80258027 8029 8031-8061 8064-8116 8118-8124 81268130-8132 8134-8140 8144-8147 8150-81558160-8172 8178-8180 8182-8188 8486 8490-84918494 8576-8578 12295 12321-12329 12353-1243612449-12538 12549-12588 19968-40869 44032-55203LCNMCHAR ""UCNMCHAR ""名称字符45-46 183 720-721 768-837 864-865 903 1155-11581425-1441 1443-1465 1467-1469 1471 1473-14741476 1600 1611-1618 1632-1641 1648 1750-17641767-1768 1770-1773 1776-1785 2305-2307 23642366-2381 2385-2388 2402-2403 2406-24152433-2435 2492 2494-2500 2503-2504 2507-25092519 2530-2531 2534-2543 2562 2620 2622-26262631-2632 2635-2637 2662-2673 2689-2691 27482750-2757 2759-2761 2763-2765 2790-27992817-2819 2876 2878-2883 2887-2888 2891-28932902-2903 2918-2927 2946-2947 3006-30103014-3016 3018-3021 3031 3047-3055 3073-30753134-3140 3142-3144 3146-3149 3157-31583174-3183 3202-3203 3262-3268 3270-32723274-3277 3285-3286 3302-3311 3330-33313390-3395 3398-3400 3402-3405 3415 3430-34393633 3636-3642 3654-3662 3664-3673 37613764-3769 3771-3772 3782 3784-3789 3792-38013864-3865 3872-3881 3893 3895 3897 3902-39033953-3972 3974-3979 3984-3989 3991 3993-40134017-4023 4025 8400-8412 8417 12293 12330-1233512337-12341 12441-12442 12445-12446 12540-12542名牌一般编号实体编号德利姆一般 SGMLREFHCRO "&#38;#x"-- & 后跟#x"(不带引号)--巢/"网络>"PIC?>"简写 无姓名SGMLREF数量NONE -- 数量在 XML 中不受限制 --实体放大器" 38lt" 60GT"62引用" 34使徒" 39特征最小化数据标签号省略号排名短标签开始标签空号未公开号NETENABL IMMEDNET尾标空号未公开号属性默认是省略名称值编号EMPTYNRM 是隐含定义ATTLIST 是文件编号元素是实体编号注释 是关联简单不隐式否明确否其他不同意子文件编号正式编号骨灰盒编号KEEPRSRE 是有效性确认实体参考任何整体是无SEEALSO "ISO 8879//NOTATION 可扩展标记语言 (XML)1.0//EN">

As a data analyst, I am constantly running across files with structured data that are in some proprietary format and resist normal XML parsing.

For example, I have an archive of about a hundred documents that all begin with this:

<!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN">

I have included an abridged example of the document below, don't read it if you're offended by cloning.

At any rate, is there a way to query this without having DTD or namespace or URI or whatever it is I need? I'm ok using SQL Server 2012+ or xquery or, I dunno, php or vba.

<!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN">
<document synfileid="MCIESS0044">
<galedata><project>
<projectname>
<title>Opposing Viewpoints Resource Center</title>
</projectname>
</project></galedata>
<doc.head>
<title>Cloning</title>
</doc.head>
<doc.body>

<para>A clone is an identical copy of a plant or animal, produced from the genetic material of a single organism. In 1996 scientists in Britain created a sheep named Dolly, the first successful clone of an adult mammal. Since then, scientists have successfully cloned other animals, such as goats, mice, pigs, and rabbits. People began wondering if human beings would be next. The question of whether human cloning should be allowed, and under what conditions, raises a number of challenging scientific, legal, and ethical issues&#x2014;including what it means to be human.</para>

<head n="1">Scientific Background</head>
<para>People have been cloning plants for thousands of years. Some plants produce offspring without any genetic material from another organism. In these cases, cloning simply requires cutting pieces of the stems, roots, or leaves of the plants and then planting the cuttings. The cuttings will grow into identical copies of the originals. Many common fruits, vegetables, and ornamental plants are produced in this way from parent plants with especially desirable characteristics.</para>

<para>&#91;lots of excluded text&#93; Perhaps the most perplexing question of all: How would clones feel about their status? As a copy, would they lack the sense of uniqueness that is part of the human condition? As yet, such questions have no answers&#x2014;perhaps they never will. The debate about cloning, both animal and human, however, will certainly continue. The technology exists to create clones. How will society use this technology?</para>

</doc.body>
</document>

解决方案

Your SGML input data is almost XML, up to the fact that, unlike full SGML, XML always requires a system identifier (a file name or URL) for the external DTD subset, and not just a public identifier such as -//Gale Research//DTD Document V2.0//EN. So the DOCTYPE declaration must take this form in XML

<!DOCTYPE DOCUMENT
  PUBLIC "-//Gale Research//DTD Document V2.0//EN"
         "test.dtd">

where I've added "test.dtd" as system identifier/file name of the external subset. Of course, now a test.dtd file must exist. It is sufficient that an empty test.dtd file is created in the working directory, or test.dtd could contain some meaningful declarations for your markup at hand such as the following:

<!ELEMENT document ANY>
<!ELEMENT galedata ANY>
<!ELEMENT project ANY>
<!ELEMENT projectname ANY>
<!ELEMENT title ANY>
<!ELEMENT doc.head ANY>
<!ELEMENT doc.body ANY>
<!ELEMENT para ANY>
<!ELEMENT head ANY>
<!ATTLIST document synfileid CDATA #IMPLIED>
<!ATTLIST head n NUMBER #IMPLIED>

But as you've found out, you can also make XML tools happy by just removing the DOCTYPE line.

Now if you wanted to process your file(s) into XML without manual editing, you could use SGML to pre-process the file(s) into compliant XML, and then use XML query tools against the produced XML.

To do so using the OpenSP/OpenJade SGML package (can be installed on Ubuntu by eg. sudo apt-get install opensp), you'd place a catalog file in the directory, containing the following line telling SGML to resolve the public identifier -//Gale Research//DTD Document V2.0//EN to test.dtd:

PUBLIC "-//Gale Research//DTD Document V2.0//EN" "test.dtd"

You could edit test.dtd from the XML version I gave above to contain tag omission indicators which would be required for classic SGML by default. But there's another feature your test data is using which isn't enabled by default in OpenSP tools, namely support for hexadecimal character entity references such as &#x2014; in your example data. Hence, we're going to need to use a custom SGML declaration (a somewhat archaic piece of plain text specifying SGML features your document is using) for your data anyway. So while we're at it, we're also going to declare FEATURES MINIMIZE OMITTAG NO in the the SGML declaration, which makes SGML accept element declarations without tag omission indicators so we don't have to change test.dtd from the XML version.

We could place the SGML declaration right into the document itself, but to avoid manual editing. we're going to use catalog resolution for the SGML declaration. If you add the following line to your catalog file

SGMLDECL "xml10-sgmldecl.dcl"

then OpenSP will use whatever is stored in xml10-sgmldecl.dcl as SGML declaration. The actual SGML we're going to use is the official SGML declaration for XML 1.0 (which has all the features we want already). You don't need to understand the meaning of an SGML declaration in detail. Just paste the text attached below into a file named xml10-sgmldecl.dcl; if you're interested in the details, see my description at http://sgmljs.net/docs/sgmlrefman.html#sgml-declaration.

Now you'll be able to invoke

osx <your-file>

to produce XML from your SGML without errors.

<!SGML "ISO 8879:1986 (WWW)"

 -- SGML Declaration for XML 1.0 --

 -- from: 
    Final text of revised Web SGML Adaptations Annex (TC2) to ISO 8879:1986
    ISO/IEC JTC1/SC34 N0029: 1998-12-06
    Annex L.2 (informative): SGML Declaration for XML

    changes made to accommodate validation are noted with 'VALID:'
 --

 CHARSET
     BASESET "ISO Registration Number 177//CHARSET
             ISO/IEC 10646-1:1993 UCS-4 with implementation
             level 3//ESC 2/5 2/15 4/6"
     DESCSET
             0        9  UNUSED
             9        2       9
            11        2  UNUSED
            13        1      13
            14       18  UNUSED
            32       95      32
           127        1  UNUSED
           128       32  UNUSED
           160    55136     160
         55296     2048  UNUSED  -- surrogates --
         57344     8190   57344
         65534        2  UNUSED  -- FFFE and FFFF --
         65536  1048576   65536

 CAPACITY NONE  -- Capacities are not restricted in XML --

 SCOPE DOCUMENT

 SYNTAX
     SHUNCHAR NONE
     BASESET "ISO Registration Number 177//CHARSET
             ISO/IEC 10646-1:1993 UCS-4 with implementation
             level 3//ESC 2/5 2/15 4/6"
     DESCSET
         0 1114112 0
     FUNCTION
         RE    13
         RS    10
         SPACE 32
         TAB   SEPCHAR 9
     NAMING
         LCNMSTRT ""
         UCNMSTRT ""
         NAMESTRT
             58 95 192-214 216-246 248-305 308-318 321-328
             330-382 384-451 461-496 500-501 506-535 592-680
             699-705 902 904-906 908 910-929 931-974 976-982
             986 988 990 992 994-1011 1025-1036 1038-1103
             1105-1116 1118-1153 1168-1220 1223-1224
             1227-1228 1232-1259 1262-1269 1272-1273
             1329-1366 1369 1377-1414 1488-1514 1520-1522
             1569-1594 1601-1610 1649-1719 1722-1726
             1728-1742 1744-1747 1749 1765-1766 2309-2361
             2365 2392-2401 2437-2444 2447-2448 2451-2472
             2474-2480 2482 2486-2489 2524-2525 2527-2529
             2544-2545 2565-2570 2575-2576 2579-2600
             2602-2608 2610-2611 2613-2614 2616-2617
             2649-2652 2654 2674-2676 2693-2699 2701
             2703-2705 2707-2728 2730-2736 2738-2739
             2741-2745 2749 2784 2821-2828 2831-2832
             2835-2856 2858-2864 2866-2867 2870-2873 2877
             2908-2909 2911-2913 2949-2954 2958-2960
             2962-2965 2969-2970 2972 2974-2975 2979-2980
             2984-2986 2990-2997 2999-3001 3077-3084
             3086-3088 3090-3112 3114-3123 3125-3129
             3168-3169 3205-3212 3214-3216 3218-3240
             3242-3251 3253-3257 3294 3296-3297 3333-3340
             3342-3344 3346-3368 3370-3385 3424-3425
             3585-3630 3632 3634-3635 3648-3653 3713-3714
             3716 3719-3720 3722 3725 3732-3735 3737-3743
             3745-3747 3749 3751 3754-3755 3757-3758 3760
             3762-3763 3773 3776-3780 3904-3911 3913-3945
             4256-4293 4304-4342 4352 4354-4355 4357-4359
             4361 4363-4364 4366-4370 4412 4414 4416 4428
             4430 4432 4436-4437 4441 4447-4449 4451 4453
             4455 4457 4461-4462 4466-4467 4469 4510 4520
             4523 4526-4527 4535-4536 4538 4540-4546 4587
             4592 4601 7680-7835 7840-7929 7936-7957
             7960-7965 7968-8005 8008-8013 8016-8023 8025
             8027 8029 8031-8061 8064-8116 8118-8124 8126
             8130-8132 8134-8140 8144-8147 8150-8155
             8160-8172 8178-8180 8182-8188 8486 8490-8491
             8494 8576-8578 12295 12321-12329 12353-12436
             12449-12538 12549-12588 19968-40869 44032-55203

         LCNMCHAR ""
         UCNMCHAR ""
         NAMECHAR
             45-46 183 720-721 768-837 864-865 903 1155-1158
             1425-1441 1443-1465 1467-1469 1471 1473-1474
             1476 1600 1611-1618 1632-1641 1648 1750-1764
             1767-1768 1770-1773 1776-1785 2305-2307 2364
             2366-2381 2385-2388 2402-2403 2406-2415
             2433-2435 2492 2494-2500 2503-2504 2507-2509
             2519 2530-2531 2534-2543 2562 2620 2622-2626
             2631-2632 2635-2637 2662-2673 2689-2691 2748
             2750-2757 2759-2761 2763-2765 2790-2799
             2817-2819 2876 2878-2883 2887-2888 2891-2893
             2902-2903 2918-2927 2946-2947 3006-3010
             3014-3016 3018-3021 3031 3047-3055 3073-3075
             3134-3140 3142-3144 3146-3149 3157-3158
             3174-3183 3202-3203 3262-3268 3270-3272
             3274-3277 3285-3286 3302-3311 3330-3331
             3390-3395 3398-3400 3402-3405 3415 3430-3439
             3633 3636-3642 3654-3662 3664-3673 3761
             3764-3769 3771-3772 3782 3784-3789 3792-3801
             3864-3865 3872-3881 3893 3895 3897 3902-3903
             3953-3972 3974-3979 3984-3989 3991 3993-4013
             4017-4023 4025 8400-8412 8417 12293 12330-12335
             12337-12341 12441-12442 12445-12446 12540-12542

         NAMECASE
             GENERAL NO
             ENTITY  NO
     DELIM
         GENERAL  SGMLREF
         HCRO     "&#38;#x"
                  -- Ampersand followed by "#x" (without quotes) --
         NESTC    "/"
         NET      ">"
         PIC      "?>"
         SHORTREF NONE

     NAMES
         SGMLREF

     QUANTITY
         NONE -- Quantities are not restricted in XML --

     ENTITIES
         "amp"  38
         "lt"   60
         "gt"   62
         "quot" 34
         "apos" 39

 FEATURES
     MINIMIZE
         DATATAG NO
         OMITTAG NO
         RANK    NO
         SHORTTAG
             STARTTAG
                 EMPTY    NO
                 UNCLOSED NO
                 NETENABL IMMEDNET
             ENDTAG
                 EMPTY    NO
                 UNCLOSED NO
             ATTRIB
                 DEFAULT  YES
                 OMITNAME NO
                 VALUE    NO
         EMPTYNRM  YES
         IMPLYDEF
             ATTLIST  YES
             DOCTYPE  NO
             ELEMENT  YES
             ENTITY   NO
             NOTATION YES
     LINK
         SIMPLE   NO
         IMPLICIT NO
         EXPLICIT NO
     OTHER
         CONCUR   NO
         SUBDOC   NO
         FORMAL   NO
         URN      NO
         KEEPRSRE YES
         VALIDITY NOASSERT
         ENTITIES
             REF      ANY
             INTEGRAL YES

 APPINFO NONE

 SEEALSO "ISO 8879//NOTATION Extensible Markup Language (XML)
 1.0//EN">

这篇关于查询非 XML 兼容的结构化数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆