Extract XML from XML embedded in HTML

Problem description

I'm trying to get the XML presented here http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml but it's a bit tricky because they don't give any support for it. The purpose is to get the XML into PHP in order to go through the XML.

Can someone give a hint?

Solution

It's not really true that XML presented inside an HTML page stops being XML.

What you're looking for is the textContent property of the nodes in a DOMDocument. It gives you only the text from that HTML, just as it is displayed "as text" in the browser.

So all you need to do is load the HTML document into a DOMDocument. Because the page contains markup errors, libxml's internal error handling is switched on while loading, so the errors are collected instead of being raised as warnings:
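To make the textContent point concrete, here is a minimal sketch. The one-line HTML string is a made-up stand-in for a page that displays XML by escaping it as entities; it is not the real NCBI markup.

```php
<?php
// Hypothetical sample: a page that shows XML by escaping it as HTML entities.
$html = '<html><body><div class="xml-tag">&lt;ID&gt;859730&lt;/ID&gt;</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Take the first (and only) div of the sample.
$div = $doc->getElementsByTagName('div')->item(0);

// textContent returns the decoded text, i.e. the XML as a browser would display it.
$text = $div->textContent;
echo $text, "\n"; // prints <ID>859730</ID>
```

The entities are decoded by the HTML parser, so what comes back is plain XML source that can be fed to an XML parser.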

$url = 'http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml';

$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);
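If you want to see what the parser complained about rather than just silencing it, the collected errors can be read back. This is only a sketch on a made-up malformed string (the variable names are illustrative); which errors libxml reports, if any, depends on the libxml2 version installed.

```php
<?php
// Hypothetical malformed markup: an unknown tag and a stray closing tag.
$html = '<html><body><foo>text</foo></p><div>ok</div></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
// The function returns the previous setting, so it can be restored later.
$previous = libxml_use_internal_errors(true);

$doc    = new DOMDocument();
$loaded = $doc->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the internal error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Restoring the previous setting instead of hard-coding FALSE keeps the snippet from changing behaviour for code that had already enabled internal errors.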

The next part implies specific knowledge about the page being scraped. In your case the XML is the text content of all div tags with the class attribute "xml-tag" that *follow* the element with the id "ResultView".

These tags can be easily fetched with an XPath query; their text content is then stored in an array:

$xpath  = new DOMXPath($doc);
$nodes  = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');
$buffer = array();
foreach ($nodes as $node) {
    $buffer[] = $node->textContent;
}
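The same query can be tried offline on a small sample. The markup below is an assumption about the page layout, reduced to what the query relies on: an element with the id "ResultView" followed by sibling div elements whose class is exactly "xml-tag".

```php
<?php
// Hypothetical page layout: one xml-tag div before ResultView, three after it.
$html = '<html><body>'
      . '<div class="xml-tag">&lt;IGNORED/&gt;</div>'
      . '<div id="ResultView"></div>'
      . '<div class="xml-tag">&lt;DB&gt;</div>'
      . '<div class="xml-tag">biosample</div>'
      . '<div class="xml-tag">&lt;/DB&gt;</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// following-sibling:: only selects siblings that come after ResultView,
// so the first div of the sample is skipped.
$nodes = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');

$buffer = array();
foreach ($nodes as $node) {
    $buffer[] = $node->textContent;
}

echo implode('', $buffer), "\n"; // prints <DB>biosample</DB>
```

Note that @class="xml-tag" is an exact string comparison, so a div carrying additional classes would not match this query.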

All that is left now is to create a new DOMDocument, load that XML buffer into it, apply some formatting, and output the result:

$new = new DOMDocument();
$new->preserveWhiteSpace = FALSE;
$new->formatOutput = TRUE;
$new->loadXML(implode('', $buffer));
$new->save('php://output');

These roughly 20 lines of code then produce the following output:

<?xml version="1.0"?>
<EXPERIMENT_PACKAGE>
  <EXPERIMENT alias="SC_EXP_7229_8#56" center_name="SC" accession="ERX086768">
    <IDENTIFIERS>
      <PRIMARY_ID>ERX086768</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
    </IDENTIFIERS>
    <TITLE/>
    <STUDY_REF accession="ERP000913" refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" refcenter="SC">
      <IDENTIFIERS>
        <PRIMARY_ID>ERP000913</PRIMARY_ID>
        <SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
      </IDENTIFIERS>
    </STUDY_REF>
    <DESIGN>
      <DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION>
      <SAMPLE_DESCRIPTOR accession="ERS074283" refname="MR223754-sc-2011-11-18T11:31:44Z-1306470" refcenter="SC">
        <IDENTIFIERS>
          <PRIMARY_ID>ERS074283</PRIMARY_ID>
          <SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
        </IDENTIFIERS>
      </SAMPLE_DESCRIPTOR>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_NAME>4008297</LIBRARY_NAME>
        <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <PAIRED NOMINAL_LENGTH="250"/>
        </LIBRARY_LAYOUT>
      </LIBRARY_DESCRIPTOR>
      <SPOT_DESCRIPTOR>
        <SPOT_DECODE_SPEC>
          <READ_SPEC>
            <READ_INDEX>0</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Forward</READ_TYPE>
            <BASE_COORD>1</BASE_COORD>
          </READ_SPEC>
          <READ_SPEC>
            <READ_INDEX>1</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Reverse</READ_TYPE>
            <RELATIVE_ORDER follows_read_index="0"/>
          </READ_SPEC>
        </SPOT_DECODE_SPEC>
      </SPOT_DESCRIPTOR>
    </DESIGN>
    <PLATFORM>
      <ILLUMINA>
        <INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
      </ILLUMINA>
    </PLATFORM>
    <PROCESSING/>
  </EXPERIMENT>
  <SUBMISSION accession="ERA119046" center_name="SC" submission_date="2012-04-17T09:29:50Z" alias="ERP000913-sc-20120417-2" lab_name="">
    <IDENTIFIERS>
      <PRIMARY_ID>ERA119046</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID>
    </IDENTIFIERS>
  </SUBMISSION>
  <STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" center_name="SC" accession="ERP000913">
    <IDENTIFIERS>
      <PRIMARY_ID>ERP000913</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
    </IDENTIFIERS>
    <DESCRIPTOR>
      <STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE>
      <STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
      <STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT>
      <CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME>
      <STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION>
    </DESCRIPTOR>
  </STUDY>
  <SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470" center_name="SC" accession="ERS074283">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS074283</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME>
      <COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME>
      <TAXON_ID>119602</TAXON_ID>
      <SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. equisimilis</SCIENTIFIC_NAME>
    </SAMPLE_NAME>
    <SAMPLE_LINKS>
      <SAMPLE_LINK>
        <ENTREZ_LINK>
          <DB>biosample</DB>
          <ID>859730</ID>
        </ENTREZ_LINK>
      </SAMPLE_LINK>
    </SAMPLE_LINKS>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>Strain</TAG>
        <VALUE>MR223754</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>Sample Description</TAG>
        <VALUE/>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-StrainOrLine</TAG>
        <VALUE>MR223754</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-Sex</TAG>
        <VALUE>not applicable</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-Species</TAG>
        <VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
  <RUN_SET>
    <RUN alias="SC_RUN_7229_8#56" center_name="SC" accession="ERR109334" total_spots="2708543" total_bases="406281450" size="334475592" load_done="true" published="2012-04-27 20:11:35" is_public="true" cluster_name="public" static_data_available="1">
      <IDENTIFIERS>
        <PRIMARY_ID>ERR109334</PRIMARY_ID>
        <SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID>
      </IDENTIFIERS>
      <EXPERIMENT_REF refname="SC_EXP_7229_8#56" refcenter="SC" accession="ERX086768">
        <IDENTIFIERS>
          <PRIMARY_ID>ERX086768</PRIMARY_ID>
          <SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
        </IDENTIFIERS>
      </EXPERIMENT_REF>
      <Pool>
        <Member member_name="" accession="ERS074283" sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470" spots="2708543" bases="406281450"/>
      </Pool>
    </RUN>
  </RUN_SET>
</EXPERIMENT_PACKAGE>
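Since the stated goal was to go through the XML in PHP, the new document can be queried with DOMXPath as well. A minimal sketch, using a shortened fragment of the output above as inline input (in the real script the string would come from the imploded buffer; the variable names are illustrative):

```php
<?php
// Shortened fragment of the extracted XML, used here as stand-in input.
$xml = '<EXPERIMENT_PACKAGE>'
     . '<EXPERIMENT accession="ERX086768"><IDENTIFIERS>'
     . '<PRIMARY_ID>ERX086768</PRIMARY_ID></IDENTIFIERS></EXPERIMENT>'
     . '<SAMPLE accession="ERS074283"><IDENTIFIERS>'
     . '<PRIMARY_ID>ERS074283</PRIMARY_ID></IDENTIFIERS></SAMPLE>'
     . '</EXPERIMENT_PACKAGE>';

$new = new DOMDocument();
// loadXML() returns false when the string is not well-formed XML.
if (!$new->loadXML($xml)) {
    throw new RuntimeException('Extracted text is not well-formed XML');
}

$xpath = new DOMXPath($new);

// evaluate() with string() returns a plain PHP string for a single value.
$accession = $xpath->evaluate('string(/EXPERIMENT_PACKAGE/SAMPLE/@accession)');

// query() returns a node list that can be iterated.
$ids = array();
foreach ($xpath->query('//PRIMARY_ID') as $node) {
    $ids[] = $node->textContent;
}

echo $accession, "\n";        // prints ERS074283
echo implode(', ', $ids), "\n"; // prints ERX086768, ERS074283
```

Checking the return value of loadXML() is worthwhile here because the buffer comes from a scraped page, and a layout change on that page would most likely show up first as malformed XML.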

So don't reinvent the wheel, just learn about the existing tools. It's sometimes easier than it looks at first sight.
