How do I search for strings in an XML file and write them to a text file


Question


Hello,
I have 200 XML files. Each one consists of pathways (something like network information), and each pathway consists of entities with some attributes. I would like to ask how I can create a text file for each XML file that contains only the name attributes of all the entities inside that XML file. My XML files are in this format:

<?xml version="1.0" ?>
<!DOCTYPE pathway (View Source for full doctype...)>
<!-- Creation date: Oct 7, 2014 11:01:31 +0900 (GMT+09:00) -->
<pathway name="path:gmx00010" org="gmx" number="00010" title="Glycolysis / Gluconeogenesis">

  <entry id="13" name="gmx:100527532 gmx:100775844 gmx:100778363 gmx:100786504 gmx:100792394 gmx:100795446 gmx:100798677 gmx:100802732 gmx:100815070 gmx:100818383 gmx:100818915 gmx:547751" type="gene">
  </entry>

  <entry id="37" name="gmx:100777399 gmx:100778722 gmx:100782019 gmx:100783726 gmx:100784210 gmx:100786773 gmx:100798020 gmx:100798892 gmx:100800699 gmx:100803104 gmx:100808513 gmx:100809812 gmx:100811186 gmx:100811501 gmx:100811891 gmx:100816594 gmx:100817701 gmx:100819197 gmx:547717" type="gene">
  </entry>

  <entry id="38" name="ko:K01905" type="ortholog">
  </entry>

  <entry id="39" name="ko:K00129" type="ortholog">
  </entry>





I want to write a program in Visual C++ that creates a text file with the same title as the XML file. The text file should contain the name attribute values (e.g. gmx:100527532 gmx:100775844 gmx:100778363 gmx:100786504 gmx:100792394 gmx:100795446 gmx:100798677 gmx:100802732 gmx:100815070 gmx:100818383 gmx:100818915 gmx:547751) of all entities with type="gene" and ignore entities of any other type.

Thanks.

Recommended answer

There are several ways you can go with this. The most flexible approach is to use XSLT, so that when your requirements change (and they will) you can modify your XSL file and apply it to the data again.

A less flexible approach is to use an XmlReader to read the XML one node at a time and hard code what you are searching for. That is good enough for one-off requirements, but if you can spare the time, learning the basics of XSL and XPath is likely to prove a time saver over the long run. See the links in the comments.

If you are working with gene mapping, I suspect that your files are going to be very large, so you will want to avoid solutions that require the whole file to be loaded into memory.

The example code below is C#, but if you're using Visual C++ in .NET the translation should be straightforward.


The XSL route:
There are a number of ways to apply XSL transforms in .NET. For smaller files you can use code like the following, which takes the XML as a string, applies the transform, and returns the result as a string. To use it, open your XML file and read its content into a string.

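// xmlToTransform holds the XML content read from the source file.
string output = ApplyTransform(xmlToTransform, xslTemplate);
// Passing false for append overwrites any existing file, and the
// using block flushes and closes the writer even if Write throws.
using (StreamWriter writer = new StreamWriter(outputPath, false, Encoding.Unicode))
{
  writer.Write(output);
}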






This assumes you have a method something like the one shown below to apply the transform. There are two problems here:
1 - You have to learn XSL and XPath. Neither is too difficult, but it takes time and there are a few gotchas.
2 - Everything is done in memory. This limits the size of file you can work with and can be very slow for larger XML documents.


/// <summary>
/// Apply an XSL transform to a well formed XML string
/// returning the transform output as a string.
/// </summary>
/// <param name="xmlToTransform">Well formed XML as a string.</param>
/// <param name="xslTemplate">Full path to an XSL template file.</param>
/// <returns>The output of the transform as a string.</returns>
public static string ApplyTransform(string xmlToTransform,
                                    string xslTemplate)
{

  XmlReader reader = null;
  XmlWriter writer = null;
  StringWriter sw = new StringWriter();

  try
  {

    // Using a reader allows us to use stylesheets with embedded DTD.
    // DtdProcessing replaces ProhibitDtd, which is obsolete since .NET 4.0.
    XmlReaderSettings readSettings = new XmlReaderSettings();
    readSettings.DtdProcessing = DtdProcessing.Parse;
    reader = XmlReader.Create(xslTemplate, readSettings);

    // We want the output indented by tag.
    XmlWriterSettings writeSettings = new XmlWriterSettings();
    writeSettings.OmitXmlDeclaration = true;
    writeSettings.ConformanceLevel = ConformanceLevel.Fragment;
    writeSettings.CloseOutput = true;
    writeSettings.Indent = true;
    writeSettings.IndentChars = "  ";
    writeSettings.NewLineChars = System.Environment.NewLine;
    writeSettings.Encoding = Encoding.Unicode;
    writeSettings.CheckCharacters = false;
    writer = XmlWriter.Create(sw, writeSettings);

    // Turn the incoming string into something we can apply a
    // transform to.
    XmlDocument dbSchema = new XmlDocument();
    dbSchema.LoadXml(xmlToTransform);
    XPathNavigator xpath = dbSchema.CreateNavigator();

    // Apply the transform.
    XslCompiledTransform styleSheet = new XslCompiledTransform(true);
    styleSheet.Load(reader);
    styleSheet.Transform(xpath, null, writer, null);

  }
  catch (System.Exception)
  {
    #if DEBUG
    System.Diagnostics.Debugger.Break();
    #endif
    // Rethrow with "throw;" so the original stack trace is preserved.
    throw;
  }
  finally
  {
    if (reader != null) reader.Close();
    if (writer != null) writer.Close();
  }

  return sw.ToString();

}
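
The answer does not include a stylesheet, so here is a minimal sketch of what one could look like for this question. It assumes the entries sit directly under the pathway root, as in the sample above. The class name GeneNameTransform, the Apply method and the stylesheet itself are illustrative and not part of the original answer. The DOCTYPE is ignored so that the external DTD is never fetched.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class GeneNameTransform
{
    // Hypothetical stylesheet: emits one line of text per <entry type="gene">.
    public const string Stylesheet = @"<?xml version=""1.0""?>
<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""/"">
    <xsl:for-each select=""pathway/entry[@type='gene']"">
      <xsl:value-of select=""@name""/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the text output.
    public static string Apply(string xml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xslReader);
        }

        // Skip the DOCTYPE instead of rejecting or resolving it.
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Ignore;

        using (XmlReader input = XmlReader.Create(new StringReader(xml), settings))
        using (StringWriter result = new StringWriter())
        {
            // The TextWriter overload honours <xsl:output method="text"/>.
            transform.Transform(input, null, result);
            return result.ToString();
        }
    }
}
```

If the requirements change, for example to include ortholog entries as well, only the select expression in the stylesheet needs to change. Like ApplyTransform above, this still holds the whole document in memory.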






The "hard coded" route.
This can be as simple as the following:

ExtractToFile(@"c:\someDirectory\geneInfo.xml",
              @"c:\someDirectory\geneInfo.txt",
              "entry", "type", "gene", "name");



Where ExtractToFile looks like this....

/// <summary>
/// Extract the value of the specified attribute for elements of the
/// specified name where a search attribute has a specific value.
/// </summary>
/// <param name="inFile">full path to source xml</param>
/// <param name="outFile">full path spec of file to create</param>
/// <param name="elementToFind">The element to find in the XML</param>
/// <param name="attributeName">The search/filter attribute</param>
/// <param name="attributeValue">The required search/filter attribute value.</param>
/// <param name="attributeOut">The attribute for which we want the value.</param>
public static void ExtractToFile(string inFile,
                                 string outFile,
                                 string elementToFind,
                                 string attributeName,
                                 string attributeValue,
                                 string attributeOut) {

  // XML is case sensitive, but this comparison deliberately is not.
  StringComparison ignoreCase = StringComparison.InvariantCultureIgnoreCase;

  // Decide how often we're going to dump output from buffer to disk.
  int rowCount   = 0;
  int flushCount = 1000;

  // The sample files carry a DOCTYPE; skip it rather than let the
  // reader reject the document (the default on .NET 4.0 and later).
  XmlReaderSettings readSettings = new XmlReaderSettings();
  readSettings.DtdProcessing = DtdProcessing.Ignore;

  // We assume the input exists and is well formed XML. StreamWriter
  // overwrites an existing output file, so no explicit delete is needed.
  // An XmlReader works through the nodes from start to end; we just wait
  // for the elements we're interested in and deal with them as they pass.
  using (StreamWriter output = new StreamWriter(outFile))
  using (XmlReader fileReader = XmlReader.Create(inFile, readSettings)) {

    while (fileReader.Read()) {
      if (fileReader.NodeType == XmlNodeType.Element &&
          fileReader.Name.Equals(elementToFind, ignoreCase) &&
          fileReader.HasAttributes) {

        // GetAttribute returns null when the attribute is missing.
        string _find = fileReader.GetAttribute(attributeName);
        string _out  = fileReader.GetAttribute(attributeOut);

        if (_out != null && string.Equals(_find, attributeValue, ignoreCase)) {
          output.WriteLine(_out);
          // Count each row written so the periodic flush actually fires.
          if (++rowCount == flushCount) {
            rowCount = 0;
            output.Flush();
          }
        }
      }
    }
  }
}
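
The question asks for one text file per XML file, across 200 files. The following is a rough sketch of how that loop might look. It assumes all the XML files sit in a single directory, which the question does not state. The names GeneNameBatch, ExtractGeneNames and ProcessDirectory are made up for illustration, and the extraction is a cut-down, hard-coded variant of ExtractToFile so that the example is self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

public static class GeneNameBatch
{
    // Writes the name attribute of every <entry type="gene"> in inFile
    // to outFile, one entry per line. Returns the number of lines written.
    public static int ExtractGeneNames(string inFile, string outFile)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Ignore; // skip the DOCTYPE

        int written = 0;
        using (XmlReader reader = XmlReader.Create(inFile, settings))
        using (StreamWriter output = new StreamWriter(outFile))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "entry" &&
                    reader.GetAttribute("type") == "gene")
                {
                    string name = reader.GetAttribute("name");
                    if (name != null)
                    {
                        output.WriteLine(name);
                        written++;
                    }
                }
            }
        }
        return written;
    }

    // Processes every *.xml file in a directory (assumed layout: all
    // files in one folder) and returns the number of files handled.
    public static int ProcessDirectory(string directory)
    {
        int files = 0;
        foreach (string xmlPath in Directory.GetFiles(directory, "*.xml"))
        {
            // Same title as the XML file, with a .txt extension.
            ExtractGeneNames(xmlPath, Path.ChangeExtension(xmlPath, ".txt"));
            files++;
        }
        return files;
    }
}
```

Calling GeneNameBatch.ProcessDirectory on the folder that holds the XML files would leave a .txt file next to each one. Because the reader streams through each document, memory use should stay small even for large files.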


I believe that XmlReader is limited to 2GB files. If your files are larger than this, you will have to consider alternative solutions. This might be a good place to start:

Parse XML at SAX Speed without DOM or SAX

