Performance problem with reading and parsing large XML files


Question

I have a directory containing several large XML files (about 10 GB in total). Is there any way to iterate through the directory, read the files in small chunks (say, 50 bytes at a time), and parse the XML with high performance?

func (mdc *Mdc) Loadxml(path string, wg sync.WaitGroup) { // wg is passed by value, so wg.Done() never reaches the caller's WaitGroup
    defer wg.Done()
    //var conf configuration
    file, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    scanner := bufio.NewScanner(file)
    buf := make([]byte, 1024*1024)
    scanner.Buffer(buf, 50) // caps the max token size at 50 bytes, smaller than the buffer itself
    for scanner.Scan() {
        // reads from the same file the scanner is consuming, overwriting buf each iteration
        _, err := file.Read(buf)
        if err != nil {
            log.Fatal(err)
        }
    }

    // only whatever bytes happen to be left in buf get unmarshaled
    err = xml.Unmarshal(buf, &mdc)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(mdc)
}

Answer

You can do something even better: you can tokenize your XML files.

Say you have an XML document like this:

<inventory>
  <item name="ACME Unobtainium">
    <tag>Foo</tag>
    <count>1</count>
  </item>
  <item name="Dirt">
    <tag>Bar</tag>
    <count>0</count>
  </item>
</inventory>

You can then use the following data model:

type Inventory struct {
    Items []Item `xml:"item"`
}

type Item struct {
    Name  string   `xml:"name,attr"`
    Tags  []string `xml:"tag"`
    Count int      `xml:"count"`
}

Now all you have to do is use filepath.Walk and do something like this for each file you want to process:

    decoder := xml.NewDecoder(file)

    for {
        // Read tokens from the XML document in a stream.
        t, err := decoder.Token()

        // If we are at the end of the file, we are done
        if err == io.EOF {
            log.Println("The end")
            break
        } else if err != nil {
            log.Fatalf("Error decoding token: %s", err)
        } else if t == nil {
            break
        }

        // Here, we inspect the token
        switch se := t.(type) {

        // We have the start of an element.
        // However, we have the complete token in t
        case xml.StartElement:
            switch se.Name.Local {

            // Found an item, so we process it
            case "item":
                var item Item

                // We decode the element into our data model...
                if err = decoder.DecodeElement(&item, &se); err != nil {
                    log.Fatalf("Error decoding item: %s", err)
                }

                // And use it for whatever we want to
                log.Printf("'%s' in stock: %d", item.Name, item.Count)

                if len(item.Tags) > 0 {
                    log.Println("Tags")
                    for _, tag := range item.Tags {
                        log.Printf("\t%s", tag)
                    }
                }
            }
        }
    }

Working example with dummy XML: https://play.golang.org/p/MiLej7ih9Jt
