Parsing a big, not well-formed file with Java


Problem description


    I have to solve a problem that involves parsing a huge file, 3 GB or larger. The file is structured as a pseudo-XML file, like:

    <docFileNo_1>
    <otherItems></otherItems>
    
    <html>
    <div=XXXpostag>
    </html>
    
    </docFileNo>
       ... others doc... 
    <docFileNo_N>
    <otherItems></otherItems>
    
    <html>
    <div=XXXpostag>
    </html>
    
    </docFileNo>
    

    Surfing the net, I have read about people who encountered problems managing large files; they suggested that I map the file with NIO. I think that solution is too expensive and could end up throwing an exception. So my problem comes down to resolving two doubts:

    1. How to read the 3 GB text file efficiently in terms of time
    2. How to parse efficiently the HTML extracted from each docFileNoxx, and apply rules to the HTML tags to extract the post from the tags.

    So I have tried to solve the first question this way:

    1. _reader = new BufferedReader(new FileReader(filePath)) // create a buffered reader for the file
    2. _currentLine = _reader.readLine(); // I iterate over the file, reading it line by line
    3. For every line, I append the line to a String variable until I encounter the closing tag
    4. Then with JSoup and a post CSS filter I extract the content and write it to a file.

    The extraction process takes, for 25 MB, about 88 seconds on average... so I would like to speed it up.

    How could I speed up my extraction?
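For reference, the loop described in the steps above can be sketched as follows (a minimal reconstruction with hypothetical names; the JSoup step is only marked with a comment, since the accumulation loop is what matters for performance):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class NaiveExtract {
    // Collect the raw text of each <docFileNo...> ... </docFileNo> block.
    static List<String> extractDocs(BufferedReader reader) throws IOException {
        List<String> docs = new ArrayList<>();
        String doc = "";                      // repeated += on a String copies it every time
        String line;
        while ((line = reader.readLine()) != null) {
            doc += line + "\n";
            if (line.startsWith("</docFileNo")) {
                docs.add(doc);                // here JSoup + CSS filters would run on doc
                doc = "";
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<docFileNo_1>\n<html>\n<div=XXXpostag>\n</html>\n</docFileNo>\n";
        List<String> docs = extractDocs(new BufferedReader(new StringReader(sample)));
        System.out.println(docs.size());      // 1
    }
}
```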

    Solution

    Whatever you do, don't do this (pseudocode):

    String data = "";
    for line in file {
        data += line;
    }
    

    but use a StringBuilder:

    StringBuilder data = new StringBuilder();
    for line in file {
        data.append(line);
    }
    return data.toString();
    
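In actual Java, the two loops above can be compared with a small self-contained example (both produce the same string; the first copies the whole accumulated string on every `+=`, which is what makes it quadratic):

```java
public class ConcatDemo {
    // Quadratic: each += allocates a new String and copies everything so far.
    static String concatPlus(String[] lines) {
        String data = "";
        for (String line : lines) {
            data += line;
        }
        return data;
    }

    // Linear: StringBuilder appends into a growing internal buffer.
    static String concatBuilder(String[] lines) {
        StringBuilder data = new StringBuilder();
        for (String line : lines) {
            data.append(line);
        }
        return data.toString();
    }

    public static void main(String[] args) {
        String[] lines = {"a", "b", "c"};
        System.out.println(concatPlus(lines));     // abc
        System.out.println(concatBuilder(lines));  // abc
    }
}
```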

    Further, consider walking through the file and creating a map with only the interesting parts. I assume you don't have real XML but something which only looks a bit like it, and that the example you gave is a fair representation of the content.

    Map<String, String> entries = new HashMap<String,String>(1000);
    StringBuilder entryData = null;
    for line in file {
      if line starts with "<docFileNo" {
         docFileNo = extract number from line;
      } else if line starts with "<div=XXXpostag>" {
         // Content of this entry starts here
         entryData = new StringBuilder();
      } else if line starts with "</html>" {
         // content of this entry ends here
         // so store content, and indicate that the entry is finished by 
         // setting data to null
         entries.put(docFileNo, entryData.toString());
         entryData = null;
      } else if entryData is not null {
         // we're in an entry as data is not null, so store the line
         entryData.append(line);
      }
    }
    

    The map contains only entry-sized strings, which makes them a bit easier to handle. I think you'd need to adapt this to the real data, but it is something you could test in about half an hour.
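A runnable Java version of that pseudocode might look like this (a sketch under the assumption that the markers always start a line, as in the example; the class and method names are mine):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class EntryScanner {
    // Walk the file line by line, storing only the content between
    // <div=XXXpostag> and </html>, keyed by the docFileNo number.
    static Map<String, String> scan(BufferedReader reader) throws IOException {
        Map<String, String> entries = new HashMap<>(1000);
        StringBuilder entryData = null;
        String docFileNo = null;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("<docFileNo")) {
                docFileNo = line.replaceAll("\\D+", "");  // e.g. "<docFileNo_42>" -> "42"
            } else if (line.startsWith("<div=XXXpostag>")) {
                entryData = new StringBuilder();          // content of this entry starts here
            } else if (line.startsWith("</html>")) {
                entries.put(docFileNo, entryData.toString());
                entryData = null;                         // entry finished; stop storing lines
            } else if (entryData != null) {
                entryData.append(line).append('\n');      // we're inside an entry
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<docFileNo_1>\n<html>\n<div=XXXpostag>\npost text\n</html>\n</docFileNo>\n";
        Map<String, String> entries = scan(new BufferedReader(new StringReader(sample)));
        System.out.println(entries.get("1").trim());      // post text
    }
}
```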

    The clue is entryData. It is not only the StringBuilder in which the data of one entry is built up; when it is non-null it also indicates that we have seen a start-of-entry marker (the div), and when it is null that we have seen the end marker (</html>), meaning the following lines need not be stored.

    I assumed you want to keep the doc number, and that the XXXpostag is constant.

    An alternative implementation of this logic could be made using the Scanner class.
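For example, Scanner's findWithinHorizon could pull out one entry per regex match (again just a sketch assuming the markers from the example; the regex is mine and would need adapting to the real data):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class ScannerExtract {
    // One regex match per entry: doc number -> text between <div=XXXpostag> and </html>.
    static Map<String, String> scan(String input) {
        Map<String, String> entries = new HashMap<>();
        Pattern entry = Pattern.compile(
                "<docFileNo_(\\d+)>.*?<div=XXXpostag>\\s*(.*?)\\s*</html>",
                Pattern.DOTALL);
        try (Scanner sc = new Scanner(input)) {
            // horizon 0 = search without bound; match() gives the capture groups
            while (sc.findWithinHorizon(entry, 0) != null) {
                MatchResult m = sc.match();
                entries.put(m.group(1), m.group(2));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "<docFileNo_1>\n<otherItems></otherItems>\n<html>\n"
                + "<div=XXXpostag>\nhello\n</html>\n</docFileNo>\n";
        System.out.println(scan(sample).get("1")); // hello
    }
}
```

A Scanner can also be constructed over a File, which should avoid loading the whole 3 GB into memory at once, though the internal buffering behavior is worth verifying on input of that size.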

