Replace a Long List of Words in a Big Text File


Problem Description


    I need a fast method to work with a big text file.

    I have 2 files: a big text file (~20 GB) and another text file that contains a list of ~12 million combo words.

    I want to find every combo word in the first text file and replace it with another combo word (the combo word with underscores).

    Example: "Computer Information" >Replace With> "Computer_Information"

    I use this code, but performance is very poor (I tested on an HP G7 server with 16 GB RAM and 16 cores):

    public partial class Form1 : Form
    {
        HashSet<string> wordlist = new HashSet<string>();
    
        private void loadComboWords()
        {
            using (StreamReader ff = new StreamReader(txtComboWords.Text))
            {
                string line;
                while ((line = ff.ReadLine()) != null)
                {
                    wordlist.Add(line);
                }
            }
        }
    
        private void replacewords(ref string str)
        {
            foreach (string wd in wordlist)
            {
              //  ReplaceEx(ref str,wd,wd.Replace(" ","_"));
                // String.Replace returns a new string; the result must be
                // assigned back, otherwise nothing is actually replaced
                if (str.IndexOf(wd) > -1)
                    str = str.Replace(wd, wd.Replace(" ", "_"));
            }
        }
    
        private void button3_Click(object sender, EventArgs e)
        {
            string line;
            using (StreamReader fread = new StreamReader(txtFirstFile.Text))
            {
                string writefile = Path.Combine(Path.GetDirectoryName(txtFirstFile.Text), Path.GetFileNameWithoutExtension(txtFirstFile.Text) + "_ReplaceComboWords.txt");
                StreamWriter sw = new StreamWriter(writefile);
                long intPercent;
                label3.Text = "initializing";
                loadComboWords();
    
                while ((line = fread.ReadLine()) != null)
                {
                    replacewords(ref line);
                    sw.WriteLine(line);
    
                    intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
                    Application.DoEvents();
                    label3.Text = intPercent.ToString();
                }
                sw.Close();
                fread.Close();
                label3.Text = "Finished";
            }
        }
    }
    

    Any ideas for doing this job in a reasonable time?

    Thanks

    Solution

    Several ideas:

    1. I think it will be more efficient to split each line into words and check whether each word appears in your word list. Ten lookups in a hash set are better than millions of substring searches. If you have composite keywords, build appropriate indexes: one that contains all single words occurring in the real keywords, and another that contains all the real keywords.
    2. Perhaps loading the strings into a StringBuilder is better for replacing.
    3. Update progress after, say, every 10,000 lines processed, not after each one.
    4. Process in a background thread. It won't make it much faster, but the app will stay responsive.
    5. Parallelize the code, as Jeremy has suggested.
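As a minimal sketch of idea 5 (parallelizing, combined with the batched processing from idea 3), the file can be read lazily, split into chunks, and each chunk rewritten in parallel while preserving line order. The class and method names here are illustrative, not part of the original code, and `replaceLine` stands in for whatever per-line replacement logic is used:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ParallelReplaceSketch
{
    // Process the big file in parallel batches: read lines lazily,
    // rewrite each batch with PLINQ, and write results back in order.
    public static void ReplaceInParallel(string inputPath, string outputPath,
                                         Func<string, string> replaceLine,
                                         int batchSize = 10000)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (var batch in ReadBatches(inputPath, batchSize))
            {
                // AsOrdered keeps the original line order within the batch
                foreach (var line in batch.AsParallel().AsOrdered().Select(replaceLine))
                    writer.WriteLine(line);
            }
        }
    }

    static IEnumerable<List<string>> ReadBatches(string path, int batchSize)
    {
        var batch = new List<string>(batchSize);
        // File.ReadLines streams lazily, so the 20 GB file is never fully in memory
        foreach (var line in File.ReadLines(path))
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0)
            yield return batch;
    }
}
```

The batch boundary doubles as a natural point to report progress once per batch rather than once per line.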

    UPDATE

    Here is sample code that demonstrates the by-word index idea:

    static void ReplaceWords()
    {
      string inputFileName = null;
      string outputFileName = null;
    
      // this dictionary maps each single word that can be found
      // in any keyphrase to a list of the keyphrases that contain it.
      IDictionary<string, IList<string>> singleWordMap = null;
    
      using (var source = new StreamReader(inputFileName))
      {
        using (var target = new StreamWriter(outputFileName))
        {
          string line;
          while ((line = source.ReadLine()) != null)
          {
            // first, we split each line into single words - the units of search
            var singleWords = SplitIntoWords(line);
    
            var result = new StringBuilder(line);
            // for each single word in the line
            foreach (var singleWord in singleWords)
            {
              // check if the word exists in any keyphrase we should replace
              // and if so, get the list of the related original keyphrases
              IList<string> interestingKeyPhrases;
              if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
                continue;
    
              Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);
    
              // then process each of the keyphrases
              foreach (var interestingKeyphrase in interestingKeyPhrases)
              {
                // and replace it in the processed line if it exists
                result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
              }
            }
    
            // now, save the processed line
            target.WriteLine(result);
          }
        }
      }
    }
    
    private static string GetTargetValue(string interestingKeyword)
    {
      throw new NotImplementedException();
    }
    
    static IEnumerable<string> SplitIntoWords(string keyphrase)
    {
      throw new NotImplementedException();
    }
    

    The code shows the basic ideas:

    1. We split both keyphrases and processed lines into equivalent units which may be efficiently compared: the words.
    2. We store a dictionary that for any word quickly gives us references to all keyphrases that contain the word.
    3. Then we apply your original logic. However, we do not do it for all 12 million keyphrases, but only for the very small subset of keyphrases that have at least a single-word intersection with the processed line.

    I'll leave the rest of the implementation to you.
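One possible way to fill in the two placeholders, plus the `singleWordMap` that the sample leaves as `null`, is sketched below. It uses the simplest normalization (whitespace split and lowercasing, per the caveats in the issues that follow); the `BuildSingleWordMap` helper name is an assumption, not part of the sample:

```csharp
using System;
using System.Collections.Generic;

static class KeywordIndexSketch
{
    // Simplest normalization: whitespace splitting plus lowercasing.
    // Morphological matching would need considerably more than this.
    public static IEnumerable<string> SplitIntoWords(string text)
    {
        return text.ToLowerInvariant()
                   .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // Build the word -> keyphrases index once, before the main loop,
    // so each line only consults the keyphrases that share a word with it.
    public static IDictionary<string, IList<string>> BuildSingleWordMap(
        IEnumerable<string> keyphrases)
    {
        var map = new Dictionary<string, IList<string>>();
        foreach (var keyphrase in keyphrases)
        {
            foreach (var word in SplitIntoWords(keyphrase))
            {
                IList<string> phrases;
                if (!map.TryGetValue(word, out phrases))
                    map[word] = phrases = new List<string>();
                phrases.Add(keyphrase);
            }
        }
        return map;
    }

    // The replacement rule from the question: spaces become underscores.
    public static string GetTargetValue(string keyphrase)
    {
        return keyphrase.Replace(' ', '_');
    }
}
```

Note that `StringBuilder.Replace` in the sample's inner loop is case-sensitive, so the stored keyphrases only match lines that use the same casing; a fully case-insensitive replace would need extra handling.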

    The code however has several issues:

    1. SplitIntoWords must actually normalize the words to some canonical form. This depends on the required logic. In the simplest case you'll probably be fine with whitespace splitting and lowercasing. But you may need morphological matching - that would be harder (it's very close to full-text search tasks).
    2. For the sake of speed, it is likely better to call the GetTargetValue method once per keyphrase before processing the input.
    3. If a lot of your keyphrases share words, you'll still have a significant amount of extra work. In that case you'll need to keep the positions of the keywords within the keyphrases, so that word-distance calculations can exclude irrelevant keyphrases while processing an input line.
    4. Also, I'm not sure if StringBuilder is actually faster in this particular case. You should experiment with both StringBuilder and string to find out the truth.
    5. It's a sample after all. The design is not very good. I'd consider extracting some classes with consistent interfaces (e.g. KeywordsIndex).
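For issue 2 in particular, the once-per-keyphrase precomputation could look like the sketch below (the class and method names are illustrative). The per-line loop then does a plain dictionary lookup instead of recomputing each replacement:

```csharp
using System.Collections.Generic;

static class ReplacementMapSketch
{
    // Precompute every keyphrase's replacement once, before processing
    // the input, instead of recomputing it inside the per-line loop.
    public static IDictionary<string, string> BuildReplacementMap(
        IEnumerable<string> keyphrases)
    {
        var map = new Dictionary<string, string>();
        foreach (var keyphrase in keyphrases)
            map[keyphrase] = keyphrase.Replace(' ', '_');
        return map;
    }
}
```

With this in place, `result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase))` becomes `result.Replace(interestingKeyphrase, replacementMap[interestingKeyphrase])`.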
