Removing differences of some text pairs


Problem description

A few days ago I asked a question about tagging differences between two text files, and it was answered quickly.

Now I have a similar but slightly more complicated question. I have two pairs of files with the following characteristics: pair1: (File1.txt, File2.txt), pair2: (File3.txt, File4.txt).

There is a line-by-line correspondence between the files in each pair. Say that File1.txt and File3.txt contain English words, and File2.txt and File4.txt contain their Arabic and French translations, respectively. In addition, File1.txt and File3.txt are very similar (and in some cases identical).


    File1.txt       File2.txt
    EnWord1         ArTrans1
    EnWord2         ArTrans2
    EnWord3         ArTrans3
    EnWord4         ArTrans4

    File3.txt       File4.txt
    EnWord1         FrTrans1
    EnWord3         FrTrans3
    EnWord4         FrTrans4
    EnWord5         FrTrans5

Now I want to compare the English sides of these pairs, find the words common to both files (EnWord1, EnWord3, and EnWord4), and extract their corresponding translations. In short, using two bilingual dictionaries (English-Arabic and English-French), I am trying to build a trilingual English-Arabic-French dictionary. How can this be done?

I should add that there are many such pairs (the dictionaries are stored in separate files, each containing part of the words, and for various reasons the files cannot be merged before processing), so speed is very important and I am looking for a fast way to do this.

Finally, please give me some pointers (or, if possible, the complete code) for doing this in Perl.

Best regards, Hakim

Solution

I assume the order you want to maintain follows File1.txt. The following Perl should accomplish what you're looking for:

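#!/usr/bin/perl

use strict;
use warnings;

die "Usage: $0 File1.txt File2.txt File3.txt File4.txt\n" unless @ARGV == 4;

# Glue each pair of files line by line as "word:translation".
# This relies on the external paste command and on words containing no ":".
my @pair1 = `paste -d ":" $ARGV[0] $ARGV[1]`;
my @pair2 = `paste -d ":" $ARGV[2] $ARGV[3]`;

my @pairs = (@pair1, @pair2);
my (%seen, @dups);

# A word seen a second time is common to both pairs
# (assumes no word is repeated within a single file).
foreach (@pairs)
{
  my $word = (split ":", $_)[0];
  push @dups, $word if $seen{$word}++;
}

open (my $file0, ">", "NEW_File0.txt") or die "Cannot open NEW_File0.txt: $!";
open (my $file1, ">", "NEW_File1.txt") or die "Cannot open NEW_File1.txt: $!";
open (my $file2, ">", "NEW_File2.txt") or die "Cannot open NEW_File2.txt: $!";

foreach my $duplicate (@dups)
{
  print {$file0} "$duplicate\n";

  # \Q...\E stops regex metacharacters in the word from being interpreted.
  # The split limit of 2 keeps any ":" inside the translation intact.
  foreach (@pair1) { print {$file1} ((split ":", $_, 2)[1]) if /^\Q$duplicate\E:/; }
  foreach (@pair2) { print {$file2} ((split ":", $_, 2)[1]) if /^\Q$duplicate\E:/; }
}

close $file0;
close $file1;
close $file2;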

Run like this:

./script.pl File1.txt File2.txt File3.txt File4.txt


grep "" NEW_File* results:

NEW_File0.txt:EnWord1
NEW_File0.txt:EnWord3
NEW_File0.txt:EnWord4
NEW_File1.txt:ArTrans1
NEW_File1.txt:ArTrans3
NEW_File1.txt:ArTrans4
NEW_File2.txt:FrTrans1
NEW_File2.txt:FrTrans3
NEW_File2.txt:FrTrans4

May not be the most efficient way to do things, but should give you somewhere to start at least. HTH.
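
On the efficiency point: the script above rescans both pair lists for every common word, so a hash lookup may scale better on large dictionaries. What follows is only a sketch under some assumptions: each English word appears at most once per file, and the two files of a pair have the same number of lines. The helper names read_pair and join_pairs are made up for this illustration. It reads the files directly instead of calling paste, so a ":" inside a word is not a problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read two line-aligned files and return a list of [word, translation]
# pairs in file order. (Hypothetical helper, not from the original answer.)
sub read_pair {
    my ($key_file, $val_file) = @_;
    open my $kfh, '<', $key_file or die "Cannot open $key_file: $!";
    open my $vfh, '<', $val_file or die "Cannot open $val_file: $!";
    my @pairs;
    while (defined(my $key = <$kfh>)) {
        my $val = <$vfh>;
        last unless defined $val;    # stop if the translation file is shorter
        chomp($key, $val);
        push @pairs, [$key, $val];
    }
    close $kfh;
    close $vfh;
    return @pairs;
}

# Join two pair lists on the English word. Returns [en, ar, fr] rows in
# the order of the first list. (Hypothetical helper.)
sub join_pairs {
    my ($pairs1, $pairs2) = @_;

    # Index the second list by word; keep the first translation if a
    # word is repeated.
    my %second;
    for my $p (@$pairs2) {
        $second{ $p->[0] } = $p->[1] unless exists $second{ $p->[0] };
    }

    # One pass over the first list, with a constant-time lookup per word.
    my @rows;
    for my $p (@$pairs1) {
        my ($en, $ar) = @$p;
        push @rows, [$en, $ar, $second{$en}] if exists $second{$en};
    }
    return @rows;
}

# Runs only when executed as a script with arguments.
if (!caller && @ARGV) {
    @ARGV == 4 or die "Usage: $0 File1.txt File2.txt File3.txt File4.txt\n";
    my @p1   = read_pair(@ARGV[0, 1]);
    my @p2   = read_pair(@ARGV[2, 3]);
    my @rows = join_pairs(\@p1, \@p2);

    my @out_names = ('NEW_File0.txt', 'NEW_File1.txt', 'NEW_File2.txt');
    for my $col (0 .. 2) {
        open my $out, '>', $out_names[$col]
            or die "Cannot write $out_names[$col]: $!";
        print {$out} "$_->[$col]\n" for @rows;
        close $out;
    }
}
```

It is run the same way as the script above and writes the same three NEW_File*.txt files. With many file pairs, calling read_pair and join_pairs once per pair keeps each run linear in the size of that pair's files.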
