Merge two yml files does not handle duplicates?


Problem description


I am trying to merge 2 YAML files using the Hash::Merge Perl module, and to dump the result to a YAML file using Dump from the YAML module.

use strict;
use warnings;
use Hash::Merge qw( merge );
Hash::Merge::set_behavior('RETAINMENT_PRECEDENT');
use File::Slurp qw(write_file);
use YAML;
my $yaml1 = $ARGV[0];
my $yaml2 = $ARGV[1];
my $yaml_output = $ARGV[2];
my $clkgrps = YAML::LoadFile($yaml1);
my $clkgrps1 = YAML::LoadFile($yaml2);
my $clockgroups = merge($clkgrps1, $clkgrps);
my $out_yaml = Dump $clockgroups;
write_file($yaml_output, { binmode => ':raw' }, $out_yaml);

After merging the yml files, I can see duplicate entries, i.e. the following content is the same in the two yml files. While merging, it treats them as different entries. Is there a built-in way to handle duplicates?

Solution

The data structures obtained from YAML files generally contain keys whose values are arrayrefs of hashrefs. In your test case that is the arrayref for the key test.

Then a tool like Hash::Merge can only add the hashrefs to the arrayref belonging to the same key; it is not meant to compare array elements, as there aren't general criteria for that. So you need to do this yourself in order to prune duplicates, or apply any specific rules of your choice to data.

One way to handle this is to serialize (stringify) the complex data structures in each arrayref that may contain duplicates, so that a hash can be built with those strings as keys. That is the standard way to handle duplicates (O(1) per lookup, albeit possibly with a large constant).

There are a number of ways to serialize data in Perl. I'd recommend JSON::XS, a very fast tool whose output can be consumed by any language and tool. (But of course research others, which may suit your precise needs better.)

A simple, complete example using your test case:

use strict;
use warnings;
use feature 'say';
use Data::Dump qw(dd pp);

use YAML;
use JSON::XS;
use Hash::Merge qw( merge );
#Hash::Merge::set_behavior('RETAINMENT_PRECEDENT');  # irrelevant here

die "Usage: $0 in-file1 in-file2 output-file\n" if @ARGV != 3;

my ($yaml1, $yaml2, $yaml_out) = @ARGV;

my $hr1 = YAML::LoadFile($yaml1);
my $hr2 = YAML::LoadFile($yaml2);
my $merged = merge($hr2, $hr1);
#say "merged: ", pp $merged;

my $json = JSON::XS->new->canonical;  # canonical: sort keys, so identical
                                      # hashes always serialize the same way

for my $key (keys %$merged) {
    # Identical serialized elements collapse into the same hash key
    my %uniq = map { $json->encode($_) => 1 } @{$merged->{$key}};

    # Overwrite the arrayref with the duplicate-free list
    $merged->{$key} = [ map { $json->decode($_) } keys %uniq ];
}
dd $merged;

# Save the final structure...

More complex data structures require a more judicious traversal; consider using a tool for that.
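For instance, a recursive walk that prunes duplicates from every arrayref it meets could look like this (a sketch, with a hypothetical `dedup_deep` helper; it assumes elements can be compared by their canonical JSON form):

```perl
use strict;
use warnings;
use JSON::XS;

my $json = JSON::XS->new->canonical;  # stable key order, so equal hashes stringify equally

# Recursively walk a structure, removing duplicate elements from every arrayref
sub dedup_deep {
    my ($data) = @_;
    if (ref $data eq 'ARRAY') {
        my %seen;
        @$data = grep { !$seen{ $json->encode($_) }++ } @$data;
        dedup_deep($_) for @$data;
    }
    elsif (ref $data eq 'HASH') {
        dedup_deep($_) for values %$data;
    }
    return $data;
}
```

Unlike the single loop in the example above, this prunes duplicates at any nesting depth, and the grep keeps the original element order.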

With files as shown in the question this prints

{
  test => [
    { directory => "LIB_DIR", name => "ObsSel.ktc", project => "TOT" },
    { directory => "MODEL_DIR", name => "pipe.v", project => "TOT" },
    {
      directory => "PCIE_LIB_DIR",
      name => "pciechip.ktc",
      project => "PCIE_MODE",
    },
    { directory => "NAME_DIR", name => "fame.v", project => "SINGH" },
    { directory => "TREE_PROJECT", name => "Syn.yml", project => "TOT" },
  ],
}

(I use Data::Dump to show complex data, for its simplicity and default compact output.)

If there are issues with serializing and comparing entire structures consider using a digest (checksum, hashing) of some sort.
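A sketch of that idea with the core Digest::MD5 module over the canonical JSON form (`uniq_by_digest` is a hypothetical helper; the digest is only used as a hash key, so collisions are the usual, negligible caveat):

```perl
use strict;
use warnings;
use JSON::XS;
use Digest::MD5 qw(md5_hex);

my $json = JSON::XS->new->canonical;

# Keep the first occurrence of each structure; compare by a short digest
# instead of the full serialized string (useful when elements are large)
sub uniq_by_digest {
    my @elems = @_;
    my %seen;
    return grep { !$seen{ md5_hex( $json->encode($_) ) }++ } @elems;
}
```

A side benefit is that the original references survive unchanged, so no decode round-trip is needed.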

Another option altogether would be to compare data structures as they are in order to resolve duplicates, by hand. For comparison of complex data structures I like to use Test::More, which works very nicely for mere comparisons outside of any testing. But there are dedicated tools as well of course, like Data::Compare.
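As a sketch of that by-hand route with Data::Compare (its `Compare` function returns true when two structures are deeply equal; note that this makes the dedup O(n²), unlike the hashing approaches):

```perl
use strict;
use warnings;
use Data::Compare;

# O(n^2) dedup: keep an element only if it deep-compares unequal
# to every element kept so far
sub uniq_by_compare {
    my @elems = @_;
    my @kept;
    for my $e (@elems) {
        push @kept, $e unless grep { Compare($e, $_) } @kept;
    }
    return @kept;
}
```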


Finally, instead of manually processing the result of a naive merge, like above, one can code the desired behavior using Hash::Merge::add_behavior_spec and then have the module do it all. For specific examples of how to use this feature see for instance this post, this post, and this post.

Note that in this case you still write all the code that does the job, as above, but the module does take some of the mechanics off your hands.
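A sketch of what such a custom behavior might look like, here registered through the module's functional `specify_behavior` interface (the spec layout and the `Hash::Merge::_merge_hashes` call for the HASH/HASH case follow the module's documented examples; the behavior name and the `$uniq_arrays` helper are made up for illustration):

```perl
use strict;
use warnings;
use JSON::XS;
use Hash::Merge qw(merge);

my $json = JSON::XS->new->canonical;
my $uniq_arrays = sub {    # concatenate two arrayrefs, dropping duplicates
    my %seen;
    [ grep { !$seen{ $json->encode($_) }++ } @{ $_[0] }, @{ $_[1] } ];
};

# Left-precedent spec whose array/array rule merges without duplicates
Hash::Merge::specify_behavior(
    {
        SCALAR => {
            SCALAR => sub { $_[0] },
            ARRAY  => sub { $uniq_arrays->( [ $_[0] ], $_[1] ) },
            HASH   => sub { $_[0] },
        },
        ARRAY => {
            SCALAR => sub { $uniq_arrays->( $_[0], [ $_[1] ] ) },
            ARRAY  => sub { $uniq_arrays->( $_[0], $_[1] ) },
            HASH   => sub { $_[0] },
        },
        HASH => {
            SCALAR => sub { $_[0] },
            ARRAY  => sub { $_[0] },
            HASH   => sub { Hash::Merge::_merge_hashes( $_[0], $_[1] ) },
        },
    },
    'UNIQUE_RETAINMENT',
);

# Now duplicates are dropped during the merge itself
my $merged = merge(
    { test => [ { n => 1 }, { n => 2 } ] },
    { test => [ { n => 2 }, { n => 3 } ] },
);
```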
