How to separate words in a "sentence" with spaces?


Problem description

Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human readable fashion.

There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data come with names that are not human-friendly, such as:

payperiodmatchcode labordistributioncodedesc dependentrelationship actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus bankaccountstatusdesc bankaccounttype bankaccounttypedesc beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass beneficiaryclass beneficiaryclassdesc benefitactioncode benefitactioncodedesc benefitagecontrol benefitagecontroldesc ageconrolagelimit ageconrolnoticeperiod

Question

How would you automatically change such names to:

  • pay period match code
  • labor distribution code desc
  • dependent relationship

  • Use Google's "Did you mean" engine, although I think that violates their TOS:

lynx -dump «url» | grep "Did you mean" | awk ...

Any language is fine, but text parsers such as Perl would probably be well-suited. (The column names are English-only.)

The goal is not 100% perfection in breaking words apart; the following outcome is acceptable:

  • enrollmenteffectivedate -> Enrollment Effective Date
  • enrollmentenddate -> Enroll Men Tend Date
  • enrollmentrequirementset -> Enrollment Requirement Set
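
The mapping in the list above is a split followed by capitalization. The capitalization half is small enough to sketch here, assuming some splitter has already produced the words. The helper name `words_to_label` is hypothetical, and title-casing every word is an assumption about the desired presentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join already-split words into a display label.
# Title-casing each word is an assumption, not a stated requirement.
sub words_to_label {
    my @words = @_;
    return join ' ', map { ucfirst(lc($_)) } @words;
}

print words_to_label(qw(enrollment effective date)), "\n";  # Enrollment Effective Date
print words_to_label(qw(pay period match code)), "\n";      # Pay Period Match Code
```

Keeping this step separate from the splitting means a reviewer can correct the word boundaries first and regenerate the labels afterwards.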

No matter what, a human will need to double-check the results and correct many. Whittling a set of 2,000 results down to 600 edits would be a dramatic time savings. To fixate on some cases having multiple possibilities (e.g., therapistname) is to miss the point altogether.

Recommended answer

Sometimes, brute-forcing is acceptable:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

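# Extra project-specific "words" that the system dictionary lacks.
my @mydict = qw( desc );

# Build one alternation of every word longer than two characters,
# longest first.
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;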

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # Peel dictionary words off the end of the identifier, one at a time.
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}

Output:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period
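
Stripping words from the end also explains how a wrong but reviewable split can arise for an identifier such as enrollmentenddate from the question. The following is a minimal sketch of the same loop, assuming a tiny hypothetical word list in place of the system dictionary; the function name `split_from_end` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny hypothetical word list; the script above reads a system dictionary.
my @words = qw(enroll enrollment men tend end date effective);

my $pat = join '|', map { quotemeta($_) } sort { length $b <=> length $a } @words;
my $re  = qr/$pat/;

# Repeatedly strip the longest listed word that ends the identifier.
sub split_from_end {
    my ($identifier) = @_;
    my @stack;
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # Anything left over could not be matched; flag it with '*'.
    unshift @stack, '*', $identifier if length $identifier;
    return @stack;
}

for my $id (qw(enrollmenteffectivedate enrollmentenddate)) {
    print "$id : @{[ split_from_end($id) ]}\n";
}
# enrollmenteffectivedate : enrollment effective date
# enrollmentenddate : enroll men tend date
```

In the second identifier, "tend" is a longer suffix than "end", so it wins, and what remains can only be split as "enroll men". Such a line carries no `*` marker, so it still has to be caught by the human review the question already expects.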

See also "A Spellchecker Used to Be a Major Feat of Software Engineering".
