How to separate words in a "sentence" with spaces?

Views: 180 | Posted: 2016/7/28 14:50:22 | Tags: bash perl awk nlp text-segmentation

Question

Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human readable fashion.

There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data are sourced from non-human-friendly names such as:

payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod

How would you automatically change such names to:

pay period match code

labor distribution code desc

dependent relationship

One idea is to use Google's "Did you mean" engine, although I think that violates their TOS:

lynx -dump «url» | grep "Did you mean" | awk ...

Any language is fine, but text parsers such as Perl would probably be well-suited. (The column names are English-only.)

The goal is not 100% perfection in breaking words apart; the following outcome is acceptable:

enrollmenteffectivedate -> Enrollment Effective Date

enrollmentenddate -> Enroll Men Tend Date

enrollmentrequirementset -> Enrollment Requirement Set
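
Once a name has been split into words, turning it into a label like the ones above is a small formatting step. The following is a minimal sketch in Python (the question accepts any language); `to_label` is a hypothetical helper name, and it assumes the words are already separated by spaces.

```python
def to_label(words):
    """Join already-separated lowercase words into a Title Case label."""
    # str.capitalize upper-cases the first letter and lower-cases the rest.
    return " ".join(word.capitalize() for word in words.split())


print(to_label("enrollment effective date"))  # Enrollment Effective Date
print(to_label("enroll men tend date"))       # Enroll Men Tend Date
```

Abbreviations such as "desc" come out as "Desc"; expanding them would still be part of the manual review.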

No matter what, a human will need to double-check the results and correct many. Whittling a set of 2,000 results down to 600 edits would be a dramatic time savings. To fixate on some cases having multiple possibilities (e.g., therapistname) is to miss the point altogether.

Answer

Sometimes, brute-forcing is acceptable:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

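# Extra "words" that the system dictionary does not contain.
my @mydict = qw( desc );

# Build one alternation from every word longer than two characters,
# sorted longest first, with regex metacharacters escaped.
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # Repeatedly strip a dictionary word off the end of the identifier.
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases: anything left over matched no word
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}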

Output:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period
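
For readers who prefer Python (the question accepts any language), the same strip-a-word-off-the-end idea can be sketched as follows. This is an illustration under stated assumptions, not the answer's code: `split_identifier` and the tiny `WORDS` set are hypothetical stand-ins. A real run would load a full word list such as /usr/share/dict/words, and its splits may differ from the ones shown here.

```python
# Hypothetical miniature dictionary; a real run would use a full word list.
WORDS = {
    "pay", "period", "match", "code", "desc",
    "address", "type", "age", "limit", "role", "name",
}


def split_identifier(identifier, words, min_len=3):
    """Repeatedly strip the longest known word off the end of `identifier`."""
    parts = []
    rest = identifier
    while rest:
        # Scan start positions left to right, so the longest suffix wins.
        for start in range(len(rest) - min_len + 1):
            if rest[start:] in words:
                parts.insert(0, rest[start:])
                rest = rest[:start]
                break
        else:
            # No known word ends here: flag the leftover as suspicious.
            parts[:0] = ["*", rest]
            break
    return " ".join(parts)


print(split_identifier("payperiodmatchcode", WORDS))  # pay period match code
print(split_identifier("psaddresstype", WORDS))       # * ps address type
print(split_identifier("ageconrolagelimit", WORDS))   # * ageconrol age limit
```

As in the Perl version, words shorter than three characters are ignored, and any unmatched remainder is kept and marked with `*` so that a human can review it.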

See also "A Spellchecker Used to Be a Major Feat of Software Engineering".
