Grouping CFG grammar rules sentence by sentence
Problem description
The rules below are generated for each sentence, and we have to group them per sentence. The input is in a file, and the output should also be written to a file.
sentenceid=2
NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP
NP--->N_NNP
NP--->N_NN_O_NU
VGF--->V_VM_VF
sentenceid=3
NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU
NP--->N_NN
VGF--->V_VM_VF
sentenceid=4
NP--->N_NNP
NP--->N_NN_S_NU
NP--->N_NNP_O_M
VGF--->V_VM_VF
The section above is the input, which is the grammar for each sentence. Within each sentence, I want to group adjacent rules that share the same left-hand side. The output should look like this:
sentenceid=2
NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
VGF--->V_VM_VF
sentenceid=3
NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU N_NN
VGF--->V_VM_VF
sentenceid=4
NP--->N_NNP N_NN_S_NU N_NNP_O_M
VGF--->V_VM_VF
How can I implement this? I need the rules of almost 1000 sentences for a probability calculation. This is the CFG grammar for each sentence, and I want to group adjacent rules sentence by sentence.
Recommended answer
How about this? It assumes the sentences are stored in a separate file (named `sentence` in the code below).
#!/usr/bin/env python3
import re

marker = '--->'

def parse_it(sen):
    """Map each sentence id to its runs of adjacent rules with the same left-hand side."""
    total_dic = dict()  # dicts keep insertion order in Python 3.7+
    marker_memory = ''
    mem = None   # left-hand side of the run being collected
    lo = list()  # (lhs, rhs) pairs of the current run
    with open(sen, 'r') as fh:
        for line in fh:
            if not line.strip():
                continue
            match = re.search(r'(sentenceid=\d+)', line)
            if match:
                # A new sentence starts: flush the previous run and reset the state.
                if lo:
                    total_dic[marker_memory].append(lo)
                marker_memory = match.group(0)
                total_dic[marker_memory] = []
                mem, lo = None, []
            else:
                k, v = [x.strip() for x in line.strip().split(marker)]
                if mem is None or mem == k:
                    lo.append((k, v))
                else:
                    # The left-hand side changed: close the run and start a new one.
                    total_dic[marker_memory].append(lo)
                    lo = [(k, v)]
                mem = k
    # Flush the last run of the last sentence.
    if lo:
        total_dic[marker_memory].append(lo)
    return total_dic

dic = parse_it('sentence')
for kin, lol in dic.items():
    print(kin)
    for i in lol:
        k, v = zip(*i)
        print('%s%s%s' % (k[0], marker, ' '.join(v)))
Output:
sentenceid=2
NP--->N_NNP N_NN_S_NU N_NNP N_NNP N_NN_O_NU
VGF--->V_VM_VF
sentenceid=3
NP--->N_NN
VGNF--->V_VM_VNF
JJP--->JJ
NP--->N_NN_S_NU N_NN
VGF--->V_VM_VF
sentenceid=4
NP--->N_NNP N_NN_S_NU N_NNP_O_M
VGF--->V_VM_VF
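Since the question asks for the result to end up in a file, here is a possible variant built on `itertools.groupby`, which collapses runs of consecutive items with the same key. Treat it as a rough sketch rather than a tested solution for the real data: the names `group_rules` and `group_file` are made up for this example, and it assumes every non-blank line is either exactly `sentenceid=<number>` or a rule containing `--->`.

```python
import re
from itertools import groupby

MARKER = '--->'
SENTENCE_RE = re.compile(r'sentenceid=\d+')  # assumed header format


def group_rules(lines):
    """Collapse runs of adjacent rules that share a left-hand side.

    Header lines act as separators, so a run never crosses a sentence boundary.
    """
    # Drop blank lines first so they cannot split a run.
    stripped = [ln.strip() for ln in lines if ln.strip()]

    def key(ln):
        # Headers are keyed by themselves, rules by their left-hand side.
        if SENTENCE_RE.fullmatch(ln):
            return ln
        return ln.split(MARKER, 1)[0].strip()

    out = []
    for k, run in groupby(stripped, key=key):
        run = list(run)
        if SENTENCE_RE.fullmatch(k):
            out.extend(run)  # header lines pass through unchanged
        else:
            rhs = [ln.split(MARKER, 1)[1].strip() for ln in run]
            out.append(k + MARKER + ' '.join(rhs))
    return out


def group_file(src, dst):
    """Read rules from src and write the grouped rules to dst (hypothetical helper)."""
    with open(src) as fin:
        grouped = group_rules(fin)
    with open(dst, 'w') as fout:
        fout.write('\n'.join(grouped) + '\n')
```

Calling something like `group_file('sentence', 'grouped.txt')` should then write the grouped rules in input order; the output file name is only a placeholder. A rule line without `--->` would raise an `IndexError` here, so malformed input needs its own handling.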