Need an approach for building a custom NER to extract the keywords below from any format of payslip
Problem description
I am trying to build a generic extractor for the following fields from payslips in any format:
- Name
- Postcode
- Pay date
- Net pay
The challenge is the variety of formats that may come in. I want to apply NER (spaCy) so that these fields are learned as the following entities:
- Name: PERSON
- Postcode
- Pay date: DATE
- Net pay: MONEY
So far I have been unsuccessful. I even tried building a custom EntityMatcher for postcode and date, but without success.
I am looking for guidelines on the right path to take: what is the right and best ML approach to achieve this?
A snippet of the custom NER I tried to build:
import spacy
import random
import threading
import time
from DateEntityMatcher import DateEntityMatcher
from PostCodeEntityMatcher import PostCodeEntityMatcher


class IncomeValidatorModel(object):
    """ Threading example class
    The run() method will be started and it will run in the background
    until the application exits.
    """

    def __init__(self, interval=1):
        """ Constructor
        :type interval: int
        :param interval: Check interval, in seconds
        """
        self.interval = interval
        thread = threading.Thread(target=self.run, args=())
        thread.daemon = True  # Daemonize thread
        thread.start()  # Start the execution

    def run(self):
        """ Method that runs forever """
        while True:
            # Do something
            print('Doing something important in the background')
            DATA = [
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR M HASAN CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR K KHANA CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR M MENON CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR F JAHAN CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR A JAHAN CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR M HASAN CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR M HASAN CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR M HASAN CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
{'entities': [(203, 218, 'ORG'), (100, 106, 'PERSON'), (1097, 1103, 'MONEY')]}),
(u"Sample Payslip Matrix House Basing View Basingstoke Hampshire RG21 4FF Advantage Resourcing 6th Floor, Matrix House, Basing View, Basingstoke, Hampshire, RG21 4FF Registered Number 03341461 COMPANY DIVISION Advantage Resourcing UK SWINDON WORKER NO. NAME PERIOD PAY DATE IND 123456 Sample Payslip 14/2016 08/07/2016 W1 DEPARTMENT TAX CODE N.I. NO./TABLE LETTER NAT 1100L JA123456A/A PAYMENTS DEDUCTIONS Wk Ending Timesheet Description Units Rate Amount Deduction Amount 03/07/2016 GEN000499628 Hourly Rate 40.00 10.00 400.00 Tax 87.60 03/07/2016 GEN000499628 Week Day Overtime 10.00 15.00 150.00 NI 59.40 03/07/2016 GEN000499628 Saturday Overtime 5.00 20.00 100.00 TOTAL PAYMENTS 650.00 TOTAL DEDUCTIONS 147.00 CUMULATIVES GROSS TO DATE 650.00 Current Holiday Entitlement: 0.00 Unit(s) TAXABLE PAY TO DATE 650.00 EE PENSION TO DATE 0.00 ER PENSION TO DATE 0.00 TAX TO DATE 87.60 TO DATE 68.17 TO DATE 59.40 c Safe Computing Limited 2002 NET PAY 503.00",
{'entities': [(89, 109, 'ORG'), (0, 14, 'PERSON'), (1186, 1191, 'MONEY')]}),
(u"Mubssar Hasan Matrix House Basing View Basingstoke Hampshire RG21 4FF Advantage Resourcing 6th Floor, Matrix House, Basing View, Basingstoke, Hampshire, RG21 4FF Registered Number 03341461 COMPANY DIVISION Advantage Resourcing UK SWINDON WORKER NO. NAME PERIOD PAY DATE IND 123456 Sample Payslip 14/2016 08/07/2016 W1 DEPARTMENT TAX CODE N.I. NO./TABLE LETTER NAT 1100L JA123456A/A PAYMENTS DEDUCTIONS Wk Ending Timesheet Description Units Rate Amount Deduction Amount 03/07/2016 GEN000499628 Hourly Rate 40.00 10.00 400.00 Tax 87.60 03/07/2016 GEN000499628 Week Day Overtime 10.00 15.00 150.00 NI 59.40 03/07/2016 GEN000499628 Saturday Overtime 5.00 20.00 100.00 TOTAL PAYMENTS 650.00 TOTAL DEDUCTIONS 147.00 CUMULATIVES GROSS TO DATE 650.00 Current Holiday Entitlement: 0.00 Unit(s) TAXABLE PAY TO DATE 650.00 EE PENSION TO DATE 0.00 ER PENSION TO DATE 0.00 TAX TO DATE 87.60 TO DATE 68.17 TO DATE 59.40 c Safe Computing Limited 2002 NET PAY 503.00",
{'entities': [(88, 108, 'ORG'), (0, 13, 'PERSON'), (1186, 1191, 'MONEY')]}),
(u"Oracle Corp Anil Menon Work Date 01/09/2019 PAYMENTS Tax 100 Net Pay 2000",
{'entities': [(0, 10, 'ORG'), (12, 21, 'PERSON'), (69, 72, 'MONEY')]}),
(u"Huawei Corp Anil Menon Work Date 01/06/2019 PAYMENTS Tax 100 Net Pay 1900",
{'entities': [(0, 10, 'ORG'), (12, 21, 'PERSON'), (69, 72, 'MONEY')]}),
(u"Tata Corp Nitin Garg Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 1900",
{'entities': [(0, 8, 'ORG'), (10, 19, 'PERSON'), (67, 70, 'MONEY')]}),
(u"Accenture Corp Amol Joshi Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 900",
{'entities': [(0, 15, 'ORG'), (17, 26, 'PERSON'), (72, 74, 'MONEY')]}),
(u"Cognizant Corp Anup Nair Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 900",
{'entities': [(0, 15, 'ORG'), (17, 25, 'PERSON'), (71, 73, 'MONEY')]}),
(u"Cognizant Corp Sajit Kumar Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 1900",
{'entities': [(0, 15, 'ORG'), (17, 27, 'PERSON'), (73, 76, 'MONEY')]}),
(u"Tata Corp Saurabh Dave Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 1300",
{'entities': [(0, 8, 'ORG'), (10, 21, 'PERSON'), (69, 72, 'MONEY')]}),
(u"Capgemini PLC Mubashshir Hasan Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 1700",
{'entities': [(0, 12, 'ORG'), (14, 29, 'PERSON'), (77, 80, 'MONEY')]}),
(u"Capgemini PLC Sagar Pande Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 1700",
{'entities': [(0, 12, 'ORG'), (14, 24, 'PERSON'), (72, 75, 'MONEY')]}),
(u"Capgemini PLC Sreeram Yegappan Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 2000",
{'entities': [(0, 12, 'ORG'), (14, 29, 'PERSON'), (77, 80, 'MONEY')]})
            ]
            # nlp = spacy.blank('en')  # new, empty model. Let's say it's for the English language
            global nlp
            nlp = spacy.load('en_core_web_sm')
            nlp.entity.add_label('ORG')
            nlp.entity.add_label('PERSON')
            nlp.entity.add_label('MONEY')
            # add NER pipeline
            # ner = nlp.create_pipe('ner')  # our pipeline would just do NER
            # nlp.add_pipe(ner, last=True)  # we add the pipeline to the model
            postcde_entity_matcher = PostCodeEntityMatcher(nlp, ['NN1 3LE', 'NN2 8HF', 'IG3 8TH', 'NN4 7YH', 'RG21 5GH'], 'POSTCDE')
            nlp.entity.add_label('POSTCDE')
            nlp.add_pipe(postcde_entity_matcher, before='ner')
            date_entity_matcher = DateEntityMatcher(nlp, ['20/04/2019', '20/04/2019', '25/04/2016', '20/04/2019', '20/07/2019', '20/12/2019'], 'DATE')
            nlp.entity.add_label('DATE')
            nlp.add_pipe(date_entity_matcher, before='ner')
            optimizer = nlp.begin_training()
            for i in range(11):
                random.shuffle(DATA)
                for text, annotations in DATA:
                    nlp.update([text], [annotations], sgd=optimizer)
            time.sleep(self.interval)

    def extractPayslipData(self, data):
        doc = nlp(data)
        for entity in doc.ents:
            print(entity.label_, ' | ', entity.text)
        return doc.ents
Recommended answer
The training JSON (x.json) should look like this:
[{
"text": "PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR M HASAN CAPGEMINI UK PLC EMP REFERENCE TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53",
"entities": [
[
191,
198,
"PERSON"
],
[
202,
211,
"ORG"
],
[
150,
157,
"POST_CODE"
],
[
1096,
1103,
"MONEY"
]]
},
{
"text": "Mubssar Hasan Matrix House Basing View Basingstoke Hampshire RG21 4FF Advantage Resourcing 6th Floor, Matrix House, Basing View, Basingstoke, Hampshire, RG21 4FF Registered Number 03341461 COMPANY DIVISION Advantage Resourcing UK SWINDON WORKER NO. NAME PERIOD PAY DATE IND 123456 Sample Payslip 14/2016 08/07/2016 W1 DEPARTMENT TAX CODE N.I. NO./TABLE LETTER NAT 1100L JA123456A/A PAYMENTS DEDUCTIONS Wk Ending Timesheet Description Units Rate Amount Deduction Amount 03/07/2016 GEN000499628 Hourly Rate 40.00 10.00 400.00 Tax 87.60 03/07/2016 GEN000499628 Week Day Overtime 10.00 15.00 150.00 NI 59.40 03/07/2016 GEN000499628 Saturday Overtime 5.00 20.00 100.00 TOTAL PAYMENTS 650.00 TOTAL DEDUCTIONS 147.00 CUMULATIVES GROSS TO DATE 650.00 Current Holiday Entitlement: 0.00 Unit(s) TAXABLE PAY TO DATE 650.00 EE PENSION TO DATE 0.00 ER PENSION TO DATE 0.00 TAX TO DATE 87.60 TO DATE 68.17 TO DATE 59.40 c Safe Computing Limited 2002 NET PAY 503.00",
"entities": [
[
1,
13,
"PERSON"
],
[
88,
108,
"ORG"
],
[
150,
157,
"POST_CODE"
],
[
1186,
1192,
"MONEY"
]]
}
]
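Hand-counted character offsets like the ones in this JSON are easy to get wrong, so it may help to compute them from the text instead. The following is a minimal sketch in plain Python, not part of the original answer; the helper name `find_span` is hypothetical, and it assumes each entity value appears verbatim in the text.

```python
def find_span(text, value, label, start=0):
    """Return [start, end, label] for the first occurrence of value at or after start.

    The end offset is exclusive, so text[start:end] == value.
    """
    begin = text.find(value, start)
    if begin == -1:
        raise ValueError(f"{value!r} not found in text")
    return [begin, begin + len(value), label]


# One of the short samples from the question
text = "Capgemini PLC Sagar Pande Work Date 20/04/2019 PAYMENTS Tax 100 Net Pay 1700"

entities = [
    find_span(text, "Capgemini PLC", "ORG"),
    find_span(text, "Sagar Pande", "PERSON"),
    find_span(text, "1700", "MONEY"),
]

# Print each span next to the text it covers, to check it by eye
for begin, end, label in entities:
    print(label, begin, end, repr(text[begin:end]))
```

Slicing the text with each span is a quick sanity check before training. For this sample, the question's annotation (14, 24, 'PERSON') would cover only "Sagar Pand", which suggests those end offsets were counted inclusively rather than exclusively.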
Code:
import json
import random
from pathlib import Path

import spacy
from spacy.gold import GoldParse  # spaCy 2.x API; this module was removed in spaCy 3

training_pickel_file = "x.json"  # the training JSON shown above

with open(training_pickel_file) as input_file:
    TRAIN_DATA = json.load(input_file)


def main(model=None, output_dir="/home/NLP/model", n_iter=50):
    if model is not None:
        nlp = spacy.load(model)  # load an existing model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

    # create the NER component if the pipeline does not have one yet
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')

    # register every label that occurs in the training data
    for annotations in TRAIN_DATA:
        for ent in annotations["entities"]:
            ner.add_label(ent[2])
    print(ner)

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for a in TRAIN_DATA:
                doc = nlp.make_doc(a["text"])
                gold = GoldParse(doc, entities=a["entities"])
                nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)
            print('Losses', losses)

    # save the trained model
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)


if __name__ == "__main__":
    main()
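The question also asks for the pay date and postcode, which follow fairly regular patterns. A rule-based pass could run alongside the trained model for those fields. The sketch below is an assumption on my part and not part of the original answer: the function name `extract_fields` is hypothetical, the postcode pattern is a simplified shape rather than a full UK postcode validator, and the date pattern only covers dd/mm/yyyy as seen in the samples.

```python
import re

# Simplified UK postcode shape, e.g. "NN1 3LE" or "RG21 4FF" (not a full validator)
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")
# dd/mm/yyyy dates as they appear in the sample payslips
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
# The amount that directly follows a "NET PAY" label
NET_PAY_RE = re.compile(r"NET PAY\s+(-?\d+(?:\.\d{2})?)", re.IGNORECASE)


def extract_fields(text):
    """Collect postcode, date, and net pay candidates from flattened payslip text."""
    net = NET_PAY_RE.search(text)
    return {
        "postcodes": POSTCODE_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "net_pay": net.group(1) if net else None,
    }


# Shortened excerpt assembled from the first sample payslip
sample = (
    "PAY DATE 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE "
    "GROSS PAY 4758.17 NET PAY 3897.53"
)
print(extract_fields(sample))
```

A payslip usually contains several dates and sometimes both the employer's and the employee's postcode, so these patterns return candidates only. Choosing the right one still needs context, such as the nearest label ("PAY DATE") or the position relative to the name.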
Testing the model:
sen = ["""PRIVATE & CONFIDENTIAL REF. No. DEPT SITE PAY DATE 82521 002 31/07/2019 MR M HASAN 69 ALCOMBE ROAD NORTHAMPTON UK NN1 3LE CONFIDENTIAL PAY ADVICE MR M HASAN CAPGEMINI UK PLC EMP REFERENCE COME A TAXDISTRICT TAXREFERENCE D83/82521 475/VB53759 TAXABLE PAY 14297.14 AY DATE 31/07/2019 TAX PERIOD 2019-04 ANN. SALARY 49650.00 TAX PAID 1611.40 PAY METHOD BACS TAX CODE 1871L PAY PERIOD MONTHLY N.I. EMPLOYEE 1365.96 N.I. NUMBER SY095026C CONTRACT HRS 40.00 PERIOD PAY 4137.50 N.I. EMPLOYER 1576.11 N.I. TABLE A O/TIME RATE 23.8702 HOURLY RATE 23.8702 PAYMENTS DEDUCTIONS DESCRIPTION HRS/UNITS RATE VALUE TO DATE DESCRIPTION VALUE BAL ANCE TO DATE BENEFIT ALLOW 620.67 706.61 NAT.INS 385.84 1365.96 DISP NT -353.08 -1253.08 P.A.Y.E. 474.80 1611.40 SALARY 4137.50 16514.38 ACCOM NT -470.77 -1670.77 GROSS PAY 4758.17 TOTAL DEDUCTIONS 860.64 NET PAY 3897.53"""]
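nlp = spacy.load("/home/NLP/model")  # the model saved by main() above

for text in sen:
    doc = nlp(text)
    entity = {}
    for ent in doc.ents:
        # group texts by label so that repeated labels are not overwritten
        entity.setdefault(ent.label_, []).append(ent.text)
    print(entity)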
Result: