How to train custom NER in Spacy with single words data set?

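Question

I am trying to train a custom NER model with a new entity, 'Animal'. But my data set consists of single words (isolated names with no surrounding sentence), as shown below: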

TRAIN_DATA = [("Whale_ Blue", {"entities": [(0,11,LABEL)]}), ("Shark_ whale", {"entities": [(0,12,LABEL)]}), ("Elephant_ African", {"entities": [(0,17,LABEL)]}), ("Elephant_ Indian", {"entities": [(0,16,LABEL)]}), ("Giraffe_ male", {"entities": [(0,13,LABEL)]}), ("Mule", {"entities": [(0,4,LABEL)]}), ("Camel", {"entities": [(0,5,LABEL)]}), ("Horse", {"entities": [(0,5,LABEL)]}), ("Cow", {"entities": [(0,3,LABEL)]}), ("Dolphin_ Bottlenose", {"entities": [(0,19,LABEL)]}), ("Donkey", {"entities": [(0,6,LABEL)]}), ("Tapir", {"entities": [(0,5,LABEL)]}), ("Shark_ Hammerhead", {"entities": [(0,17,LABEL)]}), ("Seal_ fur", {"entities": [(0,9,LABEL)]}), ("Manatee", {"entities": [(0,7,LABEL)]}), ("Bear_ Grizzly", {"entities": [(0,13,LABEL)]}), ("Alligator_ American", {"entities": [(0,19,LABEL)]}), ("Sturgeon_ Atlantic", {"entities": [(0,18,LABEL)]}), ("Lion", {"entities": [(0,4,LABEL)]}), ("Bear_ American Black", {"entities": [(0,20,LABEL)]}), ("Ostrich", {"entities": [(0,7,LABEL)]}), ("Crocodile_ Saltwater", {"entities": [(0,20,LABEL)]}), ("Pig", {"entities": [(0,3,LABEL)]}), ("Sheep", {"entities": [(0,5,LABEL)]}), ("Dog_ Saint Bernard", {"entities": [(0,18,LABEL)]}), ("Human", {"entities": [(0,5,LABEL)]}), ("Deer_ white-tailed", {"entities": [(0,18,LABEL)]}), ("Tuna", {"entities": [(0,4,LABEL)]}), ("Salamander_ Japanese", {"entities": [(0,20,LABEL)]}), ("Carp", {"entities": [(0,4,LABEL)]}), ("Dog_ Foxhound", {"entities": [(0,13,LABEL)]}), ("Goat_ Milch", {"entities": [(0,11,LABEL)]}), ("Sting Ray", {"entities": [(0,9,LABEL)]}), ("Dog_ Pointer", {"entities": [(0,12,LABEL)]}), ("Kangaroo_ Red", {"entities": [(0,13,LABEL)]}), ("Cod_ Atlantic", {"entities": [(0,13,LABEL)]}), ("Dog_ Collie", {"entities": [(0,11,LABEL)]}), ("Pike_ Northern", {"entities": [(0,14,LABEL)]}), ("Trout_ brown", {"entities": [(0,12,LABEL)]}), ("Dog_ Basset Hound", {"entities": [(0,17,LABEL)]}), ("Turkey", {"entities": [(0,6,LABEL)]}), ("Porcupine", {"entities": [(0,9,LABEL)]}), ("Trout_ Rainbow", 
{"entities": [(0,14,LABEL)]}), ("Gar_ longnose", {"entities": [(0,13,LABEL)]}), ("Beaver", {"entities": [(0,6,LABEL)]}), ("Dog_ Irish Terrier", {"entities": [(0,18,LABEL)]}), ("Dog_ Beagle", {"entities": [(0,11,LABEL)]}), ("Bass_ Large Mouth Black", {"entities": [(0,23,LABEL)]}), ("Dog_ Whippet", {"entities": [(0,12,LABEL)]}), ("Dog_ Boston Terrier", {"entities": [(0,19,LABEL)]}), ("Nutria", {"entities": [(0,6,LABEL)]}), ("Dog_ Fox Terrier", {"entities": [(0,16,LABEL)]}), ("Armadillo_ Nine-banded", {"entities": [(0,22,LABEL)]}), ("Fox_ Arctic", {"entities": [(0,11,LABEL)]}), ("Woodchuck (Groundhog)", {"entities": [(0,21,LABEL)]}), ("Rabbit_ Domestic", {"entities": [(0,16,LABEL)]}), ("Chicken", {"entities": [(0,7,LABEL)]}), ("Dog_ Pekingese", {"entities": [(0,14,LABEL)]}), ("Haddock", {"entities": [(0,7,LABEL)]}), ("Cat_ domestic", {"entities": [(0,13,LABEL)]}), ("Salmon_ Chum", {"entities": [(0,12,LABEL)]}), ("Vulture_ Turkey", {"entities": [(0,15,LABEL)]}), ("Opossum_ Large American", {"entities": [(0,23,LABEL)]}), ("Flounder_ Winter", {"entities": [(0,16,LABEL)]}), ("Pheasant_ Ringnecked", {"entities": [(0,20,LABEL)]}), ("Perch", {"entities": [(0,5,LABEL)]}), ("Duck_ Mallard", {"entities": [(0,13,LABEL)]}), ("Mackerel_ Spanish", {"entities": [(0,17,LABEL)]}), ("Platypus_ Duck-billed", {"entities": [(0,21,LABEL)]}), ("Sea lamprey", {"entities": [(0,11,LABEL)]}), ("Bullhead_ Brown", {"entities": [(0,15,LABEL)]}), ("Mink_ American", {"entities": [(0,14,LABEL)]}), ("Falcon_ Peregrin", {"entities": [(0,16,LABEL)]}), ("Goshawk", {"entities": [(0,7,LABEL)]}), ("Bat_ Flying fox", {"entities": [(0,15,LABEL)]}), ("Duck_ Wood", {"entities": [(0,10,LABEL)]}), ("Buzzard", {"entities": [(0,7,LABEL)]}), ("Bass_ Rock", {"entities": [(0,10,LABEL)]}), ("Squirrel_ Gray", {"entities": [(0,14,LABEL)]}), ("Guinea Pig", {"entities": [(0,10,LABEL)]}), ("Rat_ Norway", {"entities": [(0,11,LABEL)]}), ("Gull_ Herring", {"entities": [(0,13,LABEL)]}), ("Crow_ Hooded", {"entities": 
[(0,12,LABEL)]}), ("Rook", {"entities": [(0,4,LABEL)]}), ("Pumpkinseed", {"entities": [(0,11,LABEL)]}), ("Pigeon", {"entities": [(0,6,LABEL)]}), ("Guinea fowl", {"entities": [(0,11,LABEL)]}), ("Quail_ Bobwhite", {"entities": [(0,15,LABEL)]}), ("Magpie_ Black-billed", {"entities": [(0,20,LABEL)]}), ("European Jackdaw", {"entities": [(0,16,LABEL)]}), ("Hamster", {"entities": [(0,7,LABEL)]}), ("Kestrel_ lesser", {"entities": [(0,15,LABEL)]}), ("Hawk_ Night", {"entities": [(0,11,LABEL)]}), ("Chipmunk_ Eastern", {"entities": [(0,17,LABEL)]}), ("Bat_ little brown", {"entities": [(0,17,LABEL)]}), ("Starling_ Common", {"entities": [(0,16,LABEL)]}), ("Frog_ leopard", {"entities": [(0,13,LABEL)]}), ("Weasel_ least", {"entities": [(0,13,LABEL)]}), ("Mouse_ White-footed", {"entities": [(0,19,LABEL)]}), ("Mouse_ House", {"entities": [(0,12,LABEL)]}), ("Canary", {"entities": [(0,6,LABEL)]}), ("Hummingbird", {"entities": [(0,11,LABEL)]}), ("Hummingbird_ Cuban bee", {"entities": [(0,22,LABEL)]}), ("Shrew_ Musked", {"entities": [(0,13,LABEL)]}), ("Shrew_ dwarf", {"entities": [(0,12,LABEL)]}), ("Goby_ Philippine", {"entities": [(0,16,LABEL)]}), ("Goldfish", {"entities": [(0,8,LABEL)]}), ("Toad_ American", {"entities": [(0,14,LABEL)]}), ("Frog_ Bull", {"entities": [(0,10,LABEL)]}), ("Eel_ American", {"entities": [(0,13,LABEL)]}), ("Penguin_ Adelie", {"entities": [(0,15,LABEL)]}), ("Robin", {"entities": [(0,5,LABEL)]}), ("Kiwi", {"entities": [(0,4,LABEL)]}), ("Fighting Fish_ Siamese", {"entities": [(0,22,LABEL)]}), ("Skate", {"entities": [(0,5,LABEL)]}), ("Quail_ Japanese/European", {"entities": [(0,24,LABEL)]}), ("Gila Monster", {"entities": [(0,12,LABEL)]}), ("Chameleon", {"entities": [(0,9,LABEL)]}), ("Cobra_ Indian", {"entities": [(0,13,LABEL)]}), ("Boa Constrictor", {"entities": [(0,15,LABEL)]}), ("Guppy", {"entities": [(0,5,LABEL)]}), ("Salamander_ Tiger", {"entities": [(0,17,LABEL)]}), ("Swordtail_ Mexican", {"entities": [(0,18,LABEL)]}), ("Stickleback_ three spine", 
{"entities": [(0,24,LABEL)]}), ("Sea horse", {"entities": [(0,9,LABEL)]}), ("Hellbender", {"entities": [(0,10,LABEL)]}), ("Herring_ Atlantic", {"entities": [(0,17,LABEL)]}), ("Chameleon_ Madagascar", {"entities": [(0,21,LABEL)]}), ("Frog_ Cuban", {"entities": [(0,11,LABEL)]}), ]

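I used the Python script found here: https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py

After training the model I get wrong results, because spaCy also tags other, unrelated words as 'Animal'.

Can anyone guide me on how to do this the right way? spaCy version: 2.1.8

Recommended answer

Training a spaCy NER model includes the extraction of other "implicit" features, such as part-of-speech tags and surrounding words.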

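When you train on isolated single words, the model cannot pick up features general enough to detect those entities in real text.

Take, for instance, this example from spaCy's own training tutorial: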
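
One way to give the model that context is to place each animal name inside a sentence and annotate only the name. The following is a minimal sketch, not the asker's code and not a spaCy API: the template sentences, the label name "ANIMAL", and the helper make_example are all assumptions made for illustration. It produces data in the spaCy 2.x (text, {"entities": [...]}) format used in the question. Real training data should come from natural text rather than a handful of templates.

```python
LABEL = "ANIMAL"  # assumed label name

# Hypothetical template sentences, each with exactly one "{}" slot
TEMPLATES = [
    "I saw a {} at the zoo yesterday.",
    "The {} is sleeping under the tree.",
    "Do you think a {} can swim?",
]


def make_example(template, name, label=LABEL):
    """Fill the slot with `name` and return the text plus the name's character span."""
    start = template.index("{}")  # position where the name will begin
    text = template.format(name)
    end = start + len(name)
    return (text, {"entities": [(start, end, label)]})


animals = ["horse", "guinea pig", "camel"]
TRAIN_DATA = [make_example(t, a) for t in TEMPLATES for a in animals]

# A sentence with no animal, so the model also sees text without the entity
TRAIN_DATA.append(("I parked the car at the station.", {"entities": []}))

for text, annotations in TRAIN_DATA[:3]:
    print(text, annotations)
```

Each span is computed from the template rather than typed by hand, which avoids off-by-one offsets.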

train_data = [
    ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
    ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
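How could the NER model correctly guess which kind of entity the word "Google" refers to in each of these sentences, if not from its surroundings? The same is true for your words. NER is not a "regex"-like function; it is a machine learning model.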
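
The ambiguity can be made concrete with a few lines of plain Python. This is a small sketch with a hypothetical helper, labels_by_surface, that is not part of spaCy: it groups the annotated spans from three of the tutorial sentences by their lower-cased text and reports any string that carries more than one label.

```python
from collections import defaultdict

train_data = [
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")]),
]


def labels_by_surface(data):
    """Map each annotated string (lower-cased) to the set of labels it received."""
    seen = defaultdict(set)
    for text, spans in data:
        for start, end, label in spans:
            seen[text[start:end].lower()].add(label)
    return dict(seen)


result = labels_by_surface(train_data)

# Strings that were annotated with more than one label
ambiguous = sorted(s for s, labels in result.items() if len(labels) > 1)
print(ambiguous)
```

The string "google" ends up with both ORG and PRODUCT, so a lookup on the word alone cannot choose between them; only the surrounding words can.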
