Computing precision and recall for two sets of keywords in NLTK and Scikit for sets of different sizes
Question
I am trying to compute precision and recall for two sets of keywords. The gold_standard set has 823 terms and the test set has 1497 terms.
Using the nltk.metrics versions of precision and recall, I can pass in the two sets just fine. But doing the same with Scikit throws an error:
ValueError: Found arrays with inconsistent numbers of samples: [ 823 1497]
How can I fix this?
#!/usr/bin/python3
from nltk.metrics import precision, recall
from sklearn.metrics import precision_score
from sys import argv
from time import time
import numpy
import csv

def readCSVFile(filename):
    # Collect every cell of the CSV file into one set of terms.
    termList = set()
    with open(filename, 'rt', encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:
            termList.update(row)
    return termList

def readDocuments(gs_file, fileToProcess):
    print("Reading CSV files...")
    gold_standard = readCSVFile(gs_file)
    test = readCSVFile(fileToProcess)
    print("All files successfully read!")
    return gold_standard, test

def calcPrecisionScipy(gs, test):
    # This call raises the ValueError above when the sets differ in size.
    gs = numpy.array(list(gs))
    test = numpy.array(list(test))
    print("Precision Scipy: ", precision_score(gs, test, average=None))

def process(dataset):
    print("Processing input...")
    gs, test = dataset
    print("Precision: ", precision(gs, test))
    calcPrecisionScipy(gs, test)

def usage():
    print("Usage: python3 generate_stats.py gold_standard.csv termlist_to_process.csv")

if __name__ == '__main__':
    if len(argv) != 3:
        usage()
        exit(-1)
    t0 = time()
    process(readDocuments(argv[1], argv[2]))
    print("Total runtime: %0.3fs" % (time() - t0))
I referred to the following pages for coding:
- http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
- http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
=================================Update===================================
Okay, so I tried adding nonsensical data to the smaller set to make the two equal in length:
def calcPrecisionScipy(gs, test):
    if len(gs) < len(test):
        gs.update(list(range(len(test)-len(gs))))
    gs = numpy.array(list(gs))
    test = numpy.array(list(test))
    print("Precision Scipy: ",precision_score(gs, test, average=None))
Now I have another error:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
Recommended answer
It does not seem possible to compute precision or recall for two sets of different lengths. I guess what nltk must do is truncate the sets to the same length; you can do the same in your script.
import numpy as np
import sklearn.metrics

set1 = [True, True]
set2 = [True, False, False]

# Truncate both lists to the length of the shorter one.
length = np.amin([len(set1), len(set2)])
set1 = set1[:length]
set2 = set2[:length]
print(sklearn.metrics.precision_score(set1, set2))
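As an alternative to truncation, here is a minimal sketch of computing precision and recall directly from the set intersection, which does not require the two sets to have the same size. As far as I can tell, this matches how nltk.metrics defines precision(reference, test) and recall(reference, test), but treat that as an assumption and check it against your NLTK version. The helper names (set_precision, set_recall, to_indicator_vectors) and the sample terms are made up for illustration.

```python
def set_precision(reference, test):
    """Fraction of test terms that also appear in the reference set."""
    if not test:
        return None  # undefined when nothing was predicted
    return len(reference & test) / len(test)


def set_recall(reference, test):
    """Fraction of reference terms that were found in the test set."""
    if not reference:
        return None  # undefined when there is nothing to find
    return len(reference & test) / len(reference)


def to_indicator_vectors(reference, test):
    """Build two equal-length boolean lists over the union of both sets.

    Position i answers "is vocabulary[i] in the set?" for each set, which is
    the per-sample label layout that array-based metrics expect.
    """
    vocabulary = sorted(reference | test)
    y_true = [term in reference for term in vocabulary]
    y_pred = [term in test for term in vocabulary]
    return y_true, y_pred


# Hypothetical sample data: 4 gold terms, 5 candidate terms, 2 in common.
gold = {"neural network", "precision", "recall", "corpus"}
candidates = {"precision", "recall", "token", "stem", "lemma"}

print("Precision:", set_precision(gold, candidates))
print("Recall:", set_recall(gold, candidates))

y_true, y_pred = to_indicator_vectors(gold, candidates)
print("Vector length:", len(y_true), len(y_pred))
```

The two lists returned by to_indicator_vectors have the same length by construction, so they should be usable as the y_true and y_pred arguments of sklearn.metrics.precision_score and sklearn.metrics.recall_score. With binary labels built this way, the Scikit results should agree with the set-based values, since a true positive is exactly a term present in both sets.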