Computing precision and recall for two sets of keywords in NLTK and Scikit for sets of different sizes

Views: 126 Published: 2020/5/18 1:19:14 Tags: python-3.x scikit-learn nltk


Problem description

I am trying to compute precision and recall for two sets of keywords. The gold_standard has 823 terms and the test has 1497 terms.
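For reference, here is a minimal sketch of what precision and recall mean for two plain sets of terms. The helper names `set_precision` and `set_recall` and the sample terms are made up for illustration; the formulas follow the usual set-based definitions (intersection size over test size, and intersection size over reference size), which is also how I understand nltk.metrics to treat a reference set and a test set.

```python
def set_precision(reference, test):
    """Fraction of test terms that also appear in the reference set."""
    if not test:
        return None  # undefined when the test set is empty
    return len(reference & test) / len(test)


def set_recall(reference, test):
    """Fraction of reference terms that were found in the test set."""
    if not reference:
        return None  # undefined when the reference set is empty
    return len(reference & test) / len(reference)


# Hypothetical sample data: the two sets deliberately differ in size.
gold_standard = {"neural network", "gradient", "tokenizer"}
test = {"gradient", "tokenizer", "parser", "lexer"}

print("Precision:", set_precision(gold_standard, test))  # 2 shared / 4 test terms
print("Recall:", set_recall(gold_standard, test))        # 2 shared / 3 gold terms
```

Neither formula requires the two sets to have the same number of elements, which is why the set-based version works for 823 and 1497 terms.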

Using nltk.metrics's versions of precision and recall, I can pass in the two sets without any problem. But doing the same with Scikit throws an error:

ValueError: Found arrays with inconsistent numbers of samples: [ 823 1497]

How can I fix this?

#!/usr/bin/python3

from nltk.metrics import precision, recall
from sklearn.metrics import precision_score
from sys import argv
from time import time
import numpy
import csv

def readCSVFile(filename):
    termList = set()
    with open(filename, 'rt', encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:
            termList.update(row)
    return termList

def readDocuments(gs_file, fileToProcess):
    print("Reading CSV files...")
    gold_standard = readCSVFile(gs_file)
    test = readCSVFile(fileToProcess)
    print("All files successfully read!")
    return gold_standard, test

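def calcPrecisionScipy(gs, test):
    # Convert both term sets to arrays; their lengths differ (823 vs. 1497).
    gs = numpy.array(list(gs))
    test = numpy.array(list(test))
    # precision_score compares the arrays element by element, so arrays of
    # different lengths trigger the ValueError quoted above.
    print("Precision Scipy: ",precision_score(gs, test, average=None))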

def process(dataset):
    print("Processing input...")
    gs, test = dataset
    print("Precision: ", precision(gs, test))
    calcPrecisionScipy(gs, test)

def usage():
    print("Usage: python3 generate_stats.py gold_standard.csv termlist_to_process.csv")

if __name__ == '__main__':
    if len(argv) != 3:
        usage()
        exit(-1)

    t0 = time()
    process(readDocuments(argv[1], argv[2]))
    print("Total runtime: %0.3fs" % (time() - t0))

I referred to the following pages while writing the code:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score

=================================Update===================================

Okay, so I tried adding 'nonsensical' data to the smaller set to make both the same length:

def calcPrecisionScipy(gs, test):
    if len(gs) < len(test):
        gs.update(list(range(len(test)-len(gs))))
    gs = numpy.array(list(gs))
    test = numpy.array(list(test))
    print("Precision Scipy: ",precision_score(gs, test, average=None))

Now I get another error:

UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
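A possible explanation, offered as a sketch rather than a definitive answer: precision_score compares two equal-length arrays position by position, and with average=None it scores every distinct label as its own class. Padding the smaller set with filler values makes the lengths match, but the terms at each position are still unrelated, and any label that never shows up among the predictions has undefined precision, hence the warning. One common workaround is to turn both sets into aligned 0/1 indicator vectors over the union of all terms. The helper names `to_indicator_vectors` and `sklearn_precision_recall` and the sample terms below are hypothetical.

```python
def to_indicator_vectors(gold, test):
    """Map two term sets onto aligned 0/1 vectors over their union."""
    vocabulary = sorted(gold | test)  # fixed order so both vectors line up
    y_true = [1 if term in gold else 0 for term in vocabulary]
    y_pred = [1 if term in test else 0 for term in vocabulary]
    return y_true, y_pred


def sklearn_precision_recall(gold, test):
    """Score the positive class (term is a keyword) with scikit-learn."""
    # Imported lazily so the helper above stays usable without scikit-learn.
    from sklearn.metrics import precision_score, recall_score

    y_true, y_pred = to_indicator_vectors(gold, test)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)


# Hypothetical sample data with sets of different sizes.
gold = {"apple", "banana", "cherry"}
test = {"banana", "cherry", "date", "elderberry"}

y_true, y_pred = to_indicator_vectors(gold, test)
print(y_true)  # membership of each union term in the gold standard
print(y_pred)  # membership of each union term in the test set
```

Both vectors always have the length of the union, so the sizes of the original sets no longer matter. With the default binary averaging, the scores returned by `sklearn_precision_recall(gold, test)` should agree with the set-based definitions of precision and recall.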
