Python pandas: Generate Document-Term matrix from whitespace delimited '.dat' file

Problem description

I'm using Python to attempt to rank documents using an Okapi BM25 model.

I think that I can calculate some of the terms required for Score(D,Q), such as the IDF (Inverse Document Frequency), in a more efficient way (i.e., by counting all non-zero rows for a particular term (column)). Furthermore, I can add a new column to the matrix for the actual score and then sort by it to rank the documents.
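The idea described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `bm25_scores` helper, the `k1` and `b` defaults, and the smoothed IDF variant are all assumptions that do not come from the question, and the small matrix is the target DataFrame from the example further down.

```python
import numpy as np
import pandas as pd

# Document-term matrix from the question's example: rows are documents,
# columns are term IDs, values are term counts.
dtm = pd.DataFrame(
    [[0, 5, 0, 10], [2, 0, 4, 0]],
    index=["D1", "D2"],
    columns=[1, 2, 3, 7],
)

def bm25_scores(dtm, query_terms, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query term IDs.

    k1 and b are commonly used defaults, assumed here rather than taken
    from the question.
    """
    n_docs = len(dtm)
    # Document frequency: number of non-zero rows in each term column.
    doc_freq = (dtm > 0).sum(axis=0)
    # Smoothed IDF variant that stays non-negative (an assumption).
    idf = np.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Document length relative to the average document length.
    doc_len = dtm.sum(axis=1)
    len_norm = 1.0 - b + b * doc_len / doc_len.mean()
    # Ignore query terms that never occur in the collection.
    terms = [t for t in query_terms if t in dtm.columns]
    tf = dtm[terms]
    # Per-term BM25 contribution, then sum over the query terms.
    weighted = tf * (k1 + 1.0) / tf.add(k1 * len_norm, axis=0)
    return weighted.mul(idf[terms], axis=1).sum(axis=1)

scores = bm25_scores(dtm, query_terms=[2, 7])
# Add the score as a new column and sort by it to rank the documents.
ranked = dtm.assign(score=scores).sort_values("score", ascending=False)
print(ranked)
```

For the query terms 2 and 7, only D1 contains either term, so it ranks first and D2 gets a score of zero.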

The document term vectors are stored in a .dat file which is structured like the following:

D1 7:10 2:5
D2 1:2 3:4

where D1 is a document ID and 7:10 means that the term with ID 7 appears 10 times.

At the moment, I am reading it into a list of lists using the following code:

fname = "dtv.dat"
# Use a context manager so the file is closed after reading.
with open(fname, "r") as f:
    l = [x.strip(" \n").split(" ") for x in f]

which yields the following output for the given example:

[['D1', '7:10', '2:5'], ['D2', '1:2', '3:4']]

Given this list of list format, what is the most efficient way to convert this to a Python pandas DataFrame similar to the following:

0      1     2      3      7
D1     0     5      0      10    
D2     2     0      4      0

Solution

Your answer seems to be ok if each document appears only once in the file. Otherwise, the code will overwrite some records in dict d.
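To make the overwrite problem concrete, here is a small sketch. The other answer's code is not reproduced here, so the dict `d` below is only a hypothetical stand-in that maps each document ID to its term counts; `parse` and `merged` are likewise illustrative names.

```python
from collections import Counter, defaultdict

lines = ["D1 1:3 2:2", "D2 1:4", "D1 2:4 4:2"]  # D1 appears twice

def parse(line):
    """Split 'D1 1:3 2:2' into ('D1', {1: 3, 2: 2})."""
    doc_id, *chunks = line.split()
    counts = {}
    for chunk in chunks:
        term, count = chunk.split(":")
        counts[int(term)] = counts.get(int(term), 0) + int(count)
    return doc_id, counts

# Plain assignment: the second D1 line replaces the first one.
d = {}
for line in lines:
    doc_id, counts = parse(line)
    d[doc_id] = counts

# Accumulating instead keeps every occurrence of a document.
merged = defaultdict(Counter)
for line in lines:
    doc_id, counts = parse(line)
    merged[doc_id].update(counts)

print(d["D1"])             # {2: 4, 4: 2}
print(dict(merged["D1"]))  # {1: 3, 2: 6, 4: 2}
```

With plain assignment the counts from the first D1 line are lost, while the accumulating version sums them.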

I think the following would be more general:

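import pandas as pd

fname = 'example.txt'

# Collect one [doc_id, term_id, count] row per "term:count" pair.
full_list = []
with open(fname, "r") as f:
    for line in f:
        # split() handles trailing spaces and repeated whitespace
        arr = line.split()
        for chunk in arr[1:]:
            # converting numbers to ints:
            int_pair = [int(x) for x in chunk.split(":")]
            full_list.append([arr[0], *int_pair])

# Long format: column 0 = document ID, 1 = term ID, 2 = count.
df = pd.DataFrame(full_list)

# Pivot to documents x terms, summing repeated (document, term) pairs.
# The string "sum" avoids the warning recent pandas versions emit for np.sum.
df2 = df.pivot_table(values=2, index=0, columns=1, aggfunc="sum", fill_value=0)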

How it works:

>>> cat 'example.txt'
D1 1:3 2:2 3:3
D2 1:4 2:7 
D2 7:1
D1 2:4 4:2
D1 4:1 4:3
>>> full_list
Out[37]: 
[['D1', 1, 3],
 ['D1', 2, 2],
 ['D1', 3, 3],
 ['D2', 1, 4],
 ['D2', 2, 7],
 ['D2', 7, 1],
 ['D1', 2, 4],
 ['D1', 4, 2],
 ['D1', 4, 1],
 ['D1', 4, 3]]
>>> df
Out[38]: 
    0  1  2
0  D1  1  3
1  D1  2  2
2  D1  3  3
3  D2  1  4
4  D2  2  7
5  D2  7  1
6  D1  2  4
7  D1  4  2
8  D1  4  1
9  D1  4  3
>>> df2
Out[39]: 
1   1  2  3  4  7
0                
D1  3  6  3  6  0
D2  4  7  0  0  1
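
To try this without creating a file, the same steps can be run against an in-memory buffer. The sketch below is a variation under a few assumptions: the helper name `read_doc_term_matrix` and the column names `doc`, `term` and `count` are made up for illustration, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Same content as example.txt above, held in memory.
sample = io.StringIO(
    "D1 1:3 2:2 3:3\n"
    "D2 1:4 2:7 \n"
    "D2 7:1\n"
    "D1 2:4 4:2\n"
    "D1 4:1 4:3\n"
)

def read_doc_term_matrix(handle):
    """Build a documents x terms count matrix from 'ID term:count ...' lines."""
    rows = []
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        doc_id = parts[0]
        for chunk in parts[1:]:
            term, count = chunk.split(":")
            rows.append((doc_id, int(term), int(count)))
    df = pd.DataFrame(rows, columns=["doc", "term", "count"])
    # Sum repeated (doc, term) pairs and fill absent terms with 0.
    return df.pivot_table(
        values="count", index="doc", columns="term",
        aggfunc="sum", fill_value=0,
    )

dtm = read_doc_term_matrix(sample)
print(dtm)
```

Naming the columns is a readability choice; the resulting matrix holds the same counts as df2 above.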
