使用Regex和Pandas格式化问题 [英] Formatting issues using Regex and Pandas

查看:81
本文介绍了使用Regex和Pandas格式化问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不完全知道如何描述我遇到的问题,所以我只展示一下. 我有2个数据表,并且我正在使用正则表达式根据是否与正确的单词匹配来搜索并提取这些表中的值.我将整个脚本作为参考.

I don't exactly know how to describe the issue I'm having, so I'll just show it. I have 2 data tables, and I'm using regex to search through and extract values in those tables based on if it matches with the correct word. I'll put the whole script for reference.

import re
import os
import pandas as pd
import numpy as np

os.chdir('C:/Users/Sams PC/Desktop')
f=open('test5.txt', 'w')

NHSQC=pd.read_csv('NHSQC.txt', sep='\s+', header=None)
NHSQC.columns=['Column_1','Column_2','Column_3']
HNCA=pd.read_csv('HNCA.txt', sep='\s+', header=None)
HNCA.columns=['Column_1','Column_2','Column_3','Column_4']
x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]-[H][N]',str(HNCA))
print (NHSQC)
print (HNCA)
print(x)
print (y)
data=[]
label=[]
for i in range (0,6):
    if x[i] in str(NHSQC):
        data2=NHSQC.set_index('Column_1',drop=False)
        data3=(data2.loc[str(x[i]), 'Column_2':'Column_3'])
        data.extend(list(data3))
        a=[x[i]]
        label.extend(a)
        label.extend(a)
        if y[i] in str(HNCA):
            data2=HNCA.set_index('Column_1',drop=False)
            data3=(data2.loc[str(y[i]),'Column_3'])
            data.append(data3)
            a=[y[i]]
            label.extend(a)

        else:
            print('Not Found')
    else:
        print('Not Found')


data6=[label,data]
matrix=data6
data5=np.transpose(matrix)
print(data5)

f.write(str(data5))
f.close()

此脚本完全可以执行我想要的操作,并且在运行测试数据文件时可以按预期工作,但是在运行实际数据文件时会失败.我不知道如何解释这个问题,所以我只展示它.这是输出:

This script, does exactly what I want it to do, and it works as intended when I run my test data files, but fails when I run my actual data files. I don't know how to explain the issue, so I'll just show it. This is the output:

     Column_1  Column_2  Column_3
0      S31N-HN   114.424     7.390
1      Y32N-HN   121.981     7.468
2      Q33N-HN   120.740     8.578
3      A34N-HN   118.317     7.561
4      G35N-HN   106.764     7.870
..         ...       ...       ...
89    R170N-HN   118.078     7.992
90    S171N-HN   110.960     7.930
91    R172N-HN   119.112     7.268
92   999_XN-HN   116.703     8.096
93  1000_XN-HN   117.530     8.040

[94 rows x 3 columns]
                Column_1 Column_2 Column_3 Column_4
0             Assignment       w1       w2       w3
1       S31N-A30CA-S31HN  114.424   54.808    7.393
2       S31N-A30CA-S31HN  126.854   53.005    9.277
3             S31N-CA-HN  114.424   61.717    7.391
4             S31N-HA-HN  126.864   59.633    9.287
..                   ...      ...      ...      ...
173          R170N-CA-HN  118.016   60.302    7.999
174  S171N-R170CA-S171HN  110.960   60.239    7.932
175          S171N-CA-HN  110.960   60.946    7.931
176  R172N-S171CA-R172HN  119.112   60.895    7.264
177          R172N-CA-HN  119.112   55.093    7.265

[178 rows x 4 columns]
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN']
Traceback (most recent call last):
  File "test.py", line 29, in <module>
    if y[i] in str(HNCA):
IndexError: list index out of range

如您所见,存在一个问题,因为我的y的正则表达式未找到所有值.此外,我的x正则表达式有多少个问题(只有5个而不是应有的数百个).最初,我认为这只是一个显示内容(它不会显示数百个匹配项,因为这会花费很长时间),而且我还以为...中间的...打印表格也是出于显示目的.但是,如果我复制了部分HNCA.txt数据并将其另存为单独的文件,则可以解决此问题.

As you can see, there is an issue because my regex for y isn't finding all the values. Furthermore, there is an issue with how many my x regex is finding (only 5 instead of the hundreds it should be). Initially I thought this was just a display thing (it wasn't displaying the hundreds of matches since it would take too long), and I also thought the ... in the middle of it printing my table was also for display purposes. However, if I copy part of my HNCA.txt data and save it as a separate file, it fixes the issue.

[94 rows x 3 columns]
            Column_1 Column_2 Column_3 Column_4
0         Assignment       w1       w2       w3
1   S31N-A30CA-S31HN  114.424   54.808    7.393
2   S31N-A30CA-S31HN  126.854   53.005    9.277
3         S31N-CA-HN  114.424   61.717    7.391
4         S31N-HA-HN  126.864   59.633    9.287
5   Y32N-S31CA-Y32HN  121.981   61.674    7.467
6         Y32N-CA-HN  121.981   60.789    7.469
7   Q33N-Y32CA-Q33HN  120.770   60.775    8.582
8         Q33N-CA-HN  120.701   58.706    8.585
9   A34N-Q33CA-A34HN  118.317   58.740    7.559
10        A34N-CA-HN  118.317   52.260    7.565
11  G35N-A34CA-G35HN  106.764   52.195    7.868
12        G35N-CA-HN  106.764   46.507    7.868
13  R36N-G35CA-R36HN  117.833   46.414    8.111
14        R36N-CA-HN  117.833   54.858    8.112
15  G37N-R36CA-G37HN  110.365   54.808    8.482
16        G37N-CA-HN  110.365   44.901    8.484
17        I55N-CA-HN  118.132   65.360    7.935
18  Y56N-I55CA-Y56HN  123.025   65.464    8.088
19        Y56N-CA-HN  123.025   62.195    8.082
20  A57N-Y56CA-A57HN  120.470   62.159    7.978
21        A57N-CA-HN  120.447   55.522    7.980
22  S72N-K71CA-S72HN  117.239   55.390    8.368
23        S72N-CA-HN  117.259   58.583    8.362
24  C73N-S72CA-C73HN  128.142   58.569    9.690
25        C73N-CA-HN  128.142   61.410    9.677
26  G74N-C73CA-G74HN  116.187   61.439    9.439
27        G74N-CA-HN  116.194   46.528    9.437
28  H75N-G74CA-H75HN  122.640   46.307    9.642
29        H75N-CA-HN  122.621   56.784    9.644
30  C76N-H75CA-C76HN  122.775   56.741    7.152
31        C76N-CA-HN  122.738   57.527    7.146
32  R77N-C76CA-R77HN  120.104   57.532    8.724
33        R77N-CA-HN  120.135   59.674    8.731
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN', 'Y32N-CA-HN', 'Q33N-CA-HN', 'A34N-CA-HN', 'G35N-CA-HN', 'R36N-CA-HN', 'G37N-CA-HN', 'I55N-CA-HN', 'Y56N-CA-HN', 'A57N-CA-HN', 'S72N-CA-HN', 'C73N-CA-HN', 'G74N-CA-HN', 'H75N-CA-HN', 'C76N-CA-HN', 'R77N-CA-HN']
[['S31N-HN' '114.42399999999999']

我不会发布整个输出,但是正如您所看到的,现在它可以找到所有适当的匹配项.现在,它还显示整个表格,而不是显示...,而仅显示上半部分和下半部分.我不完全了解这个问题是从哪里产生的.为什么只显示表格的上半部分和下半部分,但是如果我将其复制并粘贴到另一个文件中,则会显示整个内容.为什么regex即使未显示也不会搜索整个表(基于它显示了上半部分和下半部分的事实,使我认为整个表都在那儿,但由于它试图简化表格的内容,因此它仍然没有显示出来)显示,但是为什么显示的内容会影响正则表达式正在搜索的内容?

I won't post the whole output, but as you can see, now it finds all the proper matches. Its also now displaying the entire table, instead of doing ... and only showing the top and bottom halves. I don't exactly understand where this issue is arising from though. Why is it displaying only the top and bottom half of my table, but if I copy and paste it to another file, it displays the entire thing. Why does regex not search through the entire table even if it isn't displayed (based on the fact it shows the top and bottom half, makes me think the entire table is there, but again its not showing it because its trying to simplify the display, but why would whats being displayed effect what regex is searching)?

推荐答案

为什么python只显示表格的顶部和底部?

Python类可以定义两个魔术"方法:

  • __repr__() ,应该产生对象作为字符串的表示形式",并且对于大多数对象而言,它具有非常无用的默认实现;和
  • __str__() ,应该产生对象的可读的字符串",后退到__repr__().
  • __repr__(), which is supposed to produce a "representation" of the object as a string, and which has a pretty useless default implementation for most objects; and
  • __str__(), which is supposed to produce a readable "string" of the object, and which falls back to __repr__().

当行x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC))运行时,最后一个str(NHSQC)位告诉python调用NHSCQ.__str__(),回退到NHSCQ.__repr__(),您可以阅读有关

When the line x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC)) is run, that last str(NHSQC) bit tells python to call NHSCQ.__str__(), which falls back to NHSCQ.__repr__(), which you can read about here.

pandas库的开发人员以

The developers of the pandas library implemented DataFrame.__repr__() in such a way that, depending on the values of certain global variables, will produce a string that does not fully represent the underlying data. The defaults truncate the DataFrame to show only the first 5 and last 5 rows with ellipses (...) telling you that there are bits missing. Thus, as you suspected, you are only calling re.findall on the first 5 and last 5 rows of the DataFrame.

使用str(NHSQC)可能不是您打算做的.这会将整个DataFrame转换为(不完整的)字符串表示形式,然后在该整个字符串上运行正则表达式搜索.这效率极低,所以为什么不使用 Series.str 方法代替?

Using str(NHSQC) is probably not what you intend to do. This converts the entire DataFrame into a (incomplete) string representation, then runs the regular expression search over that entire string. That's extremely inefficient, so why not use the Series.str methods instead?

例如,您似乎正在排列Column_2Column_3来自DataFrame NHSQC的行,其中Column_1的值与第一个正则表达式匹配,并且与Column_3来自DataFrame HNCA的行匹配Column_1的值与第二个正则表达式匹配,对吧?

For instance, you appear to be lining up Column_2 and Column_3 of rows from DataFrame NHSQC where the value of Column_1 matches the first regex in order with Column_3 of rows from DataFrame HNCA where the value of Column_1 matches the second regex, right?

df1 = NHSQC.loc[NHSQC["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-HN"))]
df2 = HNCA.loc[HNCA["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-CA-HN")), ["Column_1", "Column_3"]]

这些行将使用

第一行使用 DataFrame.melt df1的三列转换为更长"的版本,其中以Column_1列作为标识符,variable作为字符串"Column_2""Column_3",而value包含您所要的内容真正关心并在程序结尾打印.您不再使用列名,因此它是

The first line uses DataFrame.melt to turn the three columns of df1 into a "longer" version with columns Column_1 as an identifier, variable as either the strings "Column_2" or "Column_3", and value, containing the thing you actually care about and are printing at the end of your program. You don't use the column name anymore, so it is dropped. The DataFrame df2 doesn't need to be converted to a longer format because it only has two columns, so we just rename Column_3 to value.

extra_long = pd.concat([long1, long2])
print(extra_long.to_numpy())

这只是串联将两个长的DataFrame在一起,将它们变成

This just concatenates the two long DataFrames together, turns them into a numpy array, then prints them out.

这篇关于使用Regex和Pandas格式化问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆