Formatting issues using Regex and Pandas
Problem description
I don't exactly know how to describe the issue I'm having, so I'll just show it. I have 2 data tables, and I'm using regex to search through and extract values in those tables based on if it matches with the correct word. I'll put the whole script for reference.
import re
import os
import pandas as pd
import numpy as np
os.chdir('C:/Users/Sams PC/Desktop')
f=open('test5.txt', 'w')
NHSQC=pd.read_csv('NHSQC.txt', sep='\s+', header=None)
NHSQC.columns=['Column_1','Column_2','Column_3']
HNCA=pd.read_csv('HNCA.txt', sep='\s+', header=None)
HNCA.columns=['Column_1','Column_2','Column_3','Column_4']
x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]-[H][N]',str(HNCA))
print (NHSQC)
print (HNCA)
print(x)
print (y)
data=[]
label=[]
for i in range (0,6):
    if x[i] in str(NHSQC):
        data2=NHSQC.set_index('Column_1',drop=False)
        data3=(data2.loc[str(x[i]), 'Column_2':'Column_3'])
        data.extend(list(data3))
        a=[x[i]]
        label.extend(a)
        label.extend(a)
        if y[i] in str(HNCA):
            data2=HNCA.set_index('Column_1',drop=False)
            data3=(data2.loc[str(y[i]),'Column_3'])
            data.append(data3)
            a=[y[i]]
            label.extend(a)
        else:
            print('Not Found')
    else:
        print('Not Found')
data6=[label,data]
matrix=data6
data5=np.transpose(matrix)
print(data5)
f.write(str(data5))
f.close()
This script does exactly what I want it to do, and it works as intended when I run it on my test data files, but it fails when I run it on my actual data files. I don't know how to explain the issue, so I'll just show it. This is the output:
Column_1 Column_2 Column_3
0 S31N-HN 114.424 7.390
1 Y32N-HN 121.981 7.468
2 Q33N-HN 120.740 8.578
3 A34N-HN 118.317 7.561
4 G35N-HN 106.764 7.870
.. ... ... ...
89 R170N-HN 118.078 7.992
90 S171N-HN 110.960 7.930
91 R172N-HN 119.112 7.268
92 999_XN-HN 116.703 8.096
93 1000_XN-HN 117.530 8.040
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
.. ... ... ... ...
173 R170N-CA-HN 118.016 60.302 7.999
174 S171N-R170CA-S171HN 110.960 60.239 7.932
175 S171N-CA-HN 110.960 60.946 7.931
176 R172N-S171CA-R172HN 119.112 60.895 7.264
177 R172N-CA-HN 119.112 55.093 7.265
[178 rows x 4 columns]
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN']
Traceback (most recent call last):
File "test.py", line 29, in <module>
if y[i] in str(HNCA):
IndexError: list index out of range
As you can see, there is an issue because my regex for y isn't finding all the values. Furthermore, my x regex is finding far too few matches (only 5 instead of the hundreds there should be). Initially I thought this was just a display thing (it wasn't displaying the hundreds of matches since it would take too long), and I also thought the ... in the middle of the printed table was only for display purposes. However, if I copy part of my HNCA.txt data and save it as a separate file, it fixes the issue.
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
5 Y32N-S31CA-Y32HN 121.981 61.674 7.467
6 Y32N-CA-HN 121.981 60.789 7.469
7 Q33N-Y32CA-Q33HN 120.770 60.775 8.582
8 Q33N-CA-HN 120.701 58.706 8.585
9 A34N-Q33CA-A34HN 118.317 58.740 7.559
10 A34N-CA-HN 118.317 52.260 7.565
11 G35N-A34CA-G35HN 106.764 52.195 7.868
12 G35N-CA-HN 106.764 46.507 7.868
13 R36N-G35CA-R36HN 117.833 46.414 8.111
14 R36N-CA-HN 117.833 54.858 8.112
15 G37N-R36CA-G37HN 110.365 54.808 8.482
16 G37N-CA-HN 110.365 44.901 8.484
17 I55N-CA-HN 118.132 65.360 7.935
18 Y56N-I55CA-Y56HN 123.025 65.464 8.088
19 Y56N-CA-HN 123.025 62.195 8.082
20 A57N-Y56CA-A57HN 120.470 62.159 7.978
21 A57N-CA-HN 120.447 55.522 7.980
22 S72N-K71CA-S72HN 117.239 55.390 8.368
23 S72N-CA-HN 117.259 58.583 8.362
24 C73N-S72CA-C73HN 128.142 58.569 9.690
25 C73N-CA-HN 128.142 61.410 9.677
26 G74N-C73CA-G74HN 116.187 61.439 9.439
27 G74N-CA-HN 116.194 46.528 9.437
28 H75N-G74CA-H75HN 122.640 46.307 9.642
29 H75N-CA-HN 122.621 56.784 9.644
30 C76N-H75CA-C76HN 122.775 56.741 7.152
31 C76N-CA-HN 122.738 57.527 7.146
32 R77N-C76CA-R77HN 120.104 57.532 8.724
33 R77N-CA-HN 120.135 59.674 8.731
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN', 'Y32N-CA-HN', 'Q33N-CA-HN', 'A34N-CA-HN', 'G35N-CA-HN', 'R36N-CA-HN', 'G37N-CA-HN', 'I55N-CA-HN', 'Y56N-CA-HN', 'A57N-CA-HN', 'S72N-CA-HN', 'C73N-CA-HN', 'G74N-CA-HN', 'H75N-CA-HN', 'C76N-CA-HN', 'R77N-CA-HN']
[['S31N-HN' '114.42399999999999']
I won't post the whole output, but as you can see, now it finds all the proper matches. It's also now displaying the entire table, instead of printing ... and only showing the top and bottom halves. I don't exactly understand where this issue is arising from, though. Why does it display only the top and bottom of my table, yet display the entire thing if I copy and paste part of it to another file? And why does regex not search through the entire table even when it isn't all displayed? The fact that it shows the top and bottom makes me think the entire table is there and is only abbreviated to simplify the display, so why would what's being displayed affect what regex is searching?
Recommended answer
Why does Python only display the top and bottom of the table?
Python classes can define two "magic" methods:
- __repr__(), which is supposed to produce a "representation" of the object as a string, and which has a pretty useless default implementation for most objects; and
- __str__(), which is supposed to produce a readable "string" of the object, and which falls back to __repr__().
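The fallback described above can be seen in a minimal sketch. The class names OnlyRepr and Both are hypothetical and exist only for illustration; nothing here is pandas-specific.

```python
class OnlyRepr:
    """Defines __repr__ only, so str() falls back to it."""

    def __repr__(self):
        return "OnlyRepr(...)"


class Both:
    """Defines both methods, so str() and repr() can differ."""

    def __repr__(self):
        return "Both(value=1)"

    def __str__(self):
        return "a Both with value 1"


only_repr_str = str(OnlyRepr())   # uses __repr__, because __str__ is not defined
both_str = str(Both())            # uses __str__
both_repr = repr(Both())          # uses __repr__

print(only_repr_str)
print(both_str)
print(both_repr)
```

A DataFrame behaves like OnlyRepr in this respect: calling str() on it ends up returning whatever its __repr__() produces.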
When the line x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC)) is run, that last str(NHSQC) bit tells Python to call NHSQC.__str__(), which falls back to NHSQC.__repr__().
The developers of the pandas library implemented DataFrame.__repr__() in such a way that, depending on the values of certain global display options (such as display.max_rows), it will produce a string that does not fully represent the underlying data. By default, a DataFrame with more than 60 rows is truncated to show only the first 5 and last 5 rows, with ellipses (...) telling you that there are bits missing, which is also why your 34-row excerpt printed in full. Thus, as you suspected, you are only calling re.findall on the first 5 and last 5 rows of the DataFrame.
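To make the truncation concrete, here is a small sketch using a hypothetical 100-row stand-in for NHSQC (the labels are invented). The display options are pinned to their usual defaults with pd.option_context so the result does not depend on local settings, and DataFrame.to_string() is shown only for contrast, not as the recommended fix.

```python
import re

import pandas as pd

# Hypothetical stand-in for NHSQC: 100 rows with labels like "A00N-HN".
labels = [f"A{i:02d}N-HN" for i in range(100)]
df = pd.DataFrame({"Column_1": labels, "Column_2": [float(i) for i in range(100)]})

pattern = r"[A-Z][0-9][0-9][A-Z]-HN"

# str(df) goes through DataFrame.__repr__(), which honours the display options.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    truncated_matches = re.findall(pattern, str(df))

# to_string() renders every row unless told otherwise.
full_matches = re.findall(pattern, df.to_string())

print(len(truncated_matches))  # only the rows that survived truncation
print(len(full_matches))       # one match per row
```

With these settings the first search sees only the 5 head rows and 5 tail rows, while the second sees all 100.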
Using str(NHSQC) is probably not what you intend to do. This converts the entire DataFrame into an (incomplete) string representation, then runs the regular expression search over that entire string. That's extremely inefficient, so why not use the Series.str methods instead?
For instance, you appear to be lining up Column_2 and Column_3 of the rows from DataFrame NHSQC where the value of Column_1 matches the first regex, in order, with Column_3 of the rows from DataFrame HNCA where the value of Column_1 matches the second regex, right?
df1 = NHSQC.loc[NHSQC["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-HN"))]
df2 = HNCA.loc[HNCA["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-CA-HN")), ["Column_1", "Column_3"]]
These lines use Series.str.match on Column_1 to keep only the rows whose label matches each regex.
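As a rough illustration of how Series.str.match builds the row filter, the sketch below uses a handful of labels taken from the printed tables. The second pattern, which allows any number of digits, is an assumption on my part and not something the original regex does; it is included because labels such as R170N-HN have three digits and are skipped by the two-digit pattern.

```python
import pandas as pd

# A few labels copied from the printed NHSQC table.
labels = pd.Series(["S31N-HN", "R170N-HN", "999_XN-HN", "Y32N-HN"])

# str.match anchors at the start of each string and returns a boolean Series.
two_digit = labels.str.match(r"[A-Z][0-9][0-9][A-Z]-HN")

# Assumed variant: one or more digits, so three-digit residues also match.
any_digits = labels.str.match(r"[A-Z][0-9]+[A-Z]-HN")

print(two_digit.tolist())
print(any_digits.tolist())

# The boolean Series can be passed to .loc (or plain indexing) to select rows.
print(labels[any_digits].tolist())
```

Whether the three-digit labels should be included depends on what you want to extract, so treat the second pattern as a suggestion only.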
long1 = df1.melt(id_vars=["Column_1"]).drop("variable", axis="columns")
long2 = df2.rename(columns={"Column_3": "value"})
The first line uses DataFrame.melt to turn the three columns of df1 into a "longer" version with three columns: Column_1 as an identifier, variable holding either the string "Column_2" or "Column_3", and value, which contains the thing you actually care about and are printing at the end of your program. You don't use the column name anymore, so variable is dropped. The DataFrame df2 doesn't need to be converted to a longer format because it only has two columns, so we just rename Column_3 to value.
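A small sketch of what the melt step produces, using two rows copied from the printed NHSQC table as a stand-in for df1:

```python
import pandas as pd

# Two rows from the printed NHSQC output, standing in for df1.
df1 = pd.DataFrame(
    {
        "Column_1": ["S31N-HN", "Y32N-HN"],
        "Column_2": [114.424, 121.981],
        "Column_3": [7.390, 7.468],
    }
)

# melt stacks Column_2 and Column_3 into a single "value" column and records
# the source column name in "variable", which is then dropped.
long1 = df1.melt(id_vars=["Column_1"]).drop("variable", axis="columns")

print(long1)
```

Note that melt emits all of the Column_2 values first and then all of the Column_3 values, so each label appears twice but the two rows for a given label are not adjacent, unlike in the original loop.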
extra_long = pd.concat([long1, long2])
print(extra_long.to_numpy())
This just concatenates the two long DataFrames together, turns the result into a NumPy array, then prints it out.
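One more caution, offered as a sketch rather than as part of the answer above: the original script writes str(data5) to a file, and NumPy can also abbreviate the string form of large arrays, so the same kind of truncation could reappear there. Assuming a plain delimited text file is acceptable, DataFrame.to_csv writes every row. The small frames and the tab separator below are illustrative choices, and an in-memory buffer stands in for test5.txt.

```python
import io

import pandas as pd

# Tiny stand-ins for long1 and long2, with values from the printed tables.
long1 = pd.DataFrame({"Column_1": ["S31N-HN", "S31N-HN"], "value": [114.424, 7.390]})
long2 = pd.DataFrame({"Column_1": ["S31N-CA-HN"], "value": [61.717]})

# ignore_index=True renumbers the rows instead of repeating 0, 1, 0, ...
extra_long = pd.concat([long1, long2], ignore_index=True)

# Write every row; a real script would pass a file path instead of a buffer.
buffer = io.StringIO()
extra_long.to_csv(buffer, sep="\t", index=False, header=False)
lines = buffer.getvalue().splitlines()

print(extra_long.to_numpy().shape)
print(len(lines))
```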