Using Pandas and Regex to search through and extract values of a txt file


Question

I have 2 data tables that I am attempting to extract values from. Here is my current script.

import re 
import os
import pandas as pd

os.chdir('C:/Users/Sams PC/Desktop')

# Raw strings keep '\s' from being read as an invalid string escape.
test1=pd.read_csv('test1.txt', sep=r'\s+', header=None)
test1.columns=['Column_1','Column_2','Column_3']
test2=pd.read_csv('test2.txt', sep=r'\s+', header=None)
test2.columns=['Column_1','Column_2','Column_3','Column_4']

if 'S31N' in test1:
    data2=nhsqc[['Column_1','Column_2']].copy()
    if 'S31N-CA-HN' in test2:
        data2=nhsqc[['Column_3']].copy()
    else:
        print('Not Found')      
else:
    print('Not Found')


print(test1)
print (test2)

With the following output:

Not Found
0  S31N-HN   114.424     7.390
1  Y32N-HN   121.981     7.468
           Column_1  Column_2  Column_3  Column_4
0  S31N-A30CA-S31HN   114.424    54.808     7.393
1  S31N-A30CA-S31HN   126.854    53.005     9.277
2        S31N-CA-HN   114.424    61.717     7.391
3        S31N-HA-HN   126.864    59.633     9.287
4  Y32N-S31CA-Y32HN   121.981    61.674     7.467
5        Y32N-CA-HN   121.981    60.789     7.469
6  Q33N-Y32CA-Q33HN   120.770    60.775     8.582

I am able to organize the tables using pandas. Next, I want to extract values from the columns associated with, say, 'S31N'. However, as you can see, my if line does not find S31N even though it exists in my data table. If I change that value to a header (if 'Column_1' in test1:), it works. I don't understand why it searches only the column headers and not the actual table.

Furthermore, when my if line does work (using a column header), the second if line overwrites the data2 table from the first if line. How can I add it to data2 as an extra column instead of overwriting it?

I removed the 2nd half since that issue was resolved. However, the main issue still stands: my script is still unable to find my values. Updated script:

x=re.findall('[A-Z][0-9][0-9][A-Z]',str(test1))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]',str(test2))
print (x,y)

for i in range (0,2):
    if x[i] in test1:
        data2=nhsqc[['Column_1','Column_2']].copy()
        if y[i] in test2:
            data2=nhsqc[['Column_3']].copy()
            print (data2)
        else:   
            print('Not Found')      
    else:
        print('Not Found')


print(x[i])

Output:

['S31N', 'Y32N'] ['S31N-CA', 'Y32N-CA']
Not Found
Not Found
Y32N

Recommended answer

This might get you closer. The problem is likely the type of test1 and test2: both are DataFrames, and the in operator on a DataFrame checks the column labels rather than the cell values. Converting them to strings wherever you test membership, with str(test1) and str(test2), might be one way to make it work.

x=re.findall('[A-Z][0-9][0-9][A-Z]',str(test1))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]',str(test2))
print (x,y)

for i in range (0,2):
    if x[i] in str(test1):
        data2=nhsqc[['Column_1','Column_2']].copy()
        if y[i] in str(test2):
            data2=nhsqc[['Column_3']].copy()
            print (data2)
        else:   
            print('Not Found')      
    else:
        print('Not Found')


print(x[i])
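
To illustrate why the original check only matched headers, here is a minimal sketch. It assumes a small stand-in for test1 built inline rather than read from test1.txt, and it uses Series.str.contains as one possible way to search the cell values directly. Searching str(test1) also works for small tables, but it searches the printed representation, which pandas may truncate for large frames.

```python
import pandas as pd

# Stand-in for test1, built inline instead of read from test1.txt (assumption).
test1 = pd.DataFrame({
    'Column_1': ['S31N-HN', 'Y32N-HN'],
    'Column_2': [114.424, 121.981],
    'Column_3': [7.390, 7.468],
})

# `in` on a DataFrame tests the column labels, not the cell values.
print('Column_1' in test1)   # True
print('S31N' in test1)       # False

# Plain-text substring search over the values of one column.
mask = test1['Column_1'].str.contains('S31N', regex=False)
print(mask.any())            # True
print(test1[mask])           # only the S31N-HN row
```

The boolean mask can then be used to select the matching rows instead of copying whole columns.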

Mock test

import re
test1 = '''
0  S31N-HN   114.424     7.390
1  Y32N-HN   121.981     7.468
'''

test2 = '''
           Column_1  Column_2  Column_3  Column_4
0  S31N-A30CA-S31HN   114.424    54.808     7.393
1  S31N-A30CA-S31HN   126.854    53.005     9.277
2        S31N-CA-HN   114.424    61.717     7.391
3        S31N-HA-HN   126.864    59.633     9.287
4  Y32N-S31CA-Y32HN   121.981    61.674     7.467
5        Y32N-CA-HN   121.981    60.789     7.469
6  Q33N-Y32CA-Q33HN   120.770    60.775     8.582
'''

x = re.findall('[A-Z][0-9][0-9][A-Z]', str(test1))
y = re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]', str(test2))
print(x, y)

for i in range(0, 2):
    if x[i] in str(test1):
        print(x[i])
        # nhsqc is not defined in this mock test, so the copy is commented out:
        # data2 = nhsqc[['Column_1', 'Column_2']].copy()
        if y[i] in str(test2):
            # data2 = nhsqc[['Column_3']].copy()
            print(y[i])
        else:
            print('Not Found')
    else:
        print('Not Found')

Output

['S31N', 'Y32N'] ['S31N-CA', 'Y32N-CA']
S31N
S31N-CA
Y32N
Y32N-CA
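
As a possible alternative to string searching, the sketch below stays inside pandas and also addresses the earlier question of adding a value as an extra column instead of overwriting data2. It is not the approach from the answer above. It assumes the two tables shown in the question, built inline, and that the wanted rows of test2 are those labelled like 'S31N-CA-HN'. The column name key is hypothetical and introduced only for illustration.

```python
import pandas as pd

# Inline copies of the tables printed in the question (assumption: same values).
test1 = pd.DataFrame({
    'Column_1': ['S31N-HN', 'Y32N-HN'],
    'Column_2': [114.424, 121.981],
    'Column_3': [7.390, 7.468],
})
test2 = pd.DataFrame({
    'Column_1': ['S31N-A30CA-S31HN', 'S31N-A30CA-S31HN', 'S31N-CA-HN',
                 'S31N-HA-HN', 'Y32N-S31CA-Y32HN', 'Y32N-CA-HN',
                 'Q33N-Y32CA-Q33HN'],
    'Column_2': [114.424, 126.854, 114.424, 126.864, 121.981, 121.981, 120.770],
    'Column_3': [54.808, 53.005, 61.717, 59.633, 61.674, 60.789, 60.775],
    'Column_4': [7.393, 9.277, 7.391, 9.287, 7.467, 7.469, 8.582],
})

# Same pattern as in the question, anchored at the start and captured as a group.
key_pattern = r'^([A-Z][0-9][0-9][A-Z])'

# Columns 1 and 2 of test1, plus a hypothetical 'key' column such as 'S31N'.
left = test1[['Column_1', 'Column_2']].copy()
left['key'] = left['Column_1'].str.extract(key_pattern, expand=False)

# Rows of test2 whose label is exactly '<key>-CA-HN'.
is_ca = test2['Column_1'].str.contains(r'^[A-Z][0-9][0-9][A-Z]-CA-HN$')
right = test2.loc[is_ca, ['Column_1', 'Column_3']].copy()
right['key'] = right['Column_1'].str.extract(key_pattern, expand=False)

# The merge adds Column_3 from test2 as an extra column instead of overwriting.
data2 = left.merge(right[['key', 'Column_3']], on='key', how='left')
print(data2)
```

With how='left', a key from test1 that has no matching row in test2 is kept and gets NaN in Column_3, which takes the place of the 'Not Found' branch.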
