如何通过比较 pandas 中的两个数据框来分割输入 [英] How to split the input based by comparing two dataframes in pandas
问题描述
我试图从数据库中获取两个表中的输入和关键字.因此,我要使用熊猫读取两个表,并使用相应的列来拆分数据,然后将输出写回到DB的同一表中.
I am trying to take the input and keyword in the two tables from the database. So am using pandas to read both the tables and using the respective columns for splitting up of data and then write back the output in the same table in DB.
我的输入:
原始输入
LARIDENT SRL
MIZUHO Corporation Gosen Factory
ZIMMER MANUFACTURING BV
GALT MEDICAL CORP
MIZUHO AMERICA INC
AVENT S de RL de CV
LUV N CARE LTD
STERIS ISOMEDIX PUERTO RICO INC
MEDISTIM INC
Cadence Science Inc
TECHNOLOGIES SA
AMG Mdicale Co Inc
我的关键字表:
**Name_Extension** **Company_Type** **Priority**
co llc Company LLC 2
Pvt ltd Private Limited 8
Corp Corporation 4
CO Ltd Company Limited 3
inc Incorporated 5
CO Company 1
ltd Limited 7
llc LLC 6
Corporation Corporation 4
& Co Company 1
Company Limited Company Limited 3
Limited Limited 7
Co inc Company Incorporated 9
AB AB 10
SA SA 11
S A SA 11
GmbH GmbH 12
Sdn Bhd Sdn Bhd 13
llp LLP 14
co llp LLP 14
SA DE CV SA DE CV 19
Company Company 1
Coinc Company Incorporated 9
Coltd Company Limited 3
因此,如果输入(在表1中)具有任何名称扩展名(在表2中),则必须将其拆分并作为Core_input和Type_input列放入,其中,核心输入将包含公司名称,而type_input将包含公司类型(来自表2第2列),并且必须优先检查.
So if the input(in table 1) has any of the name extension(this is in table 2) then it has to be split and put in as Core_input and Type_input columns where core input will contain the company names and type_input will contain the company type(from table 2 column 2) and it has to be checked with the priority.
我的输出将是:
Core_Input Type_input
NULL NULL
NULL NULL
NULL NULL
GALT MEDICAL Corporation
MIZUHO AMERICA Incorporated
NULL NULL
LUV N CARE Limited
STERIS ISOMEDIX PUERTO RICO Incorporated
MEDISTIM Incorporated
Cadence Science Incorporated
我的代码:
k1=[]
k2=[]
df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)
for inp1 in df1['Name_Extension']:
k1.append(inp1.strip())
for inp2 in df1['Company_Type']:
k2.append(inp2.strip())
p=1
p1=max(df1['Priority'])
for k1 in df1['Name_Extension']:
for k2 in df1['Company_Type']:
#for pr in df1['Priority']:
for i in df['Cleansed_Input']:
while p<=p1:
if re.search(r'[^>]*?\s'+str(k1).strip(),str(i).strip(),re.I) and (p == (pr for pr in
df1['Priority'])):
splits = i.str.split(str(k1),re.I)
df['Core_Input'] = splits[0] #df['Cleansed_Input'].str.replace(str(k1),'',re.I)
df['Type_input'] = str(k2)
p=p+1
data.to_sql('Testtable', con=engine, if_exists='replace',index= False)
感谢您的帮助.
df=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)
ext_list = df1['Name_Extension']
type_list =df1['Company_Type']
for i, j in df.iterrows():
comp_name = df['Original_Input'][i]
for idx, ex in enumerate(ext_list):
if re.search(rf'\b{ex}\b', comp_name,re.IGNORECASE):
df['Core_Input'] = type_list[idx]
df['Type_input'].iloc[i] = comp_type
print(df)
df.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
Edit:
ext_list = df1['Name_Extension']
type_list =df1['Company_Type']
for i, j in enumerate(df['Cleansed_Input']):
comp_name = df['Cleansed_Input'][i]
for idx, ex in enumerate(ext_list):
comp_name.replace('.,','')
if re.search(rf'(\b{ex}\b)', comp_name, re.I):
comp_type = type_list[idx]
df['Type_input'].iloc[i]= comp_type
# Delete the extension name from company name
updated_comp_name =
re.sub(rf'(\b{str(ex).upper()}\b)','',str(comp_name).upper())
# Above regex is leaving space post word removal adding space
from next word becomes 2 spaces
updated_comp_name = str(updated_comp_name).replace(' ',' ')
# Update the company name
df['Core_Input'].iloc[i] = updated_comp_name
推荐答案
希望在下面几行中找到您的解决方案...由于某些原因,我没有使用SQL,但是您使用了2个不同的Excel来处理您的数据..您需要在输入表中添加一列Type,然后再运行代码...
Hi Hope below lines help you to get the solution...i am not using SQL due to some reason, but taken your data in 2 different excels...you need to add a column Type in input table before run the code...
import pandas as pd
import numpy
import re
input_df = pd.read_excel('input.xlsx',sheet_name='Sheet1')
exts_df = pd.read_excel('exts.xlsx', sheet_name='Sheet1')
# Check if correct data is loaded
print(input_df.head())
ext_list = exts_df['Name_Extension']
type_list =exts_df['Company_Type']
for i, j in input_df.iterrows():
comp_name = input_df['Company Names'][i]
for idx, ex in enumerate(ext_list):
if re.search(rf'\b{ex}\b', comp_name,re.IGNORECASE):
comp_type = type_list[idx]
input_df['Type'].iloc[i] = comp_type
# Delete teh extension name from company name
updated_comp_name = re.sub(rf'\b{str(ex).upper()}\b','',str(comp_name).upper())
# Above regex is leaving space post word removal adding space from next word becomes 2 spaces
updated_comp_name = str(updated_comp_name).replace(' ',' ')
# Update the company name
input_df['Company Names'].iloc[i] = updated_comp_name
print(input_df)
input_df.to_excel('output.xlsx', index=False)
输入公司名称列映射Company_Type的输出帖子删除扩展名...
output post removal extension from input Company Name Column mapping Company_Type ...
...
这篇关于如何通过比较 pandas 中的两个数据框来分割输入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!