如何通过比较 pandas 中的两个数据框来分割输入 [英] How to split the input based by comparing two dataframes in pandas

查看:61
本文介绍了如何通过比较 pandas 中的两个数据框来分割输入的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图从数据库中获取两个表中的输入和关键字.因此,我要使用熊猫读取两个表,并使用相应的列来拆分数据,然后将输出写回到DB的同一表中.

I am trying to take the input and keyword in the two tables from the database. So am using pandas to read both the tables and using the respective columns for splitting up of data and then write back the output in the same table in DB.

我的输入:

原始输入

LARIDENT SRL
MIZUHO Corporation Gosen Factory
ZIMMER MANUFACTURING BV
GALT MEDICAL CORP
MIZUHO AMERICA INC
AVENT S de RL de CV
LUV N CARE LTD
STERIS ISOMEDIX PUERTO RICO INC
MEDISTIM INC
Cadence Science Inc
TECHNOLOGIES SA
AMG Mdicale Co Inc

我的关键字表:

**Name_Extension**                 **Company_Type**        **Priority**
  co llc                             Company LLC                2
  Pvt ltd                            Private Limited            8
  Corp                               Corporation                4
  CO Ltd                             Company Limited            3
  inc                                Incorporated               5
  CO                                 Company                    1
  ltd                                Limited                    7
  llc                                LLC                        6
  Corporation                        Corporation                4
  & Co                               Company                    1
  Company Limited                    Company Limited            3
  Limited                            Limited                    7
  Co inc                             Company Incorporated       9
  AB                                  AB                        10
  SA                                  SA                        11
  S A                                 SA                        11
  GmbH                                GmbH                      12
  Sdn Bhd                             Sdn Bhd                   13
  llp                                 LLP                       14
  co llp                              LLP                       14
  SA DE CV                           SA DE CV                   19
  Company                            Company                    1 
  Coinc                              Company Incorporated       9
  Coltd                              Company Limited            3

因此,如果输入(在表1中)具有任何名称扩展名(在表2中),则必须将其拆分并作为Core_input和Type_input列放入,其中,核心输入将包含公司名称,而type_input将包含公司类型(来自表2第2列),并且必须优先检查.

So if the input(in table 1) has any of the name extension(this is in table 2) then it has to be split and put in as Core_input and Type_input columns where core input will contain the company names and type_input will contain the company type(from table 2 column 2) and it has to be checked with the priority.

我的输出将是:

Core_Input                                         Type_input
    NULL                                               NULL
    NULL                                               NULL
    NULL                                               NULL
   GALT MEDICAL                                    Corporation
   MIZUHO AMERICA                                   Incorporated
     NULL                                               NULL
   LUV N CARE                                         Limited
 STERIS ISOMEDIX PUERTO RICO                         Incorporated
    MEDISTIM                                         Incorporated
   Cadence Science                                   Incorporated

我的代码:

k1=[]
k2=[]

df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)

for inp1 in df1['Name_Extension']:
    k1.append(inp1.strip())

for inp2 in df1['Company_Type']:
    k2.append(inp2.strip())


p=1
p1=max(df1['Priority'])

for k1 in df1['Name_Extension']:
    for k2 in df1['Company_Type']:
      #for pr in df1['Priority']:
         for i in df['Cleansed_Input']:
            while p<=p1:
                if re.search(r'[^>]*?\s'+str(k1).strip(),str(i).strip(),re.I) and (p == (pr for pr in 
                                                                               df1['Priority'])):
                    splits = i.str.split(str(k1),re.I)

                    df['Core_Input'] = splits[0] #df['Cleansed_Input'].str.replace(str(k1),'',re.I) 

                    df['Type_input'] = str(k2)
                 p=p+1
data.to_sql('Testtable', con=engine, if_exists='replace',index= False)

感谢您的帮助.

df=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)

df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)

ext_list = df1['Name_Extension']
type_list =df1['Company_Type']

for i, j in df.iterrows():
    comp_name = df['Original_Input'][i]
    for idx, ex in enumerate(ext_list):
        if re.search(rf'\b{ex}\b', comp_name,re.IGNORECASE):
            df['Core_Input'] = type_list[idx]
            df['Type_input'].iloc[i] = comp_type

print(df)
df.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)


Edit:
ext_list = df1['Name_Extension']
type_list =df1['Company_Type']

for i, j in enumerate(df['Cleansed_Input']):
    comp_name = df['Cleansed_Input'][i]

    for idx, ex in enumerate(ext_list):
        comp_name.replace('.,','')
        if re.search(rf'(\b{ex}\b)', comp_name, re.I):
            comp_type = type_list[idx]
            df['Type_input'].iloc[i]= comp_type
            # Delete the extension name from company name
            updated_comp_name = 
            re.sub(rf'(\b{str(ex).upper()}\b)','',str(comp_name).upper())
            # Above regex is leaving space post word removal adding space 
            from next word becomes 2 spaces
            updated_comp_name = str(updated_comp_name).replace('  ',' ')
            # Update the company name
            df['Core_Input'].iloc[i] = updated_comp_name

推荐答案

希望在下面几行中找到您的解决方案...由于某些原因,我没有使用SQL,但是您使用了2个不同的Excel来处理您的数据..您需要在输入表中添加一列Type,然后再运行代码...

Hi Hope below lines help you to get the solution...i am not using SQL due to some reason, but taken your data in 2 different excels...you need to add a column Type in input table before run the code...

import pandas as pd
import numpy
import re

input_df = pd.read_excel('input.xlsx',sheet_name='Sheet1')
exts_df = pd.read_excel('exts.xlsx', sheet_name='Sheet1')

# Check if correct data is loaded
print(input_df.head())

ext_list = exts_df['Name_Extension']
type_list =exts_df['Company_Type']

for i, j in input_df.iterrows():
    comp_name = input_df['Company Names'][i]
    for idx, ex in enumerate(ext_list):
        if re.search(rf'\b{ex}\b', comp_name,re.IGNORECASE):
            comp_type = type_list[idx]
            input_df['Type'].iloc[i] = comp_type
            # Delete teh extension name from company name
            updated_comp_name = re.sub(rf'\b{str(ex).upper()}\b','',str(comp_name).upper())
            # Above regex is leaving space post word removal adding space from next word becomes 2 spaces
            updated_comp_name = str(updated_comp_name).replace('  ',' ')
            # Update the company name
            input_df['Company Names'].iloc[i] = updated_comp_name

print(input_df)
input_df.to_excel('output.xlsx', index=False)

输入公司名称列映射Company_Type的输出帖子删除扩展名...

output post removal extension from input Company Name Column mapping Company_Type ...

...

这篇关于如何通过比较 pandas 中的两个数据框来分割输入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆