根据条件在一个数据帧中拆分并替换为 pandas 中的另一个数据帧 [英] Split and replace in one dataframe based on a condition with another dataframe in pandas

查看:32
本文介绍了根据条件在一个数据帧中拆分并替换为 pandas 中的另一个数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据框,都包含 sql 表.

I have two dataframes and both contains sql table.

这是我的第一个数据框

Original_Input           Cleansed_Input        Core_Input    Type_input
TECHNOLOGIES S.A         TECHNOLOGIES SA        
A & J INDUSTRIES, LLC    A J INDUSTRIES LLC     
A&S DENTAL SERVICES      AS DENTAL SERVICES     
A.M.G Médicale Inc       AMG Mdicale Inc        
AAREN SCIENTIFIC         AAREN SCIENTIFIC   

我的第二个数据框是:

Name_Extension     Company_Type     Priority
co llc             Company LLC       2
Pvt ltd            Private Limited   8
Corp               Corporation       4
CO Ltd             Company Limited   3
inc                Incorporated      5
CO                 Company           1

我删除了标点符号、ASCII 和数字,并将这些数据放在 df1cleansed_input 列中.

I removed, punctuations, ASCII and digits and have put this data in the cleansed_input column in df1.

df1 中的 cleansed_input 列需要与 df2Name_Extension 列进行核对.如果来自 cleansed_input 的值最后有来自 Name_Extension 的任何值,那么应该将其拆分并放入 df1<的 type_input 列/code> 不仅如此,而且还缩写.

That cleansed_input column in df1 needs to be checked with the Name_Extension column of df2. If the value from cleansed_input has any value from Name_Extension at the end then that should be split and put in type_input column of df1 and not just like that but abbreviated.

例如,如果 CO 出现在 cleansed_column 中,那么它应该缩写为 Company 并放入 type_input 列 和剩余的文本应该在 df1core_type 列中.也有优先权,不确定是否需要.

For example, if CO is present in cleansed_column then that should be abbreviated as Company and put in the type_input column and the remaining text should be in core_type column of df1. Also there is priority given, am not sure if thats needed.

预期输出:

Original_Input          Cleansed_Input        Core_Input       Type_input
TECHNOLOGIES S.A        TECHNOLOGIES SA       TECHNOLOGIES      SA
A & J INDUSTRIES, LLC   A J INDUSTRIES LLC    A J INDUSTRIES    LLC
A&S DENTAL SERVICES     AS DENTAL SERVICES      
A.M.G Médicale Inc      AMG Mdicale Inc       AMG Mdicale       Incorporated
AAREN SCIENTIFIC        AAREN SCIENTIFIC        

我尝试了很多方法,例如 isin、mask、contains 等,但不知道该放什么.

I tried many methods like with isin, mask, contains, etc but am not sure what to put in where.

我收到一个错误消息,指出系列是可变的,它们不能被散列".当我尝试使用数据框时,我不确定为什么会出现该错误.

I got an error that said "Series are mutable, they cannot be hashed". I was unsure of why I got that error when I was trying things with dataframe.

我没有那个代码,我正在使用 jupiter notebook 和 sql server,而 isin 似乎在 jupiter 中不起作用.

I am not having that code and am using jupiter notebook and sql server and isin doesn't seem to work in jupiter.

同样的,还有一个分割要完成.要拆分的 original_input 列是 parent_compnay 名称和别名.

Same way there is another split to be done. The original_input column to be split as parent_compnay name and alias name.

Here is my code:

import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re

#Connecting SQL with Python

server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'


engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'? 
driver=SQL+server')

#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size() 
#print(gp)

#1.Removing ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x:''.join(['' 
if ord(i) < 32 or ord(i) > 126 else i for i in x]))

#2.Removing punctuations
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda 
x:''.join([x.translate(str.maketrans('', '', string.punctuation))]))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i 
in x if i not in string.punctuation]))

#3.Removing numbers in a table.
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda 
x:x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i 
in x if i not in string.digits]))

#4.Removing trialing and leading spaces 
data['Cleansed_Input']=df['Cleansed_Input'].apply(lambda x: x.strip())

df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)


df2 = pd.DataFrame({ 
"Name_Extension": ["llc",
                   "Pvt ltd",
                   "Corp",
                   "CO Ltd",
                   "inc", 
                   "CO",
                   "SA"],
"Company_Type": ["Company LLC",
                 "Private Limited",
                 "Corporation",
                 "Company Limited",
                 "Incorporated",
                 "Company",
                 "Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})

data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)

推荐答案

以下是您可以实施的可能解决方案:

Here is a possible solution you can implement:

df = pd.DataFrame({
    "Original_Input": ["TECHNOLOGIES S.A", 
                       "A & J INDUSTRIES, LLC", 
                       "A&S DENTAL SERVICES", 
                       "A.M.G Médicale Inc", 
                       "AAREN SCIENTIFIC"],
    "Cleansed_Input": ["TECHNOLOGIES SA", 
                       "A J INDUSTRIES LLC", 
                       "AS DENTAL SERVICES", 
                       "AMG Mdicale Inc", 
                       "AAREN SCIENTIFIC"]
})

df_2 = pd.DataFrame({ 
    "Name_Extension": ["llc",
                       "Pvt ltd",
                       "Corp",
                       "CO Ltd",
                       "inc", 
                       "CO",
                       "SA"],
    "Company_Type": ["Company LLC",
                     "Private Limited",
                     "Corporation",
                     "Company Limited",
                     "Incorporated",
                     "Company",
                     "Anonymous Company"],
    "Priority": [2, 8, 4, 3, 5, 1, 9]
})

# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()

# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0]) 
                    for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
    .apply(lambda p: next(( priority 
                            for priority, extension in extensions_list 
                            if p.endswith(extension)), None))

# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of 
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")

# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
    lambda p: p["Cleansed_Input"] 
              if p["aux"] == 0 
              else p["Cleansed_Input"][:p["aux"]].strip(), 
    axis=1
)

# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]

我假设优先级"列具有唯一值.但是,如果不是这种情况,只需对优先级进行排序并根据该顺序创建索引,如下所示:

I assumed that the "Priority" column would have unique values. However, if this isn't the case, just sort the priorities and create an index based on that order like this:

df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))

此外,下次以任何人都可以轻松加载的格式给出数据示例.处理您发送的格式很麻烦.

Also, next time give the data example in a format that allows anyone to load easily. It was cumbersome to handle the format you sent.

EDIT:与问题无关,但可能会有所帮助.您可以使用以下内容简化从 1 到 4 的步骤:

EDIT: Not related with the question, but it might be of some help. You can simplify the steps from 1 to 4 with the following:

data['Cleansed_Input'] = data["Original_Input"] \
    .str.replace("[^\w ]+", "") \ # removes non-alpha characters
    .str.replace(" +", " ") \ # removes duplicated spaces
    .str.strip() # removes spaces before or after the string

EDIT 2:解决方案的 SQL 版本(我使用的是 PostgreSQL,但我使用的是标准 SQL 运算符,因此差异不应该太大).

EDIT 2: SQL version of the solution (I'm using PostgreSQL, but I used standard SQL operators, so the differences shouldn't be that huge).

SELECT t.Original_Name,
       t.Cleansed_Input,
       t.Name_Extension,
       t.Company_Type,
       t.Priority
FROM (
    SELECT df.Original_Name,
           df.Cleansed_Input,
           df_2.Name_Extension,
           df_2.Company_Type,
           df_2.Priority,
           ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
    FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
                 ('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
                 ('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
         LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
                           ('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
                           ('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
            ON  lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1

这篇关于根据条件在一个数据帧中拆分并替换为 pandas 中的另一个数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆