匹配两列之间的行以获得完全或部分匹配 [英] Matching rows between two columns to get an exact or partial match

查看:130
本文介绍了匹配两列之间的行以获得完全或部分匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在使用power bi编写查询时遇到了问题,该查询可以部分或完全匹配两行。除此之外,我还要尝试计算匹配百分比(如果找到)并将结果放入新列中。



实际数据集包含很多行和表,但是对于此示例,我仅使用4列。



ID和 Text列是唯一标识的。 'KI ID'和'KI Test'也是唯一的,但它们仅在发生匹配时才与列'ID'和'Text'相关。



我需要实现以下内容:



我想将文本列中每一行的输入与 KI文字列。如果有匹配项,那么我想知道 KI ID和匹配项百分比。看一下数据集以获得更好的洞察力。



ps:这实际上是用功率查询可以实现的,还是只是一种幻想,因为我认为我正在走向机器学习?



数据集



<然后,仍在同一个查询中工作,我将其分为两个新表:一个ID表和一个KI ID表。您可以通过选择现有的ID和Text列来创建第一个表,然后依次选择Home> Remove Columns> Remove Other Columns;但是要制作第二张表,您需要使用编辑栏。





在制作第二张表格之前,我在第一个表(ID表)中添加了一个列,每行中的数字为1。我叫该列ID匹配键。





然后,我从创建它的应用步骤中复制了第一个表的公式,以在公式栏中使用它来为KI ID创建第二个表。我对其进行了编辑,并根据KI ID更改了列的名称:





然后,我在第二个表(KI ID表)中添加了一行,每行的编号为1。我称该列为KI ID匹配键。





然后,我对刚才创建的两个表(ID和KI ID表)进行了完全外部合并。为此,我首先使用Home> Merge Queries并将Table1与自身合并。 (用于匹配的列无关紧要,因为这是临时的。)我选择完全外部作为连接类型。合并完成后,我在编辑栏中编辑了合并,将表更改为# Added ID Match Key和# Added KI ID Match Key(恰好是这两个表的名称)在我创建它们并将它们的匹配键添加到此合并中之后),并将相应的匹配字段分别添加到 ID匹配键和 KI ID匹配键中:



< a href = https://i.stack.imgur.com/Xw6cC.png rel = nofollow noreferrer>



合并后,我扩展了结果列:





然后,我通过选择两列(用单击一列,然后单击并按住Ctrl),然后单击变换>替换值>在要查找的值框,将替换为框留空,然后单击确定。



然后,我添加了列以拆分每个文本和 KI文本列单元格变成单词列表。在进行每次拆分之前,我首先过滤掉所有内容,除了小写和大写文本a到z,数字0到9,空格和等号。我使用Text.SplitAny在任何空格或等号处分割单词:







然后我添加了一个列并确定了Text中的哪些单词在KI Text中。我使用List.Intersect来做到这一点。它列出了重复项,这是我想要的:





然后我添加一列,然后做相反的操作-确定KI Text中的单词在Text中是什么。再次,我使用List.Intersect:





然后,我收集了计数信息。我创建了一个新列,其中包含我创建的每个列表中的单词数(所有出现的次数)。我对文本,KI文本,KI文本中的文本和KI文本中的文本分别执行了一次。我只是使用List.Count来做到这一点。实际上,这不需要在单独的列中执行单独的步骤。我只是为了清楚地看到数字。我在这四列后面的两列中生成的M代码没有使用这四列。我也在接下来的两列中进行了计数:





接下来,我创建了两个新列来计算KI文本中的Text百分比和Text中的KI Text的百分比。我使用了这些基本公式(希望在这里以清晰的英语显示,与M代码相对应):




  • %KI文本= KI文本中的文本中的单词数/
    KI文本中的单词数)* 100

  • %KI文本中的文本=($ b $的数量来自KI文本的b个单词(以文本形式显示/单词数))* 100



    请务必检查我的数学。






最后,我删除了所有不再需要的列。





在您发表最初的三个评论后,我在查询中添加了一些步骤以获取此信息:





这是M代码,其中包含我在您的三个评论后所做的修改。请参阅我的评论以获取更多上下文。

  let 
Source = Excel.CurrentWorkbook(){[Name = Table1 ]} [内容],
#更改的类型 = Table.TransformColumnTypes(来源,{{ ID,键入文本},{文本,键入文本},{ KI ID,键入文本},{ KI文本,键入文本},{结果,键入文本}}),
# Made ID Table = Table.SelectColumns(# Changed Type,{ ID, Text}),
#添加的ID匹配密钥 = Table.AddColumn(# Made ID表, ID匹配密钥,每个1),
# Make KI ID表 = Table.SelectColumns(# Changed Type,{ KI ID, KI Text}),
# Added KI ID Match Key = Table.AddColumn(# Made KI ID Table, KI ID匹配密钥,每个1),
#合并查询 = Table.NestedJoin(#添加ID匹配密钥,{ ID匹配密钥},#添加KI ID匹配密钥,{ KI ID匹配键}, KI表,JoinKind.FullOuter),
# Expanded KI Table = Table.ExpandTableColumn(# Merged Queries, KI Table,{ KI ID匹配键 , KI ID, KI文字},{ KI ID匹配密钥, KI ID, KI Text}),
#替换值 = Table.ReplaceValue(# Expanded KI Table,null,,Replacer.ReplaceValue,{ Text, KI Text}),
#将文本拆分为列表 = Table.AddColumn(#替换值,列出的文本,每个Text.SplitAny(Text.Select([Text],{ a .. z, A .. Z, 0 .. 9,, =}), =)),
#将KI文本拆分为列表 = Table.AddColumn(# 要列出的文本,列出的KI文本,每个Text.SplitAny(Text.Select([KI文本],{ a .. z, A .. Z, 0 .. 9,, =}), =)),
# KI文本中有文本 = Table.AddColumn(#将KI文本拆分为列表, KI文本中的文本 ,每个List.Intersect({[已列出KI文本],[已列出文本]})),
#文本中有KI文本 = Table.AddColumn(# KI文本中有文本, KI文本in Text中,每个List.Intersect({[[Text Listed],[KI Text Listed]})),
//您实际上不需要接下来的四行。我将它们包括在内,以便您可以在表中看到数字。它们在下面的两个计算所得...行中进行计算。
//如果选择删除它们,则需要在第一行 Calculated ...行中将 KI文本中的KI文本计数替换为在文本中的KI文本 。
#文本的计数 = Table.AddColumn(#文本中的KI文本,文本计数,每个List.Count([列出的文本])),
#文本计数的KI文本 = Table.AddColumn(#文本的计数, KI文本计数,每个List.Count([列出的KI文本])),
# KI文本中的文本计数 = Table.AddColumn(# KI文本的位数, KI文本数中的文本,每个List.Count([KI文本中的文本])),
# KI文本中的文本数 = Table.AddColumn(# KI文本中的文本计数, KI文本中的文本计数,每个List.Count([KI文本中的文本])),
# KI文本中的已计算百分比文本 = Table.AddColumn(# KI文本中的文本计数, KI文本中的%文本,每个Number.Round((List.Count([KI文本中的文本])/ List.Count([KI文本列出的)))* 100,2),键入数字),
#计算出的文本中的KI文本百分比 = Table.AddColumn(#计算出的文本中的KI文本百分比,文本中的KI的文本百分比,每个Number.Round((List.Count([KI Text in Text])/ List.Count([Text Listed]]))* 100,2),键入数字),
#已删除其他公司lumns = Table.SelectColumns(#计算出的文本中的KI文本百分比,{ ID,文本, KI ID, KI文本, KI文本中的百分比文本,%KI文本中的百分比 }),
//
//接下来是我在发表三个评论后的修改。
#已添加自定义 = Table.AddColumn(#已删除其他列,结果,每个Text.From([[# KI文本中的%文本]])&%与& [KI ID]),
#分组的行 = Table.Group(#添加的自定义,{ ID},{{ AllData,每个_,类型表[ID = text,Text =文本,KI ID =文本,KI文本=文本,#%KI文本中的文本 =数字,#%KI文本中的文本 =数字,结果=文本]}}),
#添加了Custom1 = Table.AddColumn(#分组的行,文本,每个[AllData] [Text] {0}),
# Added Custom2 = Table.AddColumn(# Added Custom1,结果,每个[AllData] [结果]),
#提取的值 = Table.TransformColumns(# Added Custom2,{结果,每个Text.Combine(List.Transform(_,Text.From ),#(cr)),键入文本}),
#已删除其他列1 = Table.SelectColumns(#提取值,{ ID,文本,结果})
中的
#已删除其他列1


I’m having a problem writing a query in power bi for matching two rows partially or completely. In addition to that I’m trying to calculate the match percentage if found and put the results in a new column.

The actual dataset contains a lot of rows and tables but for the sake of this example I’m using only 4 columns.

The columns 'ID' and 'Text' are uniquely identified. 'KI ID' and 'KI Test' are also unique but they are not related to columns 'ID' and 'Text' only when a match occurs.

What I need to implement is the following:

I would like to match the input of each row in the 'Text' column with each row in the 'KI text' column. If there is a match, then I would like to know the 'KI ID’ and the Match percentage. Take a look at the data set for a better insight.

ps: Is this actually achievable with power query or is it just a fantasy because in my perspective I’m heading towards machine learning, I think?

Data set https://drive.google.com/open?id=1JrsxPa6DICNi5N-tI5ESukh62W_uedaQ

enter image description here

The match calculation is based on the amount of words that occurs in both columns, 'Text' and 'KI Text'. for example, if one of the rows in the 'Text' column contains two sentences and these sentences partially matches with one of the 'KI Text' rows which has like 6 sentences in total. The match between the rows is partially so basically it should calculate it as 2/6 so it's like 33,3% match.

In addition, the 'KI Text' column contains a lot of rows that could passably match with one of the ‘Text' column rows. Only if it's greater or equals to 80% then it should show the results otherwise it’s not interesting.

解决方案

Hopefully, this is what you're looking for, or at least brings you closer to an answer.

While I believe your ID / Text and KI ID / KI Text data most likely come from two different tables, you presented them in one spreadsheet as your data set, so I started with that. The only content available to me in your spreadsheet was Sheet2. I simply copied and pasted the contents of your spreadsheet's Sheet2 to my own spreadsheet in Excel. I then used the default table name of Table1 to refer to it. I brought it in just like you presented it:

Then, still working within the same query, I separated it into two new tables: an ID table and a KI ID table. You can make the first table by selecting the existing ID and Text columns, then Home > Remove Columns > Remove Other Columns; but to make the second table, you would need to use the formula bar.

Before I made the second table, I added a column to the first table (the ID table) with the number 1 in every row. I called that column ID Match Key.

Then I copied the first table's formula from the applied step where I had created it, to use it in the formula bar to create the second table for KI ID. I edited it, changing the column's names as appropriate for KI ID:

Then I added a column with the number 1 in every row to the second table (the KI ID table). I called that column KI ID Match Key.

Then I did a full outer merge of the two tables that I had just made--the ID and KI ID tables. To do this, I first used Home > Merge Queries and merged Table1 with itself. (Which column to use for a match doesn't matter, because this is temporary.) I selected Full Outer as the Join Kind. Once the merge was done, I edited the merge in the formula bar, to change the tables to #"Added ID Match Key" and #"Added KI ID Match Key" (which happen to be the names of the those two "tables" after I created them and added their match keys to use in this merge) and the respective matching fields to "ID Match Key" and "KI ID Match Key":

After the merge, I expanded the resultant column:

Then I replaced all nulls in both the Text and KI Text columns with blank text by selecting both columns (by clicking one column, then clicking and holding Ctrl) then clicking Transform > Replace Values > typing null in the "Value To Find" box and leaving the "Replace With" box blank, and clicking OK.

Then I added columns to split each of the Text and KI Text columns cells into lists of their words. Before I did each split though, I first filtered out everything except lowercase and capital text a through z, numbers 0 through 9, spaces, and equal signs. I used Text.SplitAny to split the words at any spaces or equals signs:

Then I added a column and determined what words from Text were in KI Text. I used List.Intersect to do that. It lists the duplicates, which is something I wanted:

Then I added a column and did the reverse--determined what words from KI Text were in Text. Again, I used List.Intersect:

Then I collected count information. I created a new column with the count of the number of words (all occurrences) in each of the lists I had created. I did this once each for Text, KI Text, Text in KI Text, and KI Text in Text. I simply used List.Count to do this. This wasn't actually necessary to do as separate steps with separate columns. I just did it to clearly see the numbers. The M code I generated in the two columns that follow these four columns doesn't use these four columns. I did the counting in those next two columns as well:

Next, I created two new columns to calculate the % of Text in KI Text and the % of KI Text in Text. I used these basic formulas (presented here in, hopefully, clear english, versus the M code):

  • % Text in KI Text = (Number of words from Text that are in KI Text / Number of words in KI Text) * 100
  • % KI Text in Text = (Number of words from KI Text that are in Text / Number of words in Text) * 100

    Be sure to check my math.

Last, I removed all the columns I no longer wanted.

After your initial three comments, I added some steps to my query to get this:

Here's the M code with the edits that I made after your three comments. See my comments for more context.:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", type text}, {"Text", type text}, {"KI ID", type text}, {"KI Text", type text}, {"Outcome", type text}}),
    #"Made ID Table" = Table.SelectColumns(#"Changed Type",{"ID", "Text"}),
    #"Added ID Match Key" = Table.AddColumn(#"Made ID Table", "ID Match Key", each 1),
    #"Made KI ID Table" = Table.SelectColumns(#"Changed Type",{"KI ID", "KI Text"}),
    #"Added KI ID Match Key" = Table.AddColumn(#"Made KI ID Table", "KI ID Match Key", each 1),
    #"Merged Queries" = Table.NestedJoin(#"Added ID Match Key", {"ID Match Key"}, #"Added KI ID Match Key", {"KI ID Match Key"}, "KI Table", JoinKind.FullOuter),
    #"Expanded KI Table" = Table.ExpandTableColumn(#"Merged Queries", "KI Table", {"KI ID Match Key", "KI ID", "KI Text"}, {"KI ID Match Key", "KI ID", "KI Text"}),
    #"Replaced Value" = Table.ReplaceValue(#"Expanded KI Table",null,"",Replacer.ReplaceValue,{"Text", "KI Text"}),
    #"Split Text to List" = Table.AddColumn(#"Replaced Value", "Text Listed", each Text.SplitAny(Text.Select([Text],{"a".."z","A".."Z","0".."9"," ","="})," =")),
    #"Split KI Text to List" = Table.AddColumn(#"Split Text to List", "KI Text Listed", each Text.SplitAny(Text.Select([KI Text],{"a".."z","A".."Z","0".."9"," ","="})," =")),
    #"Got Text in KI Text" = Table.AddColumn(#"Split KI Text to List", "Text in KI Text", each List.Intersect({[KI Text Listed],[Text Listed]})),
    #"Got KI Text in Text" = Table.AddColumn(#"Got Text in KI Text", "KI Text in Text", each List.Intersect({[Text Listed], [KI Text Listed]})),
    //You don't actually need the next four lines. I included them so you can see the numbers in the table. They are calculated in the two #"Calculated..." lines below.
    //If you choose to remove them, you'll need to replace #"Got Count of KI Text in Text", in the first #"Calculated..." line, with #"Got KI Text in Text".
    #"Got Count of Text" = Table.AddColumn(#"Got KI Text in Text", "Text Count", each List.Count([Text Listed])),
    #"Got Count of KI Text" = Table.AddColumn(#"Got Count of Text", "KI Text Count", each List.Count([KI Text Listed])),
    #"Got Count of Text in KI Text" = Table.AddColumn(#"Got Count of KI Text", "Text in KI Text Count", each List.Count([Text in KI Text])),
    #"Got Count of KI Text in Text" = Table.AddColumn(#"Got Count of Text in KI Text", "KI Text in Text Count", each List.Count([KI Text in Text])),
    #"Calculated % Text in KI Text" = Table.AddColumn(#"Got Count of KI Text in Text", " % Text in KI Text", each Number.Round((List.Count([Text in KI Text])/List.Count([KI Text Listed]))*100, 2), type number),
    #"Calculated % KI Text in Text" = Table.AddColumn(#"Calculated % Text in KI Text", "% KI Text in Text", each Number.Round((List.Count([KI Text in Text])/List.Count([Text Listed]))*100, 2), type number),
    #"Removed Other Columns" = Table.SelectColumns(#"Calculated % KI Text in Text",{"ID", "Text", "KI ID", "KI Text", " % Text in KI Text", "% KI Text in Text"}),
    //
    //What follows is my edit after your three comments.
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Outcome", each Text.From([#" % Text in KI Text"]) & "% match with " & [KI ID]),
    #"Grouped Rows" = Table.Group(#"Added Custom", {"ID"}, {{"AllData", each _, type table [ID=text, Text=text, KI ID=text, KI Text=text, #" % Text in KI Text"=number, #"% KI Text in Text"=number, Outcome=text]}}),
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", "Text", each [AllData][Text]{0}),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "Outcome", each [AllData][Outcome]),
    #"Extracted Values" = Table.TransformColumns(#"Added Custom2", {"Outcome", each Text.Combine(List.Transform(_, Text.From), "#(cr)"), type text}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Extracted Values",{"ID", "Text", "Outcome"})
in
    #"Removed Other Columns1"

这篇关于匹配两列之间的行以获得完全或部分匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆