匹配两列之间的行以获得完全匹配或部分匹配 [英] Matching rows between two columns to get an exact or partial match

查看:31
本文介绍了匹配两列之间的行以获得完全匹配或部分匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在使用 power bi 编写查询以部分或完全匹配两行时遇到问题.除此之外,我正在尝试计算匹配百分比(如果找到)并将结果放入新列中.

实际的数据集包含很多行和表,但为了这个示例,我只使用了 4 列.

ID"和Text"列是唯一标识的.'KI ID' 和 'KI Test' 也是唯一的,但它们仅在匹配时才与列 'ID' 和 'Text' 相关.

我需要实现的是:

我想将文本"列中每一行的输入与KI 文本"列中的每一行进行匹配.如果有匹配,那么我想知道KI ID"和匹配百分比.查看数据集以获得更好的洞察力.

ps:这实际上可以通过 power query 实现,还是只是一种幻想,因为在我看来,我正在走向机器学习?

数据集

然后,仍然在同一个查询中工作,我将它分成两个新表:一个 ID 表和一个 KI ID 表.您可以通过选择现有的 ID 和 Text 列来制作第一个表,然后选择 Home > Remove Columns > Remove Other Columns;但要制作第二个表格,您需要使用公式栏.

在创建第二个表之前,我在第一个表(ID 表)中添加了一列,每行的数字为 1.我称该列 ID 匹配键.

然后,我从创建它的应用步骤中复制了第一个表的公式,以在公式栏中使用它来为 KI ID 创建第二个表.我对其进行了编辑,根据 KI ID 更改了列的名称:

然后我在第二个表(KI ID 表)中添加了每行中编号为 1 的列.我将该列称为 KI ID Match Key.

然后,我对刚刚创建的两个表(ID 表和 KI ID 表)进行了完全外部合并.为此,我首先使用 Home > Merge Queries 并将 Table1 与其自身合并.(用于匹配的列无关紧要,因为这是临时的.)我选择 Full Outer 作为 Join Kind.合并完成后,我在公式栏中编辑了合并,将表格更改为 #"Added ID Match Key" 和 #"Added KI ID Match Key"(恰好是这两个表格"的名称在我创建它们并添加它们的匹配键以在此合并中使用之后)和相应的匹配字段到ID 匹配键"和KI ID 匹配键":

合并后,我展开结果列:

然后我通过选择两列(通过单击一列,然后单击并按住 Ctrl)然后单击转换 > 替换值 > 在要查找的值"中键入 null 来将 Text 和 KI Text 列中的所有空值替换为空白文本" 框并将替换为"框留空,然后单击确定".

然后我添加了列以将每个 Text 和 KI Text 列单元格拆分为它们的单词列表.不过,在进行每次拆分之前,我首先过滤掉了除小写和大写文本 a 到 z、数字 0 到 9、空格和等号之外的所有内容.我使用 Text.SplitAny 在任何空格或等号处拆分单词:

然后我添加了一个列并确定了来自 Text 的哪些单词在 KI Text 中.我使用 List.Intersect 来做到这一点.它列出了重复项,这是我想要的:

然后我添加了一个列并执行相反的操作 - 确定 KI Text 中的哪些单词在 Text 中.同样,我使用了 List.Intersect:

然后我收集了计数信息.我创建了一个新列,其中包含我创建的每个列表中的单词数(所有出现).我对文本、KI 文本、KI 文本中的文本和文本中的 KI 文本分别执行了一次.我只是使用 List.Count 来做到这一点.这实际上没有必要作为具有单独列的单独步骤来执行.我这样做只是为了清楚地看到数字.我在这四列之后的两列中生成的 M 代码没有使用这四列.我也在接下来的两列中进行了计数:

接下来,我创建了两个新列来计算 KI 文本中的文本百分比和文本中 KI 文本的百分比.我使用了这些基本公式(希望是清晰的英文,而不是 M 代码):

  • % Text in KI Text = (来自 KI Text 中的 Text 的字数/KI 文本中的字数)* 100
  • % KI Text in Text = (Number of来自 KI 文本中的文本中的单词/文本中的单词数)* 100

    一定要检查我的数学.

最后,我删除了所有不再需要的列.

在您最初的三条评论之后,我在查询中添加了一些步骤来得到这个:

这是 M 代码我在您的三条评论后所做的编辑.有关更多上下文,请参阅我的评论.:

让源 = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],#"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", type text}, {"Text", type text}, {"KI ID", type text}, {"KI Text", type text}, {"结果", 输入文本}}),#"Made ID Table" = Table.SelectColumns(#"Changed Type",{"ID", "Text"}),#"Added ID Match Key" = Table.AddColumn(#"Made ID Table", "ID Match Key",各1),#"Made KI ID Table" = Table.SelectColumns(#"Changed Type",{"KI ID", "KI Text"}),#"添加KI ID匹配键" = Table.AddColumn(#"制作KI ID表", "KI ID匹配键",各1),#"Merged Queries" = Table.NestedJoin(#"Added ID Match Key", {"ID Match Key"}, #"Added KI ID Match Key", {"KI ID Match Key"}, "KI Table", JoinKind.FullOuter),#"Expanded KI Table" = Table.ExpandTableColumn(#"Merged Queries", "KI Table", {"KI ID Match Key", "KI ID", "KI Text"}, {"KI ID Match Key", "KI ID", "KI 文本"}),#"替换值" = Table.ReplaceValue(#"扩展 KI 表",null,"",Replacer.ReplaceValue,{"Text", "KI Text"}),#"Split Text to List" = Table.AddColumn(#"Replaced Value", "Text Listed", each Text.SplitAny(Text.Select([Text],{"a".."z","A".."Z","0".."9",",","="}),"=")),#"Split KI Text to List" = Table.AddColumn(#"Split Text to List", "KI Text Listed", each Text.SplitAny(Text.Select([KI Text],{"a".."z","A".."Z","0".."9",","="}),"=")),#"Got Text in KI Text" = Table.AddColumn(#"Split KI Text to List", "Text in KI Text", each List.Intersect({[KI Text Listed],[Text Listed]})),#"Got KI Text in Text" = Table.AddColumn(#"Got Text in KI Text", "KI Text in Text", each List.Intersect({[Text Listed], [KI Text Listed]})),//你实际上并不需要接下来的四行.我将它们包括在内,以便您可以看到表中的数字.它们在下面的两个 #"Calculated..." 行中计算.//如果您选择删除它们,您需要将第一行#"Calculated..." 中的 #"Got Count of KI Text in Text" 替换为 #"Got KI Text in Text".#"Got Count of Text" = Table.AddColumn(#"Got KI Text in Text", "Text Count", each List.Count([Text Listed])),#"Got Count of KI Text" = Table.AddColumn(#"Got Count of Text", "KI Text Count", each List.Count([KI Text Listed])),#"Got Count of Text in KI Text" = Table.AddColumn(#"Got Count of KI Text", "Text in KI Text Count", each List.Count([Text in KI Text])),#"Got Count of KI Text in Text" = Table.AddColumn(#"Got Count of Text in KI Text", "KI Text in Text Count", each List.Count([KI Text in Text])),#"Calculated % Text in KI Text" = Table.AddColumn(#"Got Count of KI Text in Text", "% Text in KI Text", each Number.Round((List.Count([Text in KI Text])/List.Count([KI Text Listed]))*100, 2), type number),#"Calculated % KI Text in Text" = Table.AddColumn(#"Calculated % Text in KI Text", "% KI Text in Text", each Number.Round((List.Count([KI Text in Text])/List.Count([Text Listed]))*100, 2), type number),#"Removed Other Columns" = Table.SelectColumns(#"Calculated % KI Text in Text",{"ID", "Text", "KI ID", "KI Text", "% Text in KI Text", "%KI 文本中的文本"}),////以下是我在您的三条评论后的编辑.#"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Outcome", each Text.From([#" % Text in KI Text"]) & "% match with " & [KI ID]),#"Grouped Rows" = Table.Group(#"Added Custom", {"ID"}, {{"AllData", each_, type table [ID=text, Text=text, KI ID=text, KI Text=text, #" % Text in KI Text"=number, #"% KI Text in Text"=number, Outcome=text]}}),#"Added Custom1" = Table.AddColumn(#"Grouped Rows", "Text", each [AllData][Text]{0}),#"Added Custom2" = Table.AddColumn(#"Added Custom1", "Outcome", each [AllData][Outcome]),#"Extracted Values" = Table.TransformColumns(#"Added Custom2", {"Outcome", each Text.Combine(List.Transform(_, Text.From), "#(cr)"), type text}),#"Removed Other Columns1" = Table.SelectColumns(#"Extracted Values",{"ID", "Text", "Outcome"})在#"删除其他列1"

I’m having a problem writing a query in power bi for matching two rows partially or completely. In addition to that I’m trying to calculate the match percentage if found and put the results in a new column.

The actual dataset contains a lot of rows and tables but for the sake of this example I’m using only 4 columns.

The columns 'ID' and 'Text' are uniquely identified. 'KI ID' and 'KI Test' are also unique but they are not related to columns 'ID' and 'Text' only when a match occurs.

What I need to implement is the following:

I would like to match the input of each row in the 'Text' column with each row in the 'KI text' column. If there is a match, then I would like to know the 'KI ID’ and the Match percentage. Take a look at the data set for a better insight.

ps: Is this actually achievable with power query or is it just a fantasy because in my perspective I’m heading towards machine learning, I think?

Data set https://drive.google.com/open?id=1JrsxPa6DICNi5N-tI5ESukh62W_uedaQ

enter image description here

The match calculation is based on the amount of words that occurs in both columns, 'Text' and 'KI Text'. for example, if one of the rows in the 'Text' column contains two sentences and these sentences partially matches with one of the 'KI Text' rows which has like 6 sentences in total. The match between the rows is partially so basically it should calculate it as 2/6 so it's like 33,3% match.

In addition, the 'KI Text' column contains a lot of rows that could passably match with one of the ‘Text' column rows. Only if it's greater or equals to 80% then it should show the results otherwise it’s not interesting.

解决方案

Hopefully, this is what you're looking for, or at least brings you closer to an answer.

While I believe your ID / Text and KI ID / KI Text data most likely come from two different tables, you presented them in one spreadsheet as your data set, so I started with that. The only content available to me in your spreadsheet was Sheet2. I simply copied and pasted the contents of your spreadsheet's Sheet2 to my own spreadsheet in Excel. I then used the default table name of Table1 to refer to it. I brought it in just like you presented it:

Then, still working within the same query, I separated it into two new tables: an ID table and a KI ID table. You can make the first table by selecting the existing ID and Text columns, then Home > Remove Columns > Remove Other Columns; but to make the second table, you would need to use the formula bar.

Before I made the second table, I added a column to the first table (the ID table) with the number 1 in every row. I called that column ID Match Key.

Then I copied the first table's formula from the applied step where I had created it, to use it in the formula bar to create the second table for KI ID. I edited it, changing the column's names as appropriate for KI ID:

Then I added a column with the number 1 in every row to the second table (the KI ID table). I called that column KI ID Match Key.

Then I did a full outer merge of the two tables that I had just made--the ID and KI ID tables. To do this, I first used Home > Merge Queries and merged Table1 with itself. (Which column to use for a match doesn't matter, because this is temporary.) I selected Full Outer as the Join Kind. Once the merge was done, I edited the merge in the formula bar, to change the tables to #"Added ID Match Key" and #"Added KI ID Match Key" (which happen to be the names of the those two "tables" after I created them and added their match keys to use in this merge) and the respective matching fields to "ID Match Key" and "KI ID Match Key":

After the merge, I expanded the resultant column:

Then I replaced all nulls in both the Text and KI Text columns with blank text by selecting both columns (by clicking one column, then clicking and holding Ctrl) then clicking Transform > Replace Values > typing null in the "Value To Find" box and leaving the "Replace With" box blank, and clicking OK.

Then I added columns to split each of the Text and KI Text columns cells into lists of their words. Before I did each split though, I first filtered out everything except lowercase and capital text a through z, numbers 0 through 9, spaces, and equal signs. I used Text.SplitAny to split the words at any spaces or equals signs:

Then I added a column and determined what words from Text were in KI Text. I used List.Intersect to do that. It lists the duplicates, which is something I wanted:

Then I added a column and did the reverse--determined what words from KI Text were in Text. Again, I used List.Intersect:

Then I collected count information. I created a new column with the count of the number of words (all occurrences) in each of the lists I had created. I did this once each for Text, KI Text, Text in KI Text, and KI Text in Text. I simply used List.Count to do this. This wasn't actually necessary to do as separate steps with separate columns. I just did it to clearly see the numbers. The M code I generated in the two columns that follow these four columns doesn't use these four columns. I did the counting in those next two columns as well:

Next, I created two new columns to calculate the % of Text in KI Text and the % of KI Text in Text. I used these basic formulas (presented here in, hopefully, clear english, versus the M code):

  • % Text in KI Text = (Number of words from Text that are in KI Text / Number of words in KI Text) * 100
  • % KI Text in Text = (Number of words from KI Text that are in Text / Number of words in Text) * 100

    Be sure to check my math.

Last, I removed all the columns I no longer wanted.

After your initial three comments, I added some steps to my query to get this:

Here's the M code with the edits that I made after your three comments. See my comments for more context.:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", type text}, {"Text", type text}, {"KI ID", type text}, {"KI Text", type text}, {"Outcome", type text}}),
    #"Made ID Table" = Table.SelectColumns(#"Changed Type",{"ID", "Text"}),
    #"Added ID Match Key" = Table.AddColumn(#"Made ID Table", "ID Match Key", each 1),
    #"Made KI ID Table" = Table.SelectColumns(#"Changed Type",{"KI ID", "KI Text"}),
    #"Added KI ID Match Key" = Table.AddColumn(#"Made KI ID Table", "KI ID Match Key", each 1),
    #"Merged Queries" = Table.NestedJoin(#"Added ID Match Key", {"ID Match Key"}, #"Added KI ID Match Key", {"KI ID Match Key"}, "KI Table", JoinKind.FullOuter),
    #"Expanded KI Table" = Table.ExpandTableColumn(#"Merged Queries", "KI Table", {"KI ID Match Key", "KI ID", "KI Text"}, {"KI ID Match Key", "KI ID", "KI Text"}),
    #"Replaced Value" = Table.ReplaceValue(#"Expanded KI Table",null,"",Replacer.ReplaceValue,{"Text", "KI Text"}),
    #"Split Text to List" = Table.AddColumn(#"Replaced Value", "Text Listed", each Text.SplitAny(Text.Select([Text],{"a".."z","A".."Z","0".."9"," ","="})," =")),
    #"Split KI Text to List" = Table.AddColumn(#"Split Text to List", "KI Text Listed", each Text.SplitAny(Text.Select([KI Text],{"a".."z","A".."Z","0".."9"," ","="})," =")),
    #"Got Text in KI Text" = Table.AddColumn(#"Split KI Text to List", "Text in KI Text", each List.Intersect({[KI Text Listed],[Text Listed]})),
    #"Got KI Text in Text" = Table.AddColumn(#"Got Text in KI Text", "KI Text in Text", each List.Intersect({[Text Listed], [KI Text Listed]})),
    //You don't actually need the next four lines. I included them so you can see the numbers in the table. They are calculated in the two #"Calculated..." lines below.
    //If you choose to remove them, you'll need to replace #"Got Count of KI Text in Text", in the first #"Calculated..." line, with #"Got KI Text in Text".
    #"Got Count of Text" = Table.AddColumn(#"Got KI Text in Text", "Text Count", each List.Count([Text Listed])),
    #"Got Count of KI Text" = Table.AddColumn(#"Got Count of Text", "KI Text Count", each List.Count([KI Text Listed])),
    #"Got Count of Text in KI Text" = Table.AddColumn(#"Got Count of KI Text", "Text in KI Text Count", each List.Count([Text in KI Text])),
    #"Got Count of KI Text in Text" = Table.AddColumn(#"Got Count of Text in KI Text", "KI Text in Text Count", each List.Count([KI Text in Text])),
    #"Calculated % Text in KI Text" = Table.AddColumn(#"Got Count of KI Text in Text", " % Text in KI Text", each Number.Round((List.Count([Text in KI Text])/List.Count([KI Text Listed]))*100, 2), type number),
    #"Calculated % KI Text in Text" = Table.AddColumn(#"Calculated % Text in KI Text", "% KI Text in Text", each Number.Round((List.Count([KI Text in Text])/List.Count([Text Listed]))*100, 2), type number),
    #"Removed Other Columns" = Table.SelectColumns(#"Calculated % KI Text in Text",{"ID", "Text", "KI ID", "KI Text", " % Text in KI Text", "% KI Text in Text"}),
    //
    //What follows is my edit after your three comments.
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Outcome", each Text.From([#" % Text in KI Text"]) & "% match with " & [KI ID]),
    #"Grouped Rows" = Table.Group(#"Added Custom", {"ID"}, {{"AllData", each _, type table [ID=text, Text=text, KI ID=text, KI Text=text, #" % Text in KI Text"=number, #"% KI Text in Text"=number, Outcome=text]}}),
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", "Text", each [AllData][Text]{0}),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "Outcome", each [AllData][Outcome]),
    #"Extracted Values" = Table.TransformColumns(#"Added Custom2", {"Outcome", each Text.Combine(List.Transform(_, Text.From), "#(cr)"), type text}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Extracted Values",{"ID", "Text", "Outcome"})
in
    #"Removed Other Columns1"

这篇关于匹配两列之间的行以获得完全匹配或部分匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆