Approach for grouping items based on name


Problem description

Hey, currently I have a program that reads in a CSV of a transaction history downloaded from internet banking, groups the items, and adds up totals for all the entries from a particular seller or other entity.

However, there are some problems with the way the names are listed in the CSV: some entries contain extra information, such as a location, and so they don't group because they are not an exact match.

E.g. *** and *** group, but *** malvern and *** ashburton do not, because of the extra information.

To try and fix this, I came up with the approach of finding matching substrings at the start of the entries and grouping them using the substring, so *** malvern and *** ashburton would group under ***.

I then encountered a problem where I had a few entries with coles ##### and a few with coles express ###. The coles express entries were grouping first and therefore would not group with coles.

I then changed it to group entries if their first word was the same, unless the first word was 'the'. This worked alright, except that entries like just jeans and just tools would group under 'just'.
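For illustration, the first-word rule described above might look like the following minimal Python sketch. The function name `group_by_first_word` is hypothetical, and the handling of entries starting with 'the' (keeping the full name as the key) is an assumption, since the post does not say what happens to them.

```python
from collections import defaultdict


def group_by_first_word(entries):
    """Group entries by their first word, unless that word is 'the'."""
    groups = defaultdict(list)
    for entry in entries:
        words = entry.lower().split()
        if not words:
            continue  # skip blank entries
        # Assumption: entries starting with "the" keep their full name as the key.
        key = entry.lower() if words[0] == "the" else words[0]
        groups[key].append(entry)
    return dict(groups)


groups = group_by_first_word(
    ["coles 1234", "coles express 567", "just jeans", "just tools"]
)
# "coles" entries merge as intended, but the two unrelated
# "just ..." shops also end up in a single "just" group.
```

This reproduces the unwanted behaviour: the key carries no information about whether the first word identifies the seller or is just a common leading word.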

I'm now out of ideas as to how to distinguish the entries to group from the entries not to group. Any suggestions would be awesome.

Thanks,
mata89

#Edit#
Thanks for the answer. The only problem is I don't see how that would avoid grouping entries such as "just jeans" and "just tools" into "just". Maybe I'm not understanding you completely; if so, could you elaborate?

Thanks anyway,
mata89

Recommended answer

One option might be to:

1) Scan all entries and create a word list from them.
2) Scan the word list for words that are in an exception list, e.g. 'the', 'and', and remove them.
3) Scan all the entries, running a contains check against each word in the word list; if an entry matches a word, group it under that word.
4) You could then either remove the entry so it doesn't get compared against the word list again, or allow different groupings for the same vendor (although this would rarely occur).
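The steps above could be sketched roughly as follows in Python. This is only one possible reading, not a definitive implementation: the names `EXCEPTION_WORDS`, `build_word_list` and `group_entries` are hypothetical, the word list is kept in first-seen order (the steps do not specify an order), and the contains check is done on whole words rather than raw substrings so that 'the' does not match inside 'other'. Putting 'just' in the exception list is an added assumption, shown as one way to keep "just jeans" and "just tools" apart.

```python
from collections import defaultdict

# Hypothetical exception list; "just" is an assumption added for the example.
EXCEPTION_WORDS = {"the", "and", "just"}


def build_word_list(entries, exceptions=EXCEPTION_WORDS):
    """Steps 1 and 2: collect words in first-seen order, skipping exceptions."""
    seen = set()
    words = []
    for entry in entries:
        for word in entry.lower().split():
            if word not in exceptions and word not in seen:
                seen.add(word)
                words.append(word)
    return words


def group_entries(entries, exceptions=EXCEPTION_WORDS):
    """Steps 3 and 4: group each entry under the first listed word it contains."""
    groups = defaultdict(list)
    remaining = list(entries)
    for word in build_word_list(entries, exceptions):
        # Whole-word variant of the "contains" check.
        matched = [e for e in remaining if word in e.lower().split()]
        if matched:
            groups[word].extend(matched)
            # Step 4: drop grouped entries so they are not compared again.
            remaining = [e for e in remaining if e not in matched]
    return dict(groups)


groups = group_entries(
    ["coles 1234", "coles express 567", "just jeans", "just tools"]
)
```

With this input, both coles entries land in the "coles" group, while "just jeans" and "just tools" fall under "jeans" and "tools" respectively. An entry made up only of exception words would stay ungrouped, and the result depends heavily on what goes into the exception list.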

