How to find a column set for a primary key candidate in a CSV file?
Problem description
I have a CSV file (not normalized; this is an example, the real file has up to 100 columns):
ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
1, CUST1, CLIENT1, 10, 2018-04-01, 2018-04-02
2, CUST1, CLIENT1, 10, 2018-04-01, 2018-05-30
3, CUST1, CLIENT1, 101, 2018-04-02, 2018-04-03
4, CUST2, CLIENT1, 102, 2018-04-02, 2018-04-03
How can I find all possible sets of columns that could be used as a primary key?
Desired output:
1) ID
2) PAYMENT_NUM,START_DATE,END_DATE
3) CUST_NAME, CLIENT_NAME, PAYMENT_NUM,START_DATE,END_DATE
I can do this in Java, but maybe Python/Pandas already provides a quick solution.
Recommended answer
pandas and itertools will give you what you're looking for.
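import pandas
from itertools import chain, combinations

def key_options(items):
    # every non-empty combination of column names, smallest sets first
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

# skipinitialspace drops the blank that follows each comma in the sample file
df = pandas.read_csv('test.csv', skipinitialspace=True)

# iterate over all combos of headings, excluding ID for brevity
for candidate in key_options(list(df)[1:]):
    deduped = df.drop_duplicates(list(candidate))
    # no rows dropped means the column set is unique across the file
    if len(deduped.index) == len(df.index):
        print(', '.join(candidate))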
For the sample data above, this gives the output:
CUST_NAME, END_DATE
PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, END_DATE
CUST_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, END_DATE
PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, START_DATE, END_DATE
CUST_NAME, PAYMENT_NUM, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
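Most of the sets listed above are supersets of a smaller unique set, and with n columns there are 2^n - 1 combinations to test, which is not practical for a file with close to 100 columns. The following is a minimal sketch, not part of the original answer, that reports only minimal candidate keys and optionally caps the set size. The names minimal_keys and max_size are hypothetical, and the sample rows are inlined with io.StringIO on the assumption that they match test.csv.

```python
import io
from itertools import combinations

import pandas as pd

# Sample rows from the question, inlined so the sketch runs without test.csv
SAMPLE = """ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
1, CUST1, CLIENT1, 10, 2018-04-01, 2018-04-02
2, CUST1, CLIENT1, 10, 2018-04-01, 2018-05-30
3, CUST1, CLIENT1, 101, 2018-04-02, 2018-04-03
4, CUST2, CLIENT1, 102, 2018-04-02, 2018-04-03
"""


def minimal_keys(df, columns, max_size=None):
    """Return unique column sets that contain no smaller unique set.

    minimal_keys and max_size are hypothetical names used for illustration.
    """
    columns = list(columns)
    limit = len(columns) if max_size is None else min(max_size, len(columns))
    found = []
    for r in range(1, limit + 1):
        for candidate in combinations(columns, r):
            # Skip supersets of a key that was already found
            if any(set(key) <= set(candidate) for key in found):
                continue
            # A key candidate has no duplicated rows on these columns
            if not df.duplicated(subset=list(candidate)).any():
                found.append(candidate)
    return found


df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True)
for key in minimal_keys(df, df.columns[1:]):  # exclude ID, as in the answer
    print(", ".join(key))
```

On the sample rows this prints CUST_NAME, END_DATE and PAYMENT_NUM, END_DATE. Passing df.columns instead of df.columns[1:] also reports ID. For wide files, a small max_size (for example 3 or 4) keeps the number of tested combinations manageable, at the cost of missing larger keys.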