How can I make a unique list of cells?


Question

I have a txt file that looks like the one below (4 rows shown as an example). The strings in each row are separated by commas.

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

The file can be downloaded from https://gist.github.com/anonymous/cee79db7029a7d4e46cc4a7e92c59c50

I want to extract all the unique cells across all rows, giving the following output:

   India1
   India2
   myIndia
   Where
   Here
   India
   uyete
   AFD 
   TTT

I tried to read the file line by line and print it. My data is in a file called df.txt:

with open("df.txt") as myfile:
    for line in myfile:
        print(line)    # print the current line, not the whole list

Recommended answer

Option 1: .csv, .txt Files

Python's standard library cannot read .xls files. If you convert your file(s) to .csv or .txt, you can use the csv module from the standard library:

# `csv` module, Standard Library
import csv

filepath = "./test.csv"

with open(filepath, "r") as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)                                  # skip 'A', 'B'
    items = set()
    for line in reader:
        line = [word.replace(" ", "") for word in line if word]
        line = filter(str.strip, line)
        items.update(line)

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
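The set used above discards order, while the output requested in the question lists each cell in the order it first appears. A minimal sketch of an order-preserving variant follows. It assumes Python 3.7+ (where dicts keep insertion order) and that each line may be wrapped in double quotes as in the question's sample. The function name unique_cells is hypothetical, not part of the original answer.

```python
def unique_cells(lines, delimiter=","):
    """Return the cells of all lines without duplicates, in first-seen order."""
    seen = {}  # dict keys act as an ordered set (Python 3.7+)
    for line in lines:
        # Assumption: a line may be wrapped in double quotes, as in the sample.
        line = line.strip().strip('"')
        for cell in line.split(delimiter):
            cell = cell.strip()  # drop padding such as "myIndia     "
            if cell:  # skip empty cells
                seen.setdefault(cell, None)
    return list(seen)


sample = [
    '"India1,India2,myIndia     "',
    '"Where,Here,Here   "',
    '"Here,Where,India,uyete"',
    '"AFD,TTT"',
]
print(unique_cells(sample))
```

Because the function accepts any iterable of strings, an open file object can be passed in directly instead of the sample list.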


Option 2: .xls, .xlsx Files

If you want to keep the original Excel format, you have to install a third-party module to handle Excel files. Note that xlrd 2.0 and later reads only .xls files; .xlsx files need a different library such as openpyxl.

Install xlrd from the command prompt:

pip install xlrd

In Python:

# `xlrd` module, third-party
import itertools
import xlrd

filepath = "./test.xls"

with xlrd.open_workbook(filepath) as workbook:
    worksheet = workbook.sheet_by_index(0)                 # assumes first sheet
    rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
    cells = itertools.chain.from_iterable(rows)
    items = list({val.replace(" ", "") for val in cells if val})

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
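One caveat with the set comprehension above: to my knowledge, xlrd returns numeric cells as floats, so val.replace would raise an AttributeError on a sheet that mixes text and numbers. The sketch below shows one way to guard against that. It works on plain lists of row values, so it can be tried without an Excel file. The helper name unique_strings is hypothetical, and silently skipping non-text cells is an assumption about what is wanted here.

```python
import itertools


def unique_strings(rows):
    """Collect the unique text cells from an iterable of rows, ignoring non-text values."""
    cells = itertools.chain.from_iterable(rows)  # flatten rows into one stream of cells
    return {
        cell.replace(" ", "")
        for cell in cells
        if isinstance(cell, str) and cell.strip()  # skip numbers and blank cells
    }


# Stand-in for worksheet.row_values(i): text, padded text, an empty cell, a number.
rows = [["AFD", " TTT ", ""], [42.0, "AFD", "uy ete"]]
print(sorted(unique_strings(rows)))
```

If numeric cells should be kept as well, converting them with str(cell) before cleaning would be an alternative to skipping them.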


Option 3: DataFrames

You can handle csv and text files with pandas DataFrames. See the pandas documentation for other formats.

import pandas as pd
import numpy as np

# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
filepath = "./test2.txt"

# Determines columns from the first line, so add commas in text file, else may throw an error
# on_bad_lines="skip" needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv(filepath, sep=",", header=None, on_bad_lines="skip")
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
stack = df.stack()
clean_df = pd.Series(stack.unique())
clean_df

DataFrame Output

0     India1
1     India2
2    myIndia
3      Where
4       Here
5      India
6      uyete
7        AFD
8        TTT
dtype: object

Save as a File

# Save as .txt or .csv without index, optional

# target = "./output.csv"
target = "./output.txt"
clean_df.to_csv(target, index=False)

Note: Results from options 1 and 2 can also be converted to an unordered pandas Series with pd.Series(list(items)).

Finally: As a Script

Wrap any of the three options above in a function (stack) within a file (named restack.py), and save this script to a directory.

# restack.py
import pandas as pd
import numpy as np

def stack(filepath, save=False, target="./output.txt"):
    # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8

    # Determines columns from the first line, so add commas in text file, else may throw an error
    # on_bad_lines="skip" needs pandas >= 1.3; older versions used error_bad_lines=False
    df = pd.read_csv(filepath, sep=",", header=None, on_bad_lines="skip")
    df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
    stack = df.stack()
    clean_df = pd.Series(stack.unique())

    if save:
        clean_df.to_csv(target, index=False)
        print("Your results have been saved to '{}'".format(target))

    return clean_df

if __name__ == "__main__":
    # Set up input prompts
    msg1 = "Enter path to input file e.g. ./test.txt: "
    msg2 = "Save results to a file? y/[n]: "

    try:
        # Python 2
        fp = raw_input(msg1)
        result = raw_input(msg2)
    except NameError:
        # Python 3
        fp = input(msg1)
        result = input(msg2)

    save = result.startswith("y")

    print(stack(fp, save=save))

From the script's directory, run it from the command line and answer the prompts:

> python restack.py 

Enter path to input file e.g. ./test.txt: ./@data/test2.txt
Save results to a file? y/[n]: y
Your results have been saved to './output.txt'

Your results should print in your console and, optionally, be saved to the file output.txt. Adjust any parameters to suit your needs.
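If interactive prompts are inconvenient (for example, when calling the script from another program), command-line arguments are one alternative. The following is a rough, standard-library-only sketch, not the original answer's script. The names unique_cells_in_file and restack_cli.py and the --save flag are all hypothetical, and it assumes a comma-separated text file whose lines may be wrapped in double quotes. The final lines simulate a command-line call on a temporary file so the block runs as written.

```python
import argparse
import os
import tempfile


def unique_cells_in_file(path, delimiter=","):
    """Return the unique cells of a delimited text file in first-seen order."""
    seen = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumption: lines may be wrapped in double quotes, as in the question.
            for cell in line.strip().strip('"').split(delimiter):
                cell = cell.strip()
                if cell:
                    seen.setdefault(cell, None)
    return list(seen)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Print the unique cells of a text file.")
    parser.add_argument("filepath", help="path to the input file")
    parser.add_argument("--save", metavar="TARGET", help="optional output file")
    args = parser.parse_args(argv)

    cells = unique_cells_in_file(args.filepath)
    print("\n".join(cells))
    if args.save:
        with open(args.save, "w", encoding="utf-8") as out:
            out.write("\n".join(cells) + "\n")
    return cells


# Demo: simulate `python restack_cli.py <file>` on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("AFD,TTT\nTTT,uyete\n")
    demo_result = main([demo_path])
```

In a real script, the demo at the bottom would be replaced by a call to main() under an `if __name__ == "__main__":` guard, so that the arguments come from the command line.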
