Parse a CSV file using Python (to make a decision tree later)


Question

First off, full disclosure: This is going towards a uni assignment, so I don't want to receive code. :). I'm more looking for approaches; I'm very new to python, having read a book but not yet written any code.

The entire task is to import the contents of a CSV file, create a decision tree from the contents of the CSV file (using the ID3 algorithm), and then parse a second CSV file to run against the tree. There's a big (understandable) preference to have it capable of dealing with different CSV files (I asked if we were allowed to hard code the column names, mostly to eliminate it as a possibility, and the answer was no).

The CSV files are in a fairly standard format; the header row is marked with a # then the column names are displayed, and every row after that is a simple series of values. Example:

# Column1, Column2, Column3, Column4
Value01, Value02, Value03, Value04
Value11, Value12, Value13, Value14

At the moment, I'm trying to work out the first part: parsing the CSV. To make the decisions for the decision tree, a dictionary structure seems like it's going to be the most logical; so I was thinking of doing something along these lines:

Read in each line, character by character
If the character is not a comma or a space
    Append character to temporary string
If the character is a comma
    Append the temporary string to a list
    Empty string
Once a line has been read
    Create a dictionary using the header row as the key (somehow!)
    Append that dictionary to a list

However, if I do things that way, I'm not sure how to make a mapping between the keys and the values. I'm also wondering whether there is some way to perform an action on every dictionary in a list, since I'll need to be doing things to the effect of "Everyone return their values for columns Column1 and Column4, so I can count up who has what!" - I assume that there is some mechanism, but I don't think I know how to do it.

Is a dictionary the best way to do it? Would I be better off doing things using some other data structure? If so, what?

Recommended answer

Python has some pretty powerful language constructs built in. You can read lines from a file like this:


with open(name_of_file, "r") as file:
    for line in file:
        pass  # process the line
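
As a minimal sketch of that loop in action: the helper name read_lines and the sample file contents below are made up for illustration. Note that each line read from a file still carries its trailing newline, so the sketch trims it.

```python
import os
import tempfile

def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    lines = []
    with open(path, "r") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
    return lines

# Write a small sample file so the sketch can run on its own.
# The contents are hypothetical, modelled on the format in the question.
sample = "# Column1, Column2\nValue01, Value02\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)

lines = read_lines(tmp.name)
os.remove(tmp.name)
print(lines)
```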

You can use the str.split method to separate a line at its commas, and str.strip to remove the whitespace around each resulting piece. Python has very powerful lists and dictionaries.
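
A rough sketch of how split and strip might combine on the sample lines from the question. The function names split_fields and parse_header are invented here, and the sketch assumes no field ever contains a comma of its own (quoted fields are not handled).

```python
def split_fields(line):
    """Split one line on commas and trim whitespace around each field."""
    fields = []
    for field in line.split(","):
        fields.append(field.strip())
    return fields

def parse_header(line):
    """Drop the leading '#' marker from the header row, then split it."""
    return split_fields(line.lstrip("#"))

header = parse_header("# Column1, Column2, Column3, Column4")
row = split_fields("Value01, Value02, Value03, Value04\n")
print(header)
print(row)
```

Because strip also removes newlines, the trailing "\n" that comes with a line read from a file disappears along with the spaces.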

To create a list, you simply use empty brackets like [], while to create an empty dictionary you use {}:


mylist = []  # Creates an empty list
mydict = {}  # Creates an empty dictionary

You can insert into a list with the .append() method, and into a dictionary with an index subscript. For example, mylist.append(5) adds 5 to the list, while mydict[key] = value associates the key key with the value value. To test whether a key is present in a dictionary, use the in keyword. For example:


if key in mydict:
    print("Present")
else:
    print("Absent")
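
The in test and subscript assignment are enough to tally how often each value occurs, which is the kind of counting the question asks about. This is only a sketch; the function name count_values and the sample values are hypothetical.

```python
def count_values(values):
    """Count how many times each value appears, using a plain dictionary."""
    counts = {}
    for value in values:
        if value in counts:
            counts[value] += 1   # seen before: bump the tally
        else:
            counts[value] = 1    # first sighting: start the tally
    return counts

counts = count_values(["yes", "no", "yes", "yes"])
print(counts)
```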

To iterate over the contents of a list or dictionary, you can simply use a for-loop as in:


for val in mylist:
    pass  # do something with val

for key in mydict:
    pass  # do something with key or with mydict[key]
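
If each row ends up as a dictionary keyed by column name, a plain for loop over the list of rows can collect one column's values from every row. The sketch below assumes that layout; column_values and the two sample rows are illustrative names and data, not part of the assignment.

```python
# Hypothetical rows, shaped like the sample data in the question.
rows = [
    {"Column1": "Value01", "Column4": "Value04"},
    {"Column1": "Value11", "Column4": "Value14"},
]

def column_values(rows, column):
    """Collect the value stored under `column` from every row that has it."""
    values = []
    for row in rows:
        if column in row:
            values.append(row[column])
    return values

print(column_values(rows, "Column1"))
print(column_values(rows, "Column4"))
```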

Since it is often necessary to have both the value and its index when iterating over a list, there is also a built-in function called enumerate that saves you the trouble of counting indices yourself:


for idx, val in enumerate(mylist):
    pass  # do something with val or with idx. Note that val == mylist[idx]

The code above is identical in function to:


idx = 0
for val in mylist:
    # process val, idx
    idx += 1

You could also iterate over the indices if you so chose:


for idx in range(len(mylist)):
    pass  # Do something with idx and possibly mylist[idx]

Also, you can get the number of elements in a list or the number of keys in a dictionary using len.
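
Index-based iteration is one way to map keys to values: the column name at position idx in the header goes with the value at position idx in the row. A minimal sketch, assuming the header and the row have already been split into lists; make_row is a made-up name, and the length check is just one possible way to treat malformed rows.

```python
def make_row(header, values):
    """Pair each column name with the value at the same position."""
    if len(header) != len(values):
        raise ValueError("row has %d fields, expected %d"
                         % (len(values), len(header)))
    row = {}
    for idx, name in enumerate(header):
        row[name] = values[idx]
    return row

header = ["Column1", "Column2", "Column3", "Column4"]
row = make_row(header, ["Value01", "Value02", "Value03", "Value04"])
print(row["Column1"], row["Column4"])
```

Since nothing here names a column explicitly, the same function works for any CSV file whose header row has been parsed into a list.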

It is possible to perform an operation on each element of a dictionary or list via the use of list comprehension; however, I would recommend that you simply use for-loops to accomplish that task. But, as an example:


>>> list1 = list(range(10))
>>> list1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list2 = [2*x for x in list1]
>>> list2
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
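
The same idea applies to a list of dictionaries. As a sketch, assuming rows are stored as dictionaries keyed by column name (the sample rows here are hypothetical), a comprehension can pull one column out of every row, optionally with a filter:

```python
# Hypothetical rows, shaped like the sample data in the question.
rows = [
    {"Column1": "Value01", "Column4": "Value04"},
    {"Column1": "Value11", "Column4": "Value14"},
]

# One value per row, in row order.
column1 = [row["Column1"] for row in rows]

# A comprehension can also filter: keep only rows whose Column4 matches.
matching = [row for row in rows if row["Column4"] == "Value14"]

print(column1)
print(len(matching))
```

An explicit for loop does the same job and may be easier to read while you are still getting used to the language.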

When you have the time, I suggest you read the Python tutorial to get some more in-depth knowledge.
