How can I turn a record array from an arff file into an ndarray?

Question

The ARFF documentation tells me that my file is read as a record array, but I can't seem to convert it to an ndarray the way I would a normal record array. There should be 11055 examples with 31 features.

>>> dataset.shape
(11055,)
>>> dataset[0]
(b'1', b'1', b'1', b'1', b'1', b'-1', b'1', b'1', b'-1', b'1', b'1', b'1', b'1', b'0', b'0', b'-1', b'1', b'1', b'0', b'1', b'1', b'1', b'1', b'1', b'1', b'1', b'1', b'1', b'0', b'1', b'1')
>>> dataset.dtype
dtype([('having_IP_Address', 'S2'), ('URL_Length', 'S2'), ('Shortining_Service', 'S2'), ('having_At_Symbol', 'S2'), ('double_slash_redirecting', 'S2'), ('Prefix_Suffix', 'S2'), ('having_Sub_Domain', 'S2'), ('SSLfinal_State', 'S2'), ('Domain_registeration_length', 'S2'), ('Favicon', 'S2'), ('port', 'S2'), ('HTTPS_token', 'S2'), ('Request_URL', 'S2'), ('URL_of_Anchor', 'S2'), ('Links_in_tags', 'S2'), ('SFH', 'S2'), ('Submitting_to_email', 'S2'), ('Abnormal_URL', 'S2'), ('Redirect', 'S1'), ('on_mouseover', 'S2'), ('RightClick', 'S2'), ('popUpWidnow', 'S2'), ('Iframe', 'S2'), ('age_of_domain', 'S2'), ('DNSRecord', 'S2'), ('web_traffic', 'S2'), ('Page_Rank', 'S2'), ('Google_Index', 'S2'), ('Links_pointing_to_page', 'S2'), ('Statistical_report', 'S2'), ('Result', 'S2')])

Basically, I am trying to turn the record array stored in dataset into an ndarray and reshape it to match the vector dimensions. The problem seems to be that the ndarray I am left with is a list of objects with that long record dtype rather than a list of lists. I am just not sure how to convert that dtype into a list.

from scipy.io import arff
import urllib.request
import io
import numpy as np

# this just reads the arff from its URL 
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Training%20Dataset.arff"
ftpstream = urllib.request.urlopen(url)
dataset, meta = arff.loadarff(io.StringIO(ftpstream.read().decode('utf-8')))

num_features = len(meta.names())
num_examples = dataset.shape[0]
dataset.view(np.ndarray).reshape(num_examples, num_features)

The last line raises the error ValueError: cannot reshape array of size 11055 into shape (11055,31)
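The error can be reproduced on a small scale. The sketch below uses a hypothetical two-record array (the field names a, b, c and the values are made up for illustration, not taken from the dataset) to show that a structured array has a single axis with one element per record, so there are not enough elements to fill a 2-D shape:

```python
import numpy as np

# Toy stand-in for the loaded data: 2 records with 3 byte-string fields each.
# Field names and values are hypothetical.
records = np.array(
    [(b'1', b'-1', b'0'), (b'-1', b'1', b'1')],
    dtype=[('a', 'S2'), ('b', 'S2'), ('c', 'S1')],
)

print(records.shape)             # a single axis, one element per record
print(len(records.dtype.names))  # the fields live in the dtype, not in a second axis

try:
    # Same pattern as the failing line: 2 elements cannot fill a (2, 3) shape.
    records.view(np.ndarray).reshape(2, 3)
    reshape_error = None
except ValueError as exc:
    reshape_error = str(exc)

print(reshape_error)
```

Each record counts as one element no matter how many fields it has, which is why the size in the error message equals the number of examples.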

What I ultimately want is an ndarray with shape (11055, 31) and a numeric dtype.

You can find the data here. But here is what the file looks like:

@relation phishing

@attribute having_IP_Address  { -1,1 }
@attribute URL_Length   { 1,0,-1 }
@attribute Shortining_Service { 1,-1 }
@attribute having_At_Symbol   { 1,-1 }
@attribute double_slash_redirecting { -1,1 }
@attribute Prefix_Suffix  { -1,1 }
@attribute having_Sub_Domain  { -1,0,1 }
@attribute SSLfinal_State  { -1,1,0 }
@attribute Domain_registeration_length { -1,1 }
@attribute Favicon { 1,-1 }
@attribute port { 1,-1 }
@attribute HTTPS_token { -1,1 }
@attribute Request_URL  { 1,-1 }
@attribute URL_of_Anchor { -1,0,1 }
@attribute Links_in_tags { 1,-1,0 }
@attribute SFH  { -1,1,0 }
@attribute Submitting_to_email { -1,1 }
@attribute Abnormal_URL { -1,1 }
@attribute Redirect  { 0,1 }
@attribute on_mouseover  { 1,-1 }
@attribute RightClick  { 1,-1 }
@attribute popUpWidnow  { 1,-1 }
@attribute Iframe { 1,-1 }
@attribute age_of_domain  { -1,1 }
@attribute DNSRecord   { -1,1 }
@attribute web_traffic  { -1,0,1 }
@attribute Page_Rank { -1,1 }
@attribute Google_Index { 1,-1 }
@attribute Links_pointing_to_page { 1,0,-1 }
@attribute Statistical_report { -1,1 }
@attribute Result  { -1,1 }


@data
-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,-1,0,1,-1,1,1,-1,1,0,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,1,1,-1
1,0,1,1,1,-1,-1,-1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,1,-1,1,0,-1,-1
1,0,1,1,1,-1,-1,-1,1,1,1,-1,-1,0,0,-1,1,1,0,1,1,1,1,-1,-1,1,-1,1,-1,1,-1
1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1,1
-1,0,-1,1,-1,-1,1,1,-1,1,1,-1,1,0,0,-1,-1,-1,0,1,1,1,1,1,1,1,-1,1,-1,-1,1
1,0,-1,1,1,-1,-1,-1,1,1,1,1,-1,-1,0,-1,-1,-1,0,1,1,1,1,1,-1,-1,-1,1,0,-1,-1
1,0,1,1,1,-1,-1,-1,1,1,1,-1,-1,0,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,0,1,-1
1,0,-1,1,1,-1,1,1,-1,1,1,-1,1,0,1,-1,1,1,0,1,1,1,1,1,-1,1,1,1,0,1,1
1,1,-1,1,1,-1,-1,1,-1,1,1,1,1,0,1,-1,1,1,0,1,1,1,1,1,-1,0,-1,1,0,1,-1
1,1,1,1,1,-1,0,1,1,1,1,1,-1,0,0,-1,-1,-1,0,1,1,1,1,-1,1,1,1,1,-1,-1,1
1,1,-1,1,1,-1,1,-1,-1,1,1,1,1,-1,-1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,0,-1,-1
-1,1,-1,1,-1,-1,0,0,1,1,1,-1,-1,-1,1,-1,1,1,0,-1,1,-1,1,1,-1,-1,-1,1,0,1,-1
1,1,-1,1,1,-1,0,-1,1,1,1,1,-1,-1,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,1,1,-1
1,1,-1,1,1,1,-1,1,-1,1,1,-1,1,0,1,1,1,1,0,1,1,1,1,1,-1,1,-1,1,-1,1,1
1,-1,-1,-1,1,-1,0,0,1,1,1,1,-1,-1,0,-1,1,1,0,1,1,1,1,1,-1,-1,-1,1,0,1,-1
1,-1,-1,1,1,-1,1,1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,0,-1,1,1,-1,-1


Recommended answer

Looking at the file, we can see that all the fields are of categorical type, rather than numeric. Aside from that, your array is a regular ndarray with a complicated dtype. Since that's not something you can change, you will have to convert the structure and dtype of your array. The neatest approach (although not the most efficient) would be

dataset = np.array(dataset.tolist(), dtype=np.int8)

tolist will convert the array into a list of tuples, which the simple dtype int8 will then cause to be reassembled into a regular array.
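As a minimal sketch of that conversion, the example below applies the same call to a hypothetical two-record array (field names and values are made up for illustration). It also shows a column-by-column variant, which is an assumption of this example rather than something the answer proposes:

```python
import numpy as np

# Toy stand-in for the loaded data; field names and values are hypothetical.
records = np.array(
    [(b'1', b'-1', b'0'), (b'-1', b'1', b'1')],
    dtype=[('a', 'S2'), ('b', 'S2'), ('c', 'S1')],
)

# The approach from the answer: list of tuples, then a plain int8 array.
rows = records.tolist()
matrix = np.array(rows, dtype=np.int8)
print(matrix.shape, matrix.dtype)

# One possible alternative (not from the answer): cast each field separately
# and stack the resulting 1-D arrays as columns.
by_column = np.column_stack(
    [records[name].astype(np.int8) for name in records.dtype.names]
)
print(np.array_equal(matrix, by_column))
```

On the real data the same call should give the (11055, 31) shape asked for, provided every value parses as an integer in the int8 range, which holds for the -1, 0 and 1 categories shown in the file.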

This question is the basis for a related question on converting a numpy array of string fields to a numeric format.
