LSTM project not compatible with CSV format
Problem description
I am trying to replicate Chevalier's LSTM Human Activity Recognition algorithm and ran into a problem when trying to use my own data in CSV format. The format used in the git repository is txt. My CSV data has the following format:
0.000995,8
0.020801,8
0.040977,8
0.060786,8
0.080970,8
... ...
The original file can be found here. The x-values (time) are in column 0 (-80.060003, etc.) and the y-values (value) are in column 1 (8, 8, etc.). I tried to use pandas
pandas.read_csv(DATASET_PATH + TRAIN + "data_train.csv", skiprows=1, header=None, sep=',', usecols=[0, 1])
but it does not seem to be compatible with the format of the data in the "Prepare Dataset" section (and possibly others as well):
TRAIN = "train/"
TEST = "test/"

# Load "X" (the neural network's training and testing inputs)
def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()
    return np.transpose(np.array(X_signals), (1, 2, 0))

X_train_signals_paths = [
    DATASET_PATH + TRAIN + "Inertial Signals/" + signal + "train.txt" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + "Inertial Signals/" + signal + "test.txt" for signal in INPUT_SIGNAL_TYPES
]
X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)

# Load "y" (the neural network's training and testing outputs)
def load_y(y_path):
    file = open(y_path, 'r')
    # Read dataset from disk, dealing with text file's syntax
    y_ = np.array(
        [elem for elem in [
            row.replace('  ', ' ').strip().split(' ') for row in file
        ]],
        dtype=np.int32
    )
    file.close()
    # Subtract 1 from each output class for friendly 0-based indexing
    return y_ - 1

y_train_path = DATASET_PATH + TRAIN + "y_train.txt"
y_test_path = DATASET_PATH + TEST + "y_test.txt"
y_train = load_y(y_train_path)
y_test = load_y(y_test_path)
This is what happens with my implementation in IPython 3:
In [0]:
TRAIN = "train/"
TEST = "test/"

def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        file = pandas.read_csv(DATASET_PATH + TRAIN + "data_train.csv", skiprows=1, header=None, sep=',', usecols=[0])
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                str(row).replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
    return np.transpose(np.array(X_signals), (1, 2, 0))

X_train_signals_paths = [
    DATASET_PATH + TRAIN + signal + "train.csv" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + signal + "test.csv" for signal in INPUT_SIGNAL_TYPES
]
X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
print(X_train, X_test)
Out[0]:
[[[ 0.]]] [[[ 0.]]]
I would appreciate some help with formatting my data properly so that it works seamlessly with this algorithm. If you have any questions, please let me know.
Recommended answer
The code in the trace differs from the code you actually posted in the question: the working code operates on a bare file handle, not a pandas DataFrame.
For reference, here again is the code from the project you are referring to:
def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # ^ the error comes where you have file = pandas.read_csv(...)
        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()
Here, file is just an iterator that returns one raw line (a sequence of characters ending with a newline) at a time; on this input, it makes sense to strip newlines and squeeze spaces. But your code already opens, parses, and reformats the contents of the file into a pandas DataFrame, which has no newlines or spaces, just the already-parsed numbers. Maybe fall back to the upstream code; or, if there is something you want to change in there, figure out how to ask about that change. There's nothing wrong with the CSV as such.
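To make the difference concrete, here is a small sketch of what a for loop over a pandas DataFrame actually yields. It assumes the CSV starts with a header row (invented here as "time,value", since the question's skiprows=1 suggests there is one) and reads from an in-memory string instead of data_train.csv. Iterating a DataFrame produces its column labels rather than its rows, which would be consistent with the lone 0 in the question's output.

```python
import io

import pandas as pd

# Two rows in the same "time,value" layout as the question's CSV.
# The header line is an assumption, added so skiprows=1 has something to skip.
csv_text = "time,value\n0.000995,8\n0.020801,8\n"

frame = pd.read_csv(
    io.StringIO(csv_text), skiprows=1, header=None, sep=",", usecols=[0]
)

# Iterating a DataFrame yields its column labels, not its rows.
labels = [str(label) for label in frame]
print(labels)

# The parsed numbers are reached through the column's values instead.
times = frame[0].to_numpy(dtype="float32")
print(times)
```

With header=None the only remaining column is labelled 0, so the loop in the question sees a single "row" containing the string "0" and never touches the measurements.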
Python has a quite capable csv module, so maybe simply use that instead of manually parsing the individual fields out of the CSV.
import csv  # at the top of the file

for signal_type_path in X_signals_paths:
    with open(signal_type_path, 'r') as csvfile:
        reader = csv.reader(csvfile)
        X_signals.append([np.array(row[0:2], dtype=np.float32) for row in reader])
Or, as a minimal change, split on commas instead of spaces. (From the look of your data, you then don't actually need to squeeze spaces at all.)
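A rough sketch of that minimal change, assuming comma-separated lines with no header row and no quoted fields. The helper name parse_rows is made up for illustration and is not part of the upstream project; the sample lines are taken from the question.

```python
import io

import numpy as np


def parse_rows(lines):
    """Turn 'time,value' text lines into float32 arrays, one array per row."""
    return [
        np.array(row.strip().split(","), dtype=np.float32)
        for row in lines
        if row.strip()  # ignore blank lines such as a trailing newline
    ]


# An in-memory stand-in for an opened CSV file handle.
sample = io.StringIO("0.000995,8\n0.020801,8\n0.040977,8\n")
rows = parse_rows(sample)
print(np.array(rows).shape)
```

The only difference from the upstream comprehension is the separator passed to split; the csv module remains the safer choice if the file might ever contain quoted fields.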
Also, tangentially, your code hardcodes the file it reads. It's probably better to keep the DATASET_PATH and TRAIN parameters entirely in the calling code, and have load_X simply accept a list of full file paths, which it uses without modifying them in any way.
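One way this separation could look, as a sketch rather than a drop-in replacement: load_X opens exactly the paths it is given, and the caller decides where the files live. A temporary directory stands in for DATASET_PATH + TRAIN here, and the file contents are the sample rows from the question; both are assumptions made only so the example runs on its own.

```python
import csv
import os
import tempfile

import numpy as np


def load_X(X_signals_paths):
    """Load one CSV file per signal; each path is used exactly as given."""
    X_signals = []
    for signal_type_path in X_signals_paths:
        with open(signal_type_path, "r", newline="") as csvfile:
            reader = csv.reader(csvfile)
            X_signals.append(
                [np.array(row[0:2], dtype=np.float32) for row in reader if row]
            )
    # Same axis reordering as the upstream loader: (rows, columns, signals)
    return np.transpose(np.array(X_signals), (1, 2, 0))


# The caller builds the full paths; load_X never touches DATASET_PATH or TRAIN.
with tempfile.TemporaryDirectory() as dataset_dir:
    train_path = os.path.join(dataset_dir, "data_train.csv")
    with open(train_path, "w", newline="") as f:
        f.write("0.000995,8\n0.020801,8\n0.040977,8\n")
    X_train = load_X([train_path])

print(X_train.shape)
```

Whether the resulting shape is what the network expects depends on how the series is split into windows, which this sketch does not address.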