得到一个真正的子矩阵 [英] getting a submatrix of all true

查看:64
本文介绍了得到一个真正的子矩阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述



我有一个较大的数据集(1000个观察值×100个浮点数/ b $ b变量),并且缺少一些数据。我想尝试一下

各种聚类,神经网络等数据算法,

并保持生命简单我想减少矩阵的维数

因此我没有缺失值,因为并非所有算法都能够处理它们并且

变量中有足够的冗余我可以承受失去一些。


我目前正在使用一个有效的黑客,但它让我想知道是否有一个最佳的解决方案。我将最佳定义为删除行

和列,使得没有缺失值和

max(numRows * numCols)。


我目前的方法是删除超过

某些预定数量的缺失变量的行(观察值),然后删除

列(变量)减少了数据集,其中包含任何缺失的
值。我通过观察每次观察失踪变量数量的分配来选择下降行的阈值,在分布的低端选择一个

数字,然后放弃

超过阈值的行。


制定问题的另一种方法:对于稀疏布尔矩阵

(在True上稀疏) ),删除行和列的最佳方法是什么?
这样矩阵中的元素总数是最大的,并且

没有剩下的真值。

例如:


0 0 0

0 0 0候选子矩阵有12个元素

0 0 0

0 0 0


1 0 0 0 1

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0候选子矩阵有15个元素

0 0 0 0 0 0 0 0 0 0

0 0 1 0 0

0 0

0 0候选子矩阵有8个元素

0 0

0 0


我想要programatica lly提取15元素矩阵


按照上面描述的方法,我在下面的例子中获得了所需的答案,尽管这是一个黑客解决方案而且我有

感觉有一个更好的。


来自数字导入非零,数组,取,总和


X =数组([[1,0,0,0,1],

[0,0,0,0,0],

[0,0 ,0,0,0],

[0,0,0,0,0],

[0,0,1,0,0]])


goodObsInd =非零(sum(X,1)< 2)#observation with< 2个缺少的变量

X = take(X,goodObsInd)#drop the bad


goodVarInd = nonzero(sum(X)== 0)#variables with没有遗漏的数据

X = take(X,goodVarInd,1)#drop the bad variables


print X

John Hunter


I have a largish data set (1000 observations x 100 floating point
variables), and some of the of the data are missing. I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution. I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values. I chose the threshold for dropping a row by eyeballing the
distribution of number of missing variables per observation, pick a
number on the low end of the distribution, and dropping the rows that
exceed the threshold.

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.
Example:

0 0 0
0 0 0 candidate sub matrix has 12 elements
0 0 0
0 0 0

1 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 candidate submatrix has 15 elements
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0

0 0
0 0 candidate submatrix has 8 elements
0 0
0 0

I want to programatically extract the 15 element matrix

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

from Numeric import nonzero, array, take, sum

X = array([[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])

goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
X = take(X, goodObsInd) # drop the bad

goodVarInd = nonzero(sum(X)==0) # variables with no missing data
X = take(X, goodVarInd, 1 ) # drop the bad variables

print X
John Hunter

推荐答案

2003年7月2日星期三14:16:57 -0500,John Hunter< jd ****** @ ace.bsd.uchicago.edu>写道:
On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jd******@ace.bsd.uchicago.edu> wrote:

我有一个较大的数据集(1000个观察值×100个浮点数变量),并且缺少一些数据。我想在数据上尝试各种聚类,神经网络等算法,并保持生活简单我想减少矩阵的尺寸
让我没有缺少值,因为并非所有的算法都能够处理它们,并且在变量中有足够的冗余可以让我失去一些。

我目前正在使用hack有效,但它让我想知道是否有最佳解决方案。我将最佳定义为删除行
和列,使得没有缺失值和
max(numRows * numCols)。

我当前的方法是删除行(观察)具有超过某些预定数量的缺失变量,然后删除具有任何缺失值的简化数据集的
列(变量)。我通过观察每次观察失踪变量数量的分布来选择下降行的阈值,在分布的低端选择一个
数字,然后放下超过
的行数。阈值。

制定问题的另一种方法:对于稀疏布尔矩阵
(稀疏为True),删除行和列的最佳方法是什么?矩阵中元素的数量是最大的,并且没有剩下的真值。

例如:

0 0 0
0 0 0候选子矩阵有12个元素0 0 0
0 0 0
1 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0候选子矩阵有15个元素
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0

0 0 > 0 0候选子矩阵有8个元素
0 0
0 0

我想以编程方式提取15个元素矩阵


如果我理解你最初的最优性离子,这将是次优的。

即,16元素矩阵怎么样? (x'标记相应的零)


1 0 0 0 1

xx 0 xx

xx 0 xx

xx 0 xx

xx 1 xx


或者他们必须相邻吗?

按照上述方法我在下面的例子中得到了理想的答案,虽然这是一个黑客解决方案,我感觉有一个更好的答案。

来自Numeric import nonzero,array ,取,总和

X =数组([[1,0,0,0,1],
[0,0,0,0,0],
[ 0,0,0,0,0],
[0,0,0,0,0],
[0,0,1,0,0]])

goodObsInd =非零(sum(X,1)< 2)#observation with< 2缺少变量
X = take(X,goodObsInd)#drop bad


goodVarInd =非零(sum(X)== 0)#变量没有丢失数据
X = take(X,goodVarInd,1)#pall the bad variables

print X

I have a largish data set (1000 observations x 100 floating point
variables), and some of the of the data are missing. I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution. I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values. I chose the threshold for dropping a row by eyeballing the
distribution of number of missing variables per observation, pick a
number on the low end of the distribution, and dropping the rows that
exceed the threshold.

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.
Example:

0 0 0
0 0 0 candidate sub matrix has 12 elements
0 0 0
0 0 0

1 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 candidate submatrix has 15 elements
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0

0 0
0 0 candidate submatrix has 8 elements
0 0
0 0

I want to programatically extract the 15 element matrix
If I understand your original optimality definition, that would be suboptimal.
I.e., how about the 16-element matrix? (x''s mark the corresponding zeroes)

1 0 0 0 1
x x 0 x x
x x 0 x x
x x 0 x x
x x 1 x x

Or do they have to be adjacent?

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

from Numeric import nonzero, array, take, sum

X = array([[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])

goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
X = take(X, goodObsInd) # drop the bad

goodVarInd = nonzero(sum(X)==0) # variables with no missing data
X = take(X, goodVarInd, 1 ) # drop the bad variables

print X




暴力似乎适用于这个小例子。也许它可以记住和优化和/或任何能够足够快地处理你的大矩阵的东西吗?


====< submatrix.py> ================================

X = [

[1,0,0,0,1],

[0,0,0,0,0],

[0,0 ,0,0,0],

[0,0,0,0,0],

[0,0,1,0,0]
]


def optimrcs(mx,rows = [],cols = []):

如果不是行或不是cols:return 0,[],[]

maxscore = 0
行中r的


如果不是mx [r] .count(1) :对于col中的c,继续



如果不是mx [r] [c]:继续

#eval column submatrix vs row submatrix

colsxthis = cols [:]; colsxthis.remove(c)

colscore,rset,cset = optimrcs(mx,rows,colsxthis)

如果colscore> maxscore:

maxscore ,maxrows,maxcols = colscore,rset,cset

rowsxthis = rows [:]; rowsxthis.remove(r)

rowscore,rset,cset = optimrcs(mx,rowsxthis,cols)

if rowscore> = maxscore:

maxscore,maxrows,maxcols = rowscore,rset,cset

如果不是maxscore:

返回len(行)* len(cols),行,cols

返回maxscore,maxrows,maxcols


if __name__ ==''__ main__'':

得分,rowsel,colsel = optimrcs(X,范围(5),范围(5))

打印''得分=%s,行=%s,cols =%s''%(得分,rowsel,colsel)

打印

范围内的r(5):

row = X [r]

范围内的c(5):

如果r在rowsel中,c在colsel中:

print''<%s>''%row [c],

else :

打印''%s''%行[c],

打印

=========== ======================================= ==

跑步结果因此:


[21:24] C:\ pywk \ clp> submatrix.py

得分= 16,行= [1, 2,3,4],cols = [ 0,1,3,4]

1 0 0 0 1

< 0> &℃,GT; 0< 0> < 0>

< 0> &℃,GT; 0< 0> < 0>

< 0> &℃,GT; 0< 0> < 0>

< 0> &℃,GT; 1< 0> < 0>


问候,

Bengt Richter



Brute force seems to work on this little example. Maybe it can
be memoized and optimized and/or whatever to handle your larger matrix fast enough?

====< submatrix.py >================================
X = [
[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]
]

def optimrcs(mx, rows=[], cols=[]):
if not rows or not cols: return 0,[],[]
maxscore = 0
for r in rows:
if not mx[r].count(1): continue
for c in cols:
if not mx[r][c]: continue
# eval column submatrix vs row submatrix
colsxthis = cols[:]; colsxthis.remove(c)
colscore, rset, cset = optimrcs(mx, rows, colsxthis)
if colscore>maxscore:
maxscore, maxrows, maxcols = colscore, rset, cset
rowsxthis = rows[:]; rowsxthis.remove(r)
rowscore, rset, cset = optimrcs(mx, rowsxthis, cols)
if rowscore >= maxscore:
maxscore, maxrows, maxcols = rowscore, rset, cset
if not maxscore:
return len(rows)*len(cols), rows, cols
return maxscore, maxrows,maxcols

if __name__ == ''__main__'':
score, rowsel, colsel = optimrcs(X, range(5), range(5))
print ''Score = %s, rows = %s, cols = %s'' % (score,rowsel,colsel)
print
for r in range(5):
row = X[r]
for c in range(5):
if r in rowsel and c in colsel:
print ''<%s>''% row[c],
else:
print '' %s ''% row[c],
print
================================================== ==
Running it results thus:

[21:24] C:\pywk\clp>submatrix.py
Score = 16, rows = [1, 2, 3, 4], cols = [0, 1, 3, 4]

1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>

Regards,
Bengt Richter




John Hunter < JD ****** @ ace.bsd.uchicago.edu>在留言中写道

新闻:ma ********************************** @ python .o rg ...

"John Hunter" <jd******@ace.bsd.uchicago.edu> wrote in message
news:ma**********************************@python.o rg...

我有一个较大的数据集(1000个观察值×100个浮点数变量),并且缺少一些数据。


一切都很典型 - 缺少数据是统计数据的祸根。

我想尝试各种聚类,神经网络等,关于数据的算法,以及保持生活简单我想减少
矩阵的维度,这样我就没有缺失值,因为并非所有的算法都能够处理它们并且在变量中有足够的冗余,我可以承受失去一些。

I have a largish data set (1000 observations x 100 floating point
variables), and some of the of the data are missing.
All too typical -- missing data are the bane of statistics.
I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.




统计学家尝试了各种方法。谷歌搜索''

统计数据缺少数据如果你愿意,你会给你一些线索。


Terry J. Reedy



Statisticians have tried a variety of approaches. Googling for ''
statistics "missing data" ''will give you some leads if you want.

Terry J. Reedy


2003年7月2日星期三14: 16:57-0500,John Hunter< jd ****** @ ace.bsd.uchicago.edu>写道:
On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jd******@ace.bsd.uchicago.edu> wrote:

我有一个较大的数据集(1000个观察值×100个浮点数变量),并且缺少一些数据。我想在数据上尝试各种聚类,神经网络等算法,并保持生活简单我想减少矩阵的尺寸
让我没有缺少值,因为并非所有的算法都能够处理它们,并且在变量中有足够的冗余可以让我失去一些。

我目前正在使用hack有效,但它让我想知道是否有最佳解决方案。我将最佳定义为删除行
和列,使得没有缺失值和
max(numRows * numCols)。

我当前的方法是删除行(观察)具有超过某些预定数量的缺失变量,然后删除具有任何缺失值的简化数据集的
列(变量)。我通过观察每次观察失踪变量数量的分布来选择下降行的阈值,在分布的低端选择一个
数字,然后放下超过
的行数。阈值。

制定问题的另一种方法:对于稀疏布尔矩阵
(稀疏为True),删除行和列的最佳方法是什么?矩阵中元素的数量是最大的,并且没有剩下的真值。

例如:

0 0 0
0 0 0候选子矩阵有12个元素0 0 0
0 0 0
1 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0候选子矩阵有15个元素
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0

0 0 > 0 0候选子矩阵有8个元素
0 0
0 0

我想以编程方式提取15个元素矩阵


如果我理解你最初的最优性离子,这将是次优的。

即,16元素矩阵怎么样? (x'标记相应的零)


1 0 0 0 1

xx 0 xx

xx 0 xx

xx 0 xx

xx 1 xx


或者他们必须相邻吗?

按照上述方法我在下面的例子中得到了理想的答案,虽然这是一个黑客解决方案,我感觉有一个更好的答案。

来自Numeric import nonzero,array ,取,总和

X =数组([[1,0,0,0,1],
[0,0,0,0,0],
[ 0,0,0,0,0],
[0,0,0,0,0],
[0,0,1,0,0]])

goodObsInd =非零(sum(X,1)< 2)#observation with< 2缺少变量
X = take(X,goodObsInd)#drop bad


goodVarInd =非零(sum(X)== 0)#变量没有丢失数据
X = take(X,goodVarInd,1)#pall the bad variables

print X

I have a largish data set (1000 observations x 100 floating point
variables), and some of the of the data are missing. I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution. I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values. I chose the threshold for dropping a row by eyeballing the
distribution of number of missing variables per observation, pick a
number on the low end of the distribution, and dropping the rows that
exceed the threshold.

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.
Example:

0 0 0
0 0 0 candidate sub matrix has 12 elements
0 0 0
0 0 0

1 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 candidate submatrix has 15 elements
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0

0 0
0 0 candidate submatrix has 8 elements
0 0
0 0

I want to programatically extract the 15 element matrix
If I understand your original optimality definition, that would be suboptimal.
I.e., how about the 16-element matrix? (x''s mark the corresponding zeroes)

1 0 0 0 1
x x 0 x x
x x 0 x x
x x 0 x x
x x 1 x x

Or do they have to be adjacent?

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

from Numeric import nonzero, array, take, sum

X = array([[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])

goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
X = take(X, goodObsInd) # drop the bad

goodVarInd = nonzero(sum(X)==0) # variables with no missing data
X = take(X, goodVarInd, 1 ) # drop the bad variables

print X




暴力似乎适用于这个小例子。也许它可以记住和优化和/或任何能够足够快地处理你的大矩阵的东西吗?


====< submatrix.py> ================================

X = [

[1,0,0,0,1],

[0,0,0,0,0],

[0,0 ,0,0,0],

[0,0,0,0,0],

[0,0,1,0,0]
]


def optimrcs(mx,rows = [],cols = []):

如果不是行或不是cols:return 0,[],[]

maxscore = 0
行中r的


如果不是mx [r] .count(1) :对于col中的c,继续



如果不是mx [r] [c]:继续

#eval column submatrix vs row submatrix

colsxthis = cols [:]; colsxthis.remove(c)

colscore,rset,cset = optimrcs(mx,rows,colsxthis)

如果colscore> maxscore:

maxscore ,maxrows,maxcols = colscore,rset,cset

rowsxthis = rows [:]; rowsxthis.remove(r)

rowscore,rset,cset = optimrcs(mx,rowsxthis,cols)

if rowscore> = maxscore:

maxscore,maxrows,maxcols = rowscore,rset,cset

如果不是maxscore:

返回len(行)* len(cols),行,cols

返回maxscore,maxrows,maxcols


if __name__ ==''__ main__'':

得分,rowsel,colsel = optimrcs(X,范围(5),范围(5))

打印''得分=%s,行=%s,cols =%s''%(得分,rowsel,colsel)

打印

范围内的r(5):

row = X [r]

范围内的c(5):

如果r在rowsel中,c在colsel中:

print''<%s>''%row [c],

else :

打印''%s''%行[c],

打印

=========== ======================================= ==

跑步结果因此:


[21:24] C:\ pywk \ clp> submatrix.py

得分= 16,行= [1, 2,3,4],cols = [ 0,1,3,4]

1 0 0 0 1

< 0> &℃,GT; 0< 0> < 0>

< 0> &℃,GT; 0< 0> < 0>

< 0> &℃,GT; 0< 0> < 0>

< 0> &℃,GT; 1< 0> < 0>


问候,

Bengt Richter



Brute force seems to work on this little example. Maybe it can
be memoized and optimized and/or whatever to handle your larger matrix fast enough?

====< submatrix.py >================================
X = [
[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]
]

def optimrcs(mx, rows=[], cols=[]):
if not rows or not cols: return 0,[],[]
maxscore = 0
for r in rows:
if not mx[r].count(1): continue
for c in cols:
if not mx[r][c]: continue
# eval column submatrix vs row submatrix
colsxthis = cols[:]; colsxthis.remove(c)
colscore, rset, cset = optimrcs(mx, rows, colsxthis)
if colscore>maxscore:
maxscore, maxrows, maxcols = colscore, rset, cset
rowsxthis = rows[:]; rowsxthis.remove(r)
rowscore, rset, cset = optimrcs(mx, rowsxthis, cols)
if rowscore >= maxscore:
maxscore, maxrows, maxcols = rowscore, rset, cset
if not maxscore:
return len(rows)*len(cols), rows, cols
return maxscore, maxrows,maxcols

if __name__ == ''__main__'':
score, rowsel, colsel = optimrcs(X, range(5), range(5))
print ''Score = %s, rows = %s, cols = %s'' % (score,rowsel,colsel)
print
for r in range(5):
row = X[r]
for c in range(5):
if r in rowsel and c in colsel:
print ''<%s>''% row[c],
else:
print '' %s ''% row[c],
print
================================================== ==
Running it results thus:

[21:24] C:\pywk\clp>submatrix.py
Score = 16, rows = [1, 2, 3, 4], cols = [0, 1, 3, 4]

1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>

Regards,
Bengt Richter


这篇关于得到一个真正的子矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆