Conditional Data Manipulation in Mathematica


Question

I am trying to prepare the best tools for efficient data analysis in Mathematica. I have approximately 300 columns and 100,000 rows.

What are the best techniques to:

"Remove", "Extract" or simply "Consider" parts of the data structure, for example for plotting?

One of the trickiest examples I could think of is:


Given a data structure:

Extract columns 1 to 3, 6 to 9, and the last one, for every row where the value in column 2 is equal to x and the value in column 8 is different from y.

I also welcome any general advice on data manipulation.

Recommended answer

For a generic manipulation of data in a table with named columns, I refer you to this solution of mine for a similar question. For any particular case, it might be easier to write a function for Select manually. However, with many columns and many different queries, the chances of messing up the indexes are high. Here is the modified solution from the mentioned post, which provides a friendlier syntax:

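Clear[getIds];
(* Returns the first column of the data rows; not used by select below *)
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];

ClearAll[select, from, where];
(* where holds its argument, so the condition is not evaluated before the
   column names in it are replaced by slots *)
SetAttributes[where, HoldAll];
(* colRules maps each column name to a slot (e.g. "c2" -> #2);
   indexRules maps each column name to its position *)
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
  With[{colRules = Dispatch[Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
        indexRules = Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
    With[{selF = Apply[Function, Hold[condition] /. colRules]},
      Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];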

What happens here is that the function used in Select gets generated automatically from your specifications. For example (using @Yoda's example):

rows = Array[#1 #2 &, {5, 15}];

We need to define the column names (they must be strings or symbols without values):

In[425]:= 
colnames = "c" <> ToString[#] & /@ Range[15]

Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", 
"c13", "c14", "c15"}

(In practice, names are usually more descriptive, of course.) Here is the table:

table = Prepend[rows, colnames];

Here is the select statement you need (I picked x = 4 and y = 2):

select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
    where["c2" == 4 && "c8" != 2]]

{{2, 4, 6, 12, 14, 16, 18, 30}}
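For comparison, here is a minimal sketch of the same query written by hand with Select and Part, the approach the answer mentions as an option for a one-off case. The names x, y, cols and manual are only illustrative, and the data is the same sample array used above.

```mathematica
(* Same sample data as above: row i is {i, 2 i, ..., 15 i} *)
rows = Array[#1 #2 &, {5, 15}];

(* Illustrative values for x and y, matching the query above *)
x = 4; y = 2;

(* Columns 1 to 3, 6 to 9, and the last one (-1 counts from the end) *)
cols = Join[Range[1, 3], Range[6, 9], {-1}];

(* Keep rows whose 2nd entry equals x and whose 8th entry differs from y,
   then take only the wanted columns *)
manual = Select[rows, #[[2]] == x && #[[8]] != y &][[All, cols]]
(* {{2, 4, 6, 12, 14, 16, 18, 30}} *)
```

The hand-written form is shorter, but every column is addressed by position, which is exactly where mistakes creep in once the table has hundreds of columns.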

Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as

In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]

Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}

and so on.
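To see what select builds from a where clause, the substitution step can be run on its own. The sketch below uses a hypothetical three-column table and constructs the slot rules with Array[Slot, n], which should be equivalent to the Thread[Slot[Range[n]]] form in the definition; Dispatch is left out because it only speeds up rule lookup.

```mathematica
colNames = {"c1", "c2", "c3"};

(* Map each column name to the slot for that column: "c1" -> #1, "c2" -> #2, ... *)
colRules = Thread[colNames -> Array[Slot, Length[colNames]]];

(* Replace names inside Hold so the condition is not evaluated early,
   then turn Hold[...] into Function[...] *)
selF = Apply[Function, Hold[EvenQ["c2"] && "c3" > 5] /. colRules]
(* selF is the pure function EvenQ[#2] && #3 > 5 & *)

r1 = selF @@ {1, 4, 6}   (* True: 4 is even and 6 > 5 *)
r2 = selF @@ {1, 3, 6}   (* False: 3 is odd *)
```

The Hold wrapper matters: without it, a comparison such as "c3" > 5 would be evaluated before the column name could be replaced by a slot.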

Of course, if there are specific correlations in your data, you might find a special-purpose algorithm that is faster. The function above can be extended in many ways: to simplify common queries (include "all", etc.), or to auto-compile the generated pure function (if possible).
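As one possible sketch of such an extension, the rule below lets All stand for every column. This is an assumption about how an "all" shortcut might look, not part of the original solution; the select definition is repeated only so that the example runs on its own, and the four-column table is made up for illustration.

```mathematica
ClearAll[select, from, where];
SetAttributes[where, HoldAll];

(* The select definition from the answer, repeated for self-containment *)
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
  With[{colRules = Dispatch[Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
        indexRules = Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
    With[{selF = Apply[Function, Hold[condition] /. colRules]},
      Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];

(* Assumed extension: All means every column, in table order.
   The where clause is passed through unevaluated because where is HoldAll. *)
select[All, from[table : {colNames_List, rows__List}], w_where] :=
  select[colNames, from[table], w];

(* Hypothetical small table: row i is {i, 2 i, 3 i, 4 i} *)
table = Prepend[Array[#1 #2 &, {3, 4}], {"a", "b", "c", "d"}];

result = select[All, from[table], where["b" > 2]]
(* {{2, 4, 6, 8}, {3, 6, 9, 12}} *)
```

Because the new rule simply forwards to the existing definition, other shortcuts (for example, a list of columns to exclude) could be added the same way without touching the core function.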

Edit

On a philosophical note, I am sure that many Mathematica users (myself included) have found themselves writing similar code again and again. Mathematica's concise syntax often makes such code very easy to write for any particular case. However, as long as one works in some specific domain (for example, data manipulation in a table), the cost of repeating yourself will be high for many operations. What my example illustrates, in a very simple setting, is one possible way out: create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it and write a compiler from it to Mathematica (to generate Mathematica code automatically). The example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.
