Hadoop: How can I have an array of doubles as a value in a key-value pair?
Question
I have a problem where I need to aggregate some vectors in order to compute some statistics. For example, I have vectors of doubles and I need to sum them element-wise. My vectors look like this:
1,0,3,4,5
2,3,4,5,6
3,4,5,5,6
My key-value pairs so far are (String, String). But every time I need to add these vectors, I first have to convert them to double arrays, sum them, and finally convert the aggregate vector back into a string. I think it would be a lot faster if I could just have key-value pairs of the form (String, double array), so there would be no need to convert back and forth. My problem is that I can't find a way to have a double array as the value. Is there any easy way, other than creating a new custom type?
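For reference, the string round-trip the question describes — parse each comma-separated line into doubles, sum element-wise, re-encode — might look like this (a minimal sketch; the helper names are illustrative, not from the original code):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class VectorSum {

    // Parse a comma-separated line such as "1,0,3,4,5" into a double array.
    static double[] parse(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
    }

    // Element-wise sum of two equal-length vectors.
    static double[] add(double[] a, double[] b) {
        double[] sum = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }

    // Re-encode a vector as a comma-separated string.
    static String encode(double[] v) {
        return Arrays.stream(v).mapToObj(String::valueOf).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        double[] total = add(add(parse("1,0,3,4,5"), parse("2,3,4,5,6")), parse("3,4,5,5,6"));
        System.out.println(encode(total)); // 6.0,7.0,12.0,14.0,17.0
    }
}
```

Doing this parse/encode pair once per record is exactly the overhead the question wants to avoid.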
Answer
Do you mean something like this?
Map<String, List<Double>> arrays = new HashMap<String, List<Double>>();
double[] array = {0.0, 1.1, 2.2, 3.3};
// Note: Arrays.asList(array) on a double[] would yield a List<double[]>, so box each element instead:
List<Double> values = new ArrayList<Double>();
for (double d : array) values.add(d);
arrays.put("ArrayKey", values);
Then you could call your map method:
map(String key, String arrayKey) {
    List<Double> value = arrays.get(arrayKey);
}
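In Hadoop itself, the usual way to avoid the string round-trip is to use a `Writable` value type. Hadoop already ships `ArrayWritable` in `org.apache.hadoop.io`, which can carry an array of `DoubleWritable`s with only a small subclass — a sketch (it needs hadoop-common on the classpath, and the class name here is illustrative):

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Writable;

// A value type usable directly in mapper/reducer signatures, e.g. (Text, DoubleArrayWritable).
public class DoubleArrayWritable extends ArrayWritable {

    public DoubleArrayWritable() {
        super(DoubleWritable.class); // Hadoop needs the no-arg constructor for deserialization
    }

    public DoubleArrayWritable(double[] values) {
        super(DoubleWritable.class);
        DoubleWritable[] boxed = new DoubleWritable[values.length];
        for (int i = 0; i < values.length; i++) {
            boxed[i] = new DoubleWritable(values[i]);
        }
        set(boxed);
    }

    // Unwrap back to a plain double[] inside the reducer.
    public double[] toDoubleArray() {
        Writable[] raw = get();
        double[] values = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            values[i] = ((DoubleWritable) raw[i]).get();
        }
        return values;
    }
}
```

This is less "new custom type" work than implementing `Writable` from scratch, since `ArrayWritable` already handles the wire format.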
You can also serialize your double array and later deserialize it back:
package test;

import org.apache.commons.codec.binary.Base64InputStream;
import org.apache.commons.codec.binary.Base64OutputStream;

import java.io.*;
import java.util.Arrays;

public class Test {

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        double[] array = {0.0, 1.1, 2.2, 3.3};
        String stringValue = serialize(array);
        map("Key", stringValue);
    }

    public static void map(String key, String value) throws ClassNotFoundException, IOException {
        double[] array = deserialize(value);
        System.out.println("Key=" + key + "; Value=" + Arrays.toString(array));
    }

    // Java object serialization, Base64-wrapped so the result is a printable String.
    public static String serialize(double[] array) throws IOException {
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        Base64OutputStream base64OutputStream = new Base64OutputStream(byteArrayOutputStream);
        ObjectOutputStream oos = new ObjectOutputStream(base64OutputStream);
        oos.writeObject(array);
        oos.flush();
        oos.close(); // closing the chain also flushes the Base64 padding
        return byteArrayOutputStream.toString();
    }

    public static double[] deserialize(String stringArray) throws IOException, ClassNotFoundException {
        ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(stringArray.getBytes());
        Base64InputStream base64InputStream = new Base64InputStream(byteArrayInputStream);
        ObjectInputStream iis = new ObjectInputStream(base64InputStream);
        return (double[]) iis.readObject();
    }
}
Output:
Key=Key; Value=[0.0, 1.1, 2.2, 3.3]
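Java object serialization plus a Base64 wrapper carries per-array overhead (stream headers, class descriptors). If all you need is a compact printable string, a leaner sketch writes a length prefix followed by the raw doubles via `DataOutputStream` and Base64-encodes the bytes with `java.util.Base64` (available since Java 8). This is an illustrative alternative, not part of the original answer:

```java
import java.io.*;
import java.util.Arrays;
import java.util.Base64;

public class CompactCodec {

    // Encoding: a 4-byte length prefix, then 8 bytes per double, Base64-wrapped.
    public static String serialize(double[] array) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(array.length);
        for (double d : array) {
            out.writeDouble(d);
        }
        out.flush();
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    public static double[] deserialize(String encoded) throws IOException {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(Base64.getDecoder().decode(encoded)));
        double[] array = new double[in.readInt()];
        for (int i = 0; i < array.length; i++) {
            array[i] = in.readDouble();
        }
        return array;
    }

    public static void main(String[] args) throws IOException {
        double[] array = {0.0, 1.1, 2.2, 3.3};
        System.out.println(Arrays.toString(deserialize(serialize(array))));
    }
}
```

No class metadata is written, so the encoded string is noticeably shorter than the `ObjectOutputStream` version for the same array.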
Mapping is faster, but serialization is more useful if you run on multiple nodes in a cluster, i.e. if you need to pass your arrays to another JVM:
private static class SpeedTest {

    // Arrays.asList(array) on a double[] yields a List<double[]>, hence the element type here.
    private static final Map<String, List<double[]>> arrays = new HashMap<String, List<double[]>>();

    public static void test(final double[] array) throws IOException, ClassNotFoundException {
        final String str = serialize(array);
        final int amount = 10 * 1000;

        long timeStamp = System.currentTimeMillis();
        for (int i = 0; i < amount; i++) {
            serialize(array);
        }
        System.out.println("Serialize: " + (System.currentTimeMillis() - timeStamp) + " ms");

        timeStamp = System.currentTimeMillis();
        for (int i = 0; i < amount; i++) {
            deserialize(str);
        }
        System.out.println("Deserialize: " + (System.currentTimeMillis() - timeStamp) + " ms");

        arrays.clear();
        timeStamp = System.currentTimeMillis();
        // Prepare a map that holds a reference to every array.
        for (int i = 0; i < amount; i++) {
            arrays.put("key_" + i, Arrays.asList(array));
        }
        // Get each array back by its key.
        for (int i = 0; i < amount; i++) {
            arrays.get("key_" + i).toArray();
        }
        System.out.println("Mapping: " + (System.currentTimeMillis() - timeStamp) + " ms");
    }
}
Output:
Serialize: 298 ms
Deserialize: 254 ms
Mapping: 27 ms