Map a table of a Cassandra database using Spark and RDD


Question

I have to map a table that records the usage history of an app. The table contains tuples of this form:

<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>

AppId is always different, because the table references many apps. date is expressed in the format dd/mm/yyyy hh/mm, and cpuUsage and memoryUsage are expressed in %. For example:

<3ghffh3t482age20304,230720142245,0.2,3,5>
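The date field in the example row appears to be the stated format with the separators dropped (230720142245 read as 23/07/2014 22:45). A minimal sketch of parsing it, assuming that layout; the class name AppUsageDates and the method parse are hypothetical and not part of the original code:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class AppUsageDates {
    // Assumption: day, month, year, hour, minute with no separators (ddMMyyyyHHmm).
    private static final DateTimeFormatter COMPACT =
            DateTimeFormatter.ofPattern("ddMMyyyyHHmm");

    // Parses a compact date string such as the one in the example row.
    static LocalDateTime parse(String raw) {
        return LocalDateTime.parse(raw, COMPACT);
    }

    public static void main(String[] args) {
        // Prints the parsed timestamp in ISO-8601 form.
        System.out.println(parse("230720142245"));
    }
}
```

Storing the date as text in this layout also means rows within a partition sort by day first rather than chronologically, which may matter for the clustering order.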

I retrieved the data from Cassandra this way (short snippet):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public static void main(String[] args) {
        // Connect to a local Cassandra node
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        session.execute("CREATE KEYSPACE IF NOT EXISTS foo WITH replication "
                + "= {'class':'SimpleStrategy', 'replication_factor':3};");
        // date is the clustering column, so the clustering order must name it;
        // the PRIMARY KEY clause also needs its closing parenthesis
        String createTableAppUsage = "CREATE TABLE IF NOT EXISTS foo.appusage"
                + "(appid text, date text, cpuusage double, memoryusage double, "
                + "PRIMARY KEY(appid, date)) WITH CLUSTERING ORDER BY (date ASC);";
        session.execute(createTableAppUsage);
        // Use select to get the appusage table's rows
        ResultSet resultForAppUsage = session.execute("SELECT appid,cpuusage FROM foo.appusage");
        for (Row row : resultForAppUsage) {
            // cpuusage is declared as double, so read it with getDouble
            System.out.println("appid: " + row.getString("appid") + " cpuusage: " + row.getDouble("cpuusage"));
        }
        // Clean up the connection by closing it
        cluster.close();
    }

So my problem now is to map the data by key and create tuples of the form below, integrating this code (a snippet that doesn't work):

        <AppId,cpuusage>

        JavaPairRDD<String, Integer> saveTupleKeyValue =someStructureFromTakeData.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2(x, y);
            }

How can I map appid and cpuusage using an RDD and then reduce the result, e.g. to cpuusage > 50?

Any help?

Thanks in advance.

Recommended answer

Assuming you already have a valid SparkContext named sparkContext, have added the spark-cassandra-connector dependencies to your project, and have configured your Spark application to talk to your Cassandra cluster (see the docs for that), you can load the data into an RDD like this:

import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

val data = sparkContext.cassandraTable("foo", "appusage").select("appid", "cpuusage")

In Java, the idea is the same but it requires a bit more plumbing, described here
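
Once the rows are loaded, the two remaining steps the question asks about are mapping each row to an (appid, cpuusage) pair and keeping only the pairs with cpuusage > 50. The following is a minimal plain-Java sketch of that logic, deliberately written without Spark so it runs standalone; UsageRow and highCpu are hypothetical names used only for illustration. On a Spark RDD the same two steps would roughly correspond to a mapToPair call followed by a filter call.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class HighCpuSketch {
    // Hypothetical stand-in for one row of foo.appusage (only the selected columns).
    static final class UsageRow {
        final String appId;
        final double cpuUsage;

        UsageRow(String appId, double cpuUsage) {
            this.appId = appId;
            this.cpuUsage = cpuUsage;
        }
    }

    // Maps rows to (appId, cpuUsage) pairs and keeps pairs strictly above the threshold.
    static List<Map.Entry<String, Double>> highCpu(List<UsageRow> rows, double threshold) {
        return rows.stream()
                .map(r -> Map.entry(r.appId, r.cpuUsage))  // the "map to pair" step
                .filter(e -> e.getValue() > threshold)     // the "cpuusage > 50" step
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        List<UsageRow> rows = List.of(
                new UsageRow("app-a", 20.0),
                new UsageRow("app-b", 75.5));
        System.out.println(highCpu(rows, 50.0));
    }
}
```

Filtering after the pairs are built keeps the key available for any later per-app aggregation; if only the threshold matters, filtering the rows first would work equally well.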
