Merge the intersection of two CSV files with Scala

Question

From input 1:

fruit, apple, cider  
animal, beef, burger

and input 2:

animal, beef, 5kg
fruit, apple, 2liter
fish, tuna, 1kg


I need to produce:

fruit, apple, cider, 2liter
animal, beef, burger, 5kg

The closest example I could get is:

object FileMerger {
  def main(args: Array[String]) {
    import scala.io._
    val f1 = (Source fromFile "file1.csv" getLines) map (_.split(", *")(1))
    val f2 = Source fromFile "file2.csv" getLines
    val out = new java.io.FileWriter("output.csv")
    f1 zip f2 foreach { x => out.write(x._1 + ", " + x._2 + "\n") }
    out.close
  }
}

The problem is that the example assumes that the two CSV files contain the same number of elements and in the same order. My merged result must only contain elements that are in the first and the second file. I am new to Scala, and any help will be greatly appreciated.

Recommended answer

You need an intersection of the two files: the lines from file1 and file2 which share some criteria. Consider this through a set theory perspective: you have two sets with some elements in common, and you need a new set with those elements. Well, there's more to it than that, because the lines aren't really equal...
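As a minimal sketch of the set-theory view, assuming the first column of each sample line is the shared criterion, the keys of the two inputs can be intersected directly. The object name and the in-memory sets are hypothetical stand-ins for the real files.

```scala
object KeyIntersectionSketch {
  // Hypothetical in-memory stand-ins for the first column of each file.
  val keys1: Set[String] = Set("fruit", "animal")
  val keys2: Set[String] = Set("animal", "fruit", "fish")

  // Only keys present in both sets survive; "fish" is dropped.
  val shared: Set[String] = keys1 intersect keys2
}
```

This only finds the common keys; it says nothing yet about which lines they came from, which is the complication mentioned above.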

So, let's say you read file1, and that's of type List[Input1]. We could code it like this, without getting into any details of what Input1 is:

import scala.io.Source // needed for Source.fromFile
case class Input1(line: String)
val f1: List[Input1] = Source.fromFile("file1.csv").getLines().map(Input1.apply).toList

We can do the same thing for file2 and List[Input2]:

case class Input2(line: String)
val f2: List[Input2] = Source.fromFile("file2.csv").getLines().map(Input2.apply).toList

You might be wondering why I created two different classes if they have the exact same definition. Well, if you were reading structured data, you would have two different types, so let's see how to handle that more complex case.

Ok, so how do we match them, since Input1 and Input2 are different types? Well, the lines are matched by keys, which, according to your code, are the first column in each. So let's create a class Key, and conversions Input1 => Key and Input2 => Key:

case class Key(key: String)
// The key is the text before the first comma (using a regex would be better).
def Input1IsKey(input: Input1): Key = Key(input.line.split(",").head)
def Input2IsKey(input: Input2): Key = Key(input.line.split(",").head)

Ok, now that we can produce a common Key from Input1 and Input2, let's get the intersection of them:

val intersection = (f1 map Input1IsKey).toSet intersect (f2 map Input2IsKey).toSet

So we can build the intersection of lines we want, but we don't have the lines! The problem is that, for each key, we need to know from which line it came. Consider that we have a set of keys, and for each key we want to keep track of a value -- that's exactly what a Map is! So we can build this:

val m1 = (f1 map (input => Input1IsKey(input) -> input)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input)).toMap
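One detail worth knowing before relying on these maps: toMap keeps only the last value seen for a given key. The sketch below illustrates this with hypothetical in-memory lines (the third line is invented for the illustration) and plain strings in place of Input1 and Key.

```scala
object KeyToLineSketch {
  // Hypothetical lines; "fruit" appears twice on purpose.
  val lines: List[String] =
    List("fruit, apple, cider", "animal, beef, burger", "fruit, pear, perry")

  // Key each line by the text before its first comma.
  // When a key repeats, the later entry overwrites the earlier one.
  val byKey: Map[String, String] =
    lines.map(line => line.takeWhile(_ != ',') -> line).toMap
}
```

If a file may contain several lines with the same key and all of them matter, grouping the lines (for example with groupBy) instead of calling toMap would be one way to keep them.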

So the output can be produced like this:

val output = intersection map (key => m1(key).line + ", " + m2(key).line)

All you have to do now is write the output.
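A minimal sketch of that last step, assuming the merged lines are available as a collection of strings. The helper name writeLines and its parameters are hypothetical; it uses java.io.PrintWriter and closes the writer even if a write fails.

```scala
import java.io.PrintWriter
import java.nio.file.Path

object WriteOutputSketch {
  // Writes one merged line per row and always closes the writer.
  def writeLines(path: Path, lines: Iterable[String]): Unit = {
    val out = new PrintWriter(path.toFile, "UTF-8")
    try lines.foreach(line => out.println(line))
    finally out.close()
  }
}
```

It could be called as WriteOutputSketch.writeLines(java.nio.file.Paths.get("output.csv"), output), where output is the collection computed above.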

Let's consider some improvements on this code. First, note that the output produced above repeats the key -- that's exactly what your code does, but not what you want in the example. Let's change, then, Input1 and Input2 to split the key from the rest of the args:

case class Input1(key: String, rest: String)
case class Input2(key: String, rest: String)

It's now a bit harder to initialize f1 and f2. Instead of using split, which would break up the whole line unnecessarily (and at a cost to performance), we'll divide the line right at the first comma: everything before it is the key, everything after it is the rest. The method span does that:

def breakLine(line: String): (String, String) = line span (',' !=)

Play a bit with the span method in the REPL to get a better understanding of it. As for (',' !=), that's just an abbreviated form of saying (x => ',' != x).
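For readers without a REPL at hand, here is a small sketch of what span does on one of the sample lines. It writes the predicate as _ != ',' and wraps the values in a hypothetical object so the snippet stands on its own.

```scala
object SpanSketch {
  // span splits at the first character for which the predicate fails:
  // the first half is everything before the first comma,
  // the second half starts with that comma.
  val (key, rest) = "fruit, apple, cider".span(_ != ',')

  // A line without any comma ends up entirely in the first half.
  val noComma: (String, String) = "fruit".span(_ != ',')
}
```

Here key is "fruit" and rest is ", apple, cider", so the second half keeps its leading comma.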

Next, we need a way to create Input1 and Input2 from a tuple (the result of breakLine):

def TupleIsInput1(tuple: (String, String)) = Input1(tuple._1, tuple._2)
def TupleIsInput2(tuple: (String, String)) = Input2(tuple._1, tuple._2)

We can now read the files:

val f1: List[Input1] = Source.fromFile("file1.csv").getLines().map(breakLine).map(TupleIsInput1).toList
val f2: List[Input2] = Source.fromFile("file2.csv").getLines().map(breakLine).map(TupleIsInput2).toList

Another thing we can simplify is the intersection. When we create a Map, its keys form a set, so we can create the maps first, and then use their key sets to compute the intersection:

case class Key(key: String)
def Input1IsKey(input: Input1): Key = Key(input.key)
def Input2IsKey(input: Input2): Key = Key(input.key)

// We now only keep the "rest" as the map value
val m1 = (f1 map (input => Input1IsKey(input) -> input.rest)).toMap
val m2 = (f2 map (input => Input2IsKey(input) -> input.rest)).toMap

val intersection = m1.keySet intersect m2.keySet

The output is then computed like this:

// key is a Key, so use its underlying string; key + ... would render as "Key(...)"
val output = intersection map (key => key.key + m1(key) + m2(key))

Note that I don't append comma anymore -- the rest of both f1 and f2 start with a comma already.
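With the sample data, keying on the first column alone leaves the second column of file2 in the merged line (fruit, apple, cider, apple, 2liter). The sketch below is one possible variation, not the approach described above: it assumes the first two columns together form the key, splits every field, and walks file1 in order so the output order is predictable. The object name, the parse helper, and the in-memory lists are hypothetical stand-ins for the real files.

```scala
object MergeSketch {
  // Hypothetical in-memory stand-ins for the lines of file1.csv and file2.csv.
  val lines1: List[String] = List("fruit, apple, cider", "animal, beef, burger")
  val lines2: List[String] =
    List("animal, beef, 5kg", "fruit, apple, 2liter", "fish, tuna, 1kg")

  // Assumption: the first two columns together form the key.
  // Lines with fewer than two columns would throw here.
  def parse(line: String): ((String, String), List[String]) = {
    val fields = line.split(",").map(_.trim).toList
    ((fields(0), fields(1)), fields.drop(2))
  }

  val m2: Map[(String, String), List[String]] = lines2.map(parse).toMap

  // Walk file1 in order and keep only rows whose key also appears in file2.
  val output: List[String] =
    for {
      (key, rest1) <- lines1.map(parse)
      rest2 <- m2.get(key).toList
    } yield (List(key._1, key._2) ++ rest1 ++ rest2).mkString(", ")
}
```

This trades the performance argument for span against simpler field handling; for large files the span-based split shown earlier may still be preferable.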
