Spark (5) Pair RDD

8月 29, 2018

Spark (5) Pair RDD

1. Pair RDD:

Pair RDD 是平行化後的 Array of Tuples，通常透過 map 或是平行化 Array of Tuples 產生。 Tuple 的格式是 (key, value)。

將 RDD map 成 Pair RDD，key 是原本 RDD 的元素，value 皆為 1:

val a = sc.parallelize(1 to 10)
val pair = a.map((_, 1))

平行化 Array of Tuples:

val pair = sc.parallelize(List((1,2),(1,4),(3,4),(3,6)))

2. keys, values, sortByKey:

取得 Pair RDD 的 keys 和 values。

val pair = sc.parallelize(List((1,2),(1,4),(3,4),(3,6)))

pair.keys.collect
res0: Array[Int] = Array(1, 1, 3, 3)

pair.values.collect
res1: Array[Int] = Array(2, 4, 4, 6)

使用鍵進行排序。

pair.sortByKey(true).collect()  // 升序
pair.sortByKey(false).collect() // 降序

3. reduceByKey, groupByKey:

利用 reduceByKey 將所有同 key 的 values 進行加總，統計頻率時很好用。

pair.reduceByKey((x,y)=> x+y).collect
res2: Array[(Int, Int)] = Array((1,6), (3,10))

利用 groupByKey 將所有同 key 的 values 進行分群。

pair.groupByKey().collect
res3: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(2, 4)), (3,CompactBuffer(4, 6)))

4. mapValues:

mapValues 將 values 帶入函式處理，下面的例子將所有 values 乘以 2 倍。

pair.mapValues(2*_).collect
res4: Array[(Int, Int)] = Array((1,4), (1,8), (3,8), (3,12))

搜尋此網誌

簡單最重要

Spark (5) Pair RDD

1. Pair RDD:

2. keys, values, sortByKey:

3. reduceByKey, groupByKey:

4. mapValues:

留言

張貼留言

熱門文章

用 Python Jupyter notebook 處理 CSV 檔

Chef (1) Install Chef Development Kit and Basic Ruby Syntax