# phase-4-task-4

**Repository Path**: jhhe66/phase-4-task-4

## Basic Information

- **Project Name**: phase-4-task-4
- **Description**: Big data assignment, phase 4, task 4
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-13
- **Last Updated**: 2021-04-13

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Homework:

## Short-answer question:

```scala
package com.jhhe.homework4_4

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDemo {
  def main(args: Array[String]): Unit = {
    // Use the object's canonical name (with the trailing '$' dropped) as the app name
    val init: String = this.getClass.getCanonicalName.init
    println(init)
    val conf: SparkConf = new SparkConf().setAppName(init).setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("warn")

    val random = scala.util.Random
    // 49 (key, value) pairs with random keys in [0, 10)
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"),
      (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // Join without pre-partitioning either side
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    println(rdd3.dependencies)

    // Join after partitioning both sides with the same HashPartitioner(3)
    val rdd4: RDD[(Int, (String, String))] =
      rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    println(rdd4.dependencies)

    sc.stop()
  }
}
```

What do the two print statements output? Are the corresponding dependencies wide or narrow, and why? More generally, when is a `join` operation a wide dependency and when is it a narrow dependency? Answer with the help of the `join`-related source code.

Answer: the printed results are:

```
List(org.apache.spark.OneToOneDependency@35088e87)
List(org.apache.spark.OneToOneDependency@1dc2de84)
```

Both prints show a `OneToOneDependency`, i.e. a narrow dependency. Note that this is the dependency of the RDD that `join` returns (the `MapPartitionsRDD` produced by `flatMapValues`), which is one-to-one in both cases; whether the join itself needs a shuffle is decided one level deeper, when the underlying `CoGroupedRDD` builds its dependencies, which is what the source walk-through below shows.

Entering the source: `join` itself does almost nothing here; it only picks the default partitioner and delegates, so step into the `defaultPartitioner()` method:

```scala
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
  join(other, defaultPartitioner(self, other))
}

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  // Check whether any of the input RDDs already has a partitioner set
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
  // If some do, pick the one with the largest number of partitions
  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }
  // If spark.default.parallelism is set, use it as the default partition count;
  // otherwise use the largest partition count among the input RDDs
  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }

  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than the default number of partitions, use the existing partitioner.
  // In other words: reuse an existing partitioner when one was found and it is either
  // "eligible" or has more partitions than the default; otherwise build a new
  // HashPartitioner with the default number of partitions
  if (hasMaxPartitioner.nonEmpty &&
      (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}

private def isEligiblePartitioner(
    hasMaxPartitioner: RDD[_],
    rdds: Seq[RDD[_]]): Boolean = {
  val maxPartitions = rdds.map(_.partitions.length).max
  log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
}
```
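This choice can be checked with a minimal sketch. It assumes the `rdd1`/`rdd2` from `JoinDemo` are still in scope (for example, the snippet is pasted into `main` before `sc.stop()`) and that `org.apache.spark.Partitioner.defaultPartitioner` is callable from user code, as in recent Spark releases; the exact partition counts depend on the `local[*]` parallelism of the machine:

```scala
import org.apache.spark.{HashPartitioner, Partitioner}

// Case 1: neither parent has a partitioner and spark.default.parallelism is not set in JoinDemo,
// so defaultPartitioner falls back to a new HashPartitioner sized to the largest parent partition count
val partForRdd3 = Partitioner.defaultPartitioner(rdd1, rdd2)
println(partForRdd3.numPartitions)  // == the larger of rdd1/rdd2's partition counts (local[*] parallelism here)

// Case 2: both parents are already hash-partitioned into 3 partitions; that partitioner is
// "eligible" (log10(3) - log10(3) = 0 < 1), so defaultPartitioner reuses it instead of creating a new one
val rdd1Part = rdd1.partitionBy(new HashPartitioner(3))
val rdd2Part = rdd2.partitionBy(new HashPartitioner(3))
val partForRdd4 = Partitioner.defaultPartitioner(rdd1Part, rdd2Part)
println(partForRdd4.numPartitions)  // 3 -- the same HashPartitioner(3) both parents already use
```

Because the partitioner chosen for the second join is exactly the one both parents already carry, the equality check in `CoGroupedRDD.getDependencies` shown below succeeds and no shuffle dependency is created.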
After the default partitioner is obtained, execution enters the `join` overload that takes a partitioner. The key step is the cogroup, which creates `new CoGroupedRDD[K](Seq(self, other), partitioner)`:

```scala
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  // partitioner is the default partitioner chosen above; what matters is essentially its number of partitions
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
```

`CoGroupedRDD.getDependencies` is where it is decided whether the join is a wide or a narrow dependency:

```scala
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    // If a parent RDD's partitioner equals the partitioner chosen above
    // (for HashPartitioner this comes down to the same number of partitions),
    // the dependency is one-to-one (narrow); otherwise a shuffle (wide) dependency is added
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
```

Summary: if the RDDs on both sides of the `join` already use the same partitioner that the join itself uses (for `HashPartitioner` this effectively means the same number of partitions), then `rdd.partitioner == Some(part)` holds for both parents, no shuffle is produced, and the join is a narrow dependency; otherwise a `ShuffleDependency` is added and the join is a wide dependency. In the example above, `rdd4` meets this condition because both inputs were partitioned with the same `HashPartitioner(3)`, while the join that produces `rdd3` has to shuffle both unpartitioned inputs.
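As a cross-check on the example itself, here is a sketch that again assumes it is pasted into `main` of `JoinDemo` before `sc.stop()`, and that the lineage has the shape shown in the snippets above (`flatMapValues` over `mapValues` over the `CoGroupedRDD`). Walking two dependency levels down from the joined RDDs exposes the `CoGroupedRDD`, whose own dependencies show where the shuffle actually is:

```scala
// Walk from the RDD returned by join down to the underlying CoGroupedRDD:
// rdd3/rdd4 (flatMapValues) -> mapValues -> CoGroupedRDD, i.e. two one-to-one hops
val cogroupedForRdd3 = rdd3.dependencies.head.rdd.dependencies.head.rdd
val cogroupedForRdd4 = rdd4.dependencies.head.rdd.dependencies.head.rdd

println(cogroupedForRdd3.dependencies)  // two ShuffleDependency entries  -> this join is wide
println(cogroupedForRdd4.dependencies)  // two OneToOneDependency entries -> this join is narrow

// The lineage view tells the same story: rdd3's join introduces a new shuffle boundary,
// while in rdd4's lineage the only shuffles come from the two partitionBy calls
println(rdd3.toDebugString)
println(rdd4.toDebugString)
```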