# spark-on-yarn-hive-derby

**Repository Path**: crazypandariy/spark-on-yarn-hive-derby

## Basic Information

- **Project Name**: spark-on-yarn-hive-derby
- **Description**: A local Spark-on-YARN environment with Hive, using Derby in server mode as the metastore
- **Primary Language**: Docker
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 1
- **Created**: 2024-03-08
- **Last Updated**: 2025-03-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# spark-on-yarn-hive-derby

#### Introduction

A local Spark-on-YARN environment with Hive, using Derby in server mode as the metastore.

#### Software Versions

| Component   | Version             |
|-------------|---------------------|
| Spark image | bitnami/spark:3.1.2 |
| Hadoop      | 3.2.0               |
| Hive        | 3.1.2               |
| Derby       | 10.14.2.0           |

#### Prerequisites

1. Download Derby: https://archive.apache.org/dist/db/derby/db-derby-10.14.2.0/db-derby-10.14.2.0-bin.tar.gz
2. Download Hadoop: https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

#### Usage Overview

1. Build the base image
2. Build the on-yarn image
3. Build the on-yarn-hive image

#### Building the base image

> The mature bitnami/spark:3.1.2 image is used as the starting point; OpenSSH is installed on top of it to produce a base image with passwordless SSH login. Because both the master and worker nodes are built from this base image, they all share the same SSH keys, which simplifies installation and deployment.

```shell
docker build -t my/spark-base:3.1.2 -f base/Dockerfile .
```

#### Spark on YARN mode

##### Building the on-yarn image

```shell
docker build -t my/spark-hadoop:3.1.2 -f on-yarn/Dockerfile .
```

##### Starting the on-yarn cluster

###### Manual mode

```shell
# Create the cluster
docker-compose -f on-yarn/docker-compose-manul.yml -p spark up -d
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh

# Stop the cluster
docker-compose -f on-yarn/docker-compose-manul.yml -p spark stop
# Remove the cluster
docker-compose -f on-yarn/docker-compose-manul.yml -p spark down

# Start the cluster
docker-compose -f on-yarn/docker-compose-manul.yml -p spark start
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh
```

###### Automatic mode

```shell
# Create the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark up -d
# Stop the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark stop
# Start the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark start
# Remove the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark down
```

#### Spark on YARN with Hive (Derby server) mode

##### Building the on-yarn-hive image

```shell
docker build -t my/spark-hadoop-hive:3.1.2 -f on-yarn-hive/Dockerfile .
```

##### Starting the on-yarn-hive cluster

###### Manual mode

```shell
# Create the cluster
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark up -d
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh
# Start hive
docker exec -it spark-master-1 sh /opt/start-hive.sh

# Stop the cluster
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark stop
# Remove the cluster
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark down

# Start the cluster
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark start
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh
# Start hive
docker exec -it spark-master-1 sh /opt/start-hive.sh
```

###### Automatic mode

```shell
# Create the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark up -d
# Stop the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark stop
# Start the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark start
# Remove the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark down
```

#### Common Examples

##### Running a script through spark-shell

```shell
spark-shell --master yarn << EOF
// script contents
// example
val data = Array(1,2,3,4,5)
val
distData = sc.parallelize(data)
val sum = distData.reduce((a,b)=>a+b)
println("Sum: "+sum)
EOF
```

##### Submitting a YARN job remotely from Java

* Enter the master container and create the demo table: `hive -e "create table demo(name string)"`
* Create a Maven project and copy core-site.xml, yarn-site.xml, hdfs-site.xml, hive-site.xml, etc. into src/main/resources
* Map local-spark-worker1 and local-spark-master to the IP address of the local virtual network adapter. For example, on Windows you can use the SwitchHosts tool to point those hostnames at the right IP:

```
192.168.138.1 local-spark-worker1
192.168.138.1 local-spark-master
```

Upload the Spark dependency jars:

```shell
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put -f /opt/bitnami/spark/jars/* /spark/jars
```

###### Maven dependencies (partial)

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.junit.jupiter</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>5.9.1</version>
    <scope>test</scope>
</dependency>
```

###### Java code

```java
package org.demo.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.junit.jupiter.api.Test;

public class SparkOnYarnTest {

    @Test
    public void yarnApiSubmit() {
        // prepare arguments to be passed to
        // org.apache.spark.deploy.yarn.Client object
        String[] args = new String[] {
            "--jar", "hdfs:///user/my/spark-sql-cluster.jar",
            "--class", "org.apache.spark.sql.hive.cluster.SparkSqlCliClusterDriver",
            "--arg", "spark-internal",
            "--arg", "-e",
            "--arg", "\\\"insert into demo(name) values('张三')\\\""
        };

        // identify that you will be using Spark as YARN mode
        // System.setProperty("SPARK_YARN_MODE", "true");

        // create an instance of SparkConf object
        String appName = "Yarn Client Remote App";
        SparkConf sparkConf = new SparkConf();
        sparkConf.setMaster("yarn");
        sparkConf.setAppName(appName);
        sparkConf.set("spark.submit.deployMode", "cluster");
        sparkConf.set("spark.yarn.jars", "hdfs:///spark/jars/*.jar");
        sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname", "local-spark-master");
        sparkConf.set("spark.hadoop.yarn.resourcemanager.address", "local-spark-master:8032");
        sparkConf.set("spark.hadoop.yarn.resourcemanager.scheduler.address",
"local-spark-master:8030"); // create ClientArguments, which will be passed to Client ClientArguments cArgs = new ClientArguments(args); // create an instance of yarn Client client Client client = new Client(cArgs, sparkConf, null); // submit Spark job to YARN client.run(); } } ```