Docker multi-container environment with Hadoop, Spark and Hive

This is it: a Docker multi-container environment with Hadoop (HDFS), Spark and Hive, but without the large memory requirements of a Cloudera sandbox. (On my Windows 10 laptop with WSL2 it seems to consume a mere 3 GB.)

The only thing lacking is that the Hive server doesn't start automatically. This will be added once I understand how to do that in docker-compose.

Quick Start

To deploy the HDFS-Spark-Hive cluster, run:

  docker-compose up
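
If you'd rather get your prompt back right away, docker-compose also accepts the standard -d flag to run the containers in the background:

  docker-compose up -d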

docker-compose creates a Docker network whose name can be found by running docker network list, e.g. docker-hadoop-spark-hive_default.

Run docker network inspect on the network (e.g. docker-hadoop-spark-hive_default) to find the IP address the Hadoop interfaces are published on; a filtered example is shown below the list. Access these interfaces with the following URLs:

  • Namenode: http://<dockerhadoop_IP_address>:9870/dfshealth.html#tab-overview
  • History server: http://<dockerhadoop_IP_address>:8188/applicationhistory
  • Datanode: http://<dockerhadoop_IP_address>:9864/
  • Nodemanager: http://<dockerhadoop_IP_address>:8042/node
  • Resource manager: http://<dockerhadoop_IP_address>:8088/
  • Spark master: http://<dockerhadoop_IP_address>:8080/
  • Spark worker: http://<dockerhadoop_IP_address>:8081/
  • Hive: http://<dockerhadoop_IP_address>:10000
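
For example, a quick way to list each container's IP on that network (assuming the default network name docker-hadoop-spark-hive_default) is a formatted inspect; the Go-template fields used below are the standard ones reported by docker network inspect:

  # print "container-name IP/prefix" for every container on the network
  docker network inspect -f \
    '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}' \
    docker-hadoop-spark-hive_default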

Quick Start HDFS

Copy breweries.csv to the namenode.

  docker cp breweries.csv namenode:breweries.csv

Go to the bash shell on the namenode container.

  docker exec -it namenode bash

Create the HDFS directory /data/openbeer/breweries.

  hdfs dfs -mkdir /data
  hdfs dfs -mkdir /data/openbeer
  hdfs dfs -mkdir /data/openbeer/breweries
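
If you prefer, the -p flag creates the whole directory tree in one command:

  hdfs dfs -mkdir -p /data/openbeer/breweries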

Copy breweries.csv to HDFS:

  hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
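
A quick listing confirms the file landed where we expect it:

  hdfs dfs -ls /data/openbeer/breweries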

Quick Start Spark (PySpark)

Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.

Go to the command line of the Spark master and start PySpark.

  docker exec -it spark-master bash

  /spark/bin/pyspark --master spark://spark-master:7077

Load breweries.csv from HDFS.

  brewfile = spark.read.csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")
  
  brewfile.show()
+----+--------------------+-------------+-----+---+
| _c0|                 _c1|          _c2|  _c3|_c4|
+----+--------------------+-------------+-----+---+
|null|                name|         city|state| id|
|   0|  NorthGate Brewing |  Minneapolis|   MN|  0|
|   1|Against the Grain...|   Louisville|   KY|  1|
|   2|Jack's Abby Craft...|   Framingham|   MA|  2|
|   3|Mike Hess Brewing...|    San Diego|   CA|  3|
|   4|Fort Point Beer C...|San Francisco|   CA|  4|
|   5|COAST Brewing Com...|   Charleston|   SC|  5|
|   6|Great Divide Brew...|       Denver|   CO|  6|
|   7|    Tapistry Brewing|     Bridgman|   MI|  7|
|   8|    Big Lake Brewing|      Holland|   MI|  8|
|   9|The Mitten Brewin...| Grand Rapids|   MI|  9|
|  10|      Brewery Vivant| Grand Rapids|   MI| 10|
|  11|    Petoskey Brewing|     Petoskey|   MI| 11|
|  12|  Blackrocks Brewery|    Marquette|   MI| 12|
|  13|Perrin Brewing Co...|Comstock Park|   MI| 13|
|  14|Witch's Hat Brewi...|   South Lyon|   MI| 14|
|  15|Founders Brewing ...| Grand Rapids|   MI| 15|
|  16|   Flat 12 Bierwerks| Indianapolis|   IN| 16|
|  17|Tin Man Brewing C...|   Evansville|   IN| 17|
|  18|Black Acre Brewin...| Indianapolis|   IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
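
Note that the CSV header ended up as a data row and the columns got the generic names _c0 to _c4. If you would rather have Spark use the header row and guess the column types, the usual reader options do the trick (a small variation on the command above):

  brewfile = spark.read \
      .option("header", "true") \
      .option("inferSchema", "true") \
      .csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")

  brewfile.printSchema()
  brewfile.show(5)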

Quick Start Spark (Scala)

Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.

Go to the command line of the Spark master and start spark-shell.

  docker exec -it spark-master bash
  
  /spark/bin/spark-shell --master spark://spark-master:7077

Load breweries.csv from HDFS.

  val df = spark.read.csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")
  
  df.show()
+----+--------------------+-------------+-----+---+
| _c0|                 _c1|          _c2|  _c3|_c4|
+----+--------------------+-------------+-----+---+
|null|                name|         city|state| id|
|   0|  NorthGate Brewing |  Minneapolis|   MN|  0|
|   1|Against the Grain...|   Louisville|   KY|  1|
|   2|Jack's Abby Craft...|   Framingham|   MA|  2|
|   3|Mike Hess Brewing...|    San Diego|   CA|  3|
|   4|Fort Point Beer C...|San Francisco|   CA|  4|
|   5|COAST Brewing Com...|   Charleston|   SC|  5|
|   6|Great Divide Brew...|       Denver|   CO|  6|
|   7|    Tapistry Brewing|     Bridgman|   MI|  7|
|   8|    Big Lake Brewing|      Holland|   MI|  8|
|   9|The Mitten Brewin...| Grand Rapids|   MI|  9|
|  10|      Brewery Vivant| Grand Rapids|   MI| 10|
|  11|    Petoskey Brewing|     Petoskey|   MI| 11|
|  12|  Blackrocks Brewery|    Marquette|   MI| 12|
|  13|Perrin Brewing Co...|Comstock Park|   MI| 13|
|  14|Witch's Hat Brewi...|   South Lyon|   MI| 14|
|  15|Founders Brewing ...| Grand Rapids|   MI| 15|
|  16|   Flat 12 Bierwerks| Indianapolis|   IN| 16|
|  17|Tin Man Brewing C...|   Evansville|   IN| 17|
|  18|Black Acre Brewin...| Indianapolis|   IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
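
From here you can run any transformation you like. For instance, a quick count of the Michigan breweries in this sample, using the $-column syntax that spark-shell imports for you (column _c3 holds the state):

  df.filter($"_c3" === "MI").count()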

How cool is that? Your own Spark cluster to play with.

Quick Start Hive

Go to the command line of the Hive server and start hiveserver2.

  docker exec -it hive-server bash

  hiveserver2
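
hiveserver2 stays in the foreground, so either leave it running and open a second terminal, or push it to the background in the usual shell way, for example:

  nohup hiveserver2 > /tmp/hiveserver2.log 2>&1 &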

Maybe a little check that something is listening on port 10000 now:

  netstat -anp | grep 10000
tcp        0      0 0.0.0.0:10000           0.0.0.0:*               LISTEN      446/java

Okay. Beeline is the command-line interface for Hive. Let's connect to hiveserver2 now.

  beeline
  
  !connect jdbc:hive2://127.0.0.1:10000 scott tiger
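
The same connection can be made in one go with beeline's -u, -n and -p flags:

  beeline -u jdbc:hive2://127.0.0.1:10000 -n scott -p tiger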

Didn't expect to encounter scott/tiger again after my Oracle days. But there you have it. Definitely not a good idea to keep that user in production.

Not a lot of databases here yet.

  show databases;
  
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (0.335 seconds)

Let's change that.

  create database openbeer;
  use openbeer;

And let's create a table.

  CREATE EXTERNAL TABLE IF NOT EXISTS breweries(
      NUM INT,
      NAME CHAR(100),
      CITY CHAR(100),
      STATE CHAR(100),
      ID INT )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/openbeer/breweries';

And have a little select statement going.

  select name from breweries limit 10;
+-------------------------------+
|             name              |
+-------------------------------+
| name                          |
| NorthGate Brewing             |
| Against the Grain Brewery     |
| Jack's Abby Craft Lagers      |
| Mike Hess Brewing Company     |
| Fort Point Beer Company       |
| COAST Brewing Company         |
| Great Divide Brewing Company  |
| Tapistry Brewing              |
| Big Lake Brewing              |
+-------------------------------+
10 rows selected (0.113 seconds)
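
The first row returned is the CSV header ("name"), because the external table reads the raw file as-is. One way to hide it is Hive's skip.header.line.count table property:

  ALTER TABLE breweries SET TBLPROPERTIES ("skip.header.line.count"="1");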

There you go: your private Hive server to play with.

Configure Environment Variables

The configuration parameters can be specified in the hadoop.env file or as environment variables for specific services (e.g. namenode, datanode, etc.):

  CORE_CONF_fs_defaultFS=hdfs://namenode:8020

CORE_CONF corresponds to core-site.xml. fs_defaultFS=hdfs://namenode:8020 will be transformed into:

  <property><name>fs.defaultFS</name><value>hdfs://namenode:8020</value></property>

To define a dash inside a configuration parameter name, use a triple underscore, e.g. YARN_CONF_yarn_log___aggregation___enable=true (yarn-site.xml):

  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
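
So, for example, adding HDFS_CONF_dfs_replication=1 to hadoop.env would end up in /etc/hadoop/hdfs-site.xml as:

  <property><name>dfs.replication</name><value>1</value></property>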

The available configurations are:

  • /etc/hadoop/core-site.xml CORE_CONF
  • /etc/hadoop/hdfs-site.xml HDFS_CONF
  • /etc/hadoop/yarn-site.xml YARN_CONF
  • /etc/hadoop/httpfs-site.xml HTTPFS_CONF
  • /etc/hadoop/kms-site.xml KMS_CONF
  • /etc/hadoop/mapred-site.xml MAPRED_CONF

If you need to extend some other configuration file, refer to the base/entrypoint.sh bash script.
