# apache-impala-cluster-docker

**Repository Path**: geshihui/apache-impala-cluster-docker

## Basic Information

- **Project Name**: apache-impala-cluster-docker
- **Description**: Complete Apache Impala 3.4.0 cluster
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-08-12
- **Last Updated**: 2021-08-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

To deploy Apache Impala, you need these components: Apache ZooKeeper, PostgreSQL, Apache Hadoop and the Apache Hive Metastore.

Apache Impala was compiled from this source:

https://downloads.apache.org/impala/3.4.0/apache-impala-3.4.0.tar.gz

The Metastore and Hadoop were taken from the Apache downloads as well:

https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz

https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

* HiveServer2 is not working (possibly a bug): it does not bind port 10000. However, Impala needs the Metastore, not HiveServer2, to work.

If you wish to use your own XML conf directory, you can mount it this way:

    -v /your/conf:/opt/hadoop/conf
    -v /your/conf:/opt/hive/conf
    -v /your/conf:/opt/impala/conf

To deploy the cluster, follow the steps below.

How to run Apache Impala 3.4.0 cluster
==============================

This implementation was tested on Google Cloud Platform. Four instances are needed, running an Ubuntu/Debian distro. It can run on CentOS as well (mind the package installation).

Make sure SELinux is disabled and that the instances/servers can resolve each other by name (e.g. a datanode can ping the namenode by its name):

- 1 namenode instance. It should be called "namenode" (because of the Hadoop XML built-in files). It runs Zookeeper, Postgres, Namenode, Metastore, Statestored and Catalogd (Impala).
- 3 datanode instances, each with 5 formatted disks (mounted on /data0 .. /data4). They run Datanode and Impalad (Impala server and shell).
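The prerequisites above (name resolution between instances and the five mounted disks on each datanode) can be checked before any container is started. The following is only a minimal sketch: the function names `resolves` and `missing_mounts` are hypothetical and not part of this repository, and it assumes `getent` and `mountpoint` are installed, as they usually are on Ubuntu/Debian.

```shell
#!/usr/bin/env bash
# Preflight sketch (hypothetical helpers, not shipped with this repository).

# Succeeds if the given host name resolves via /etc/hosts or DNS.
resolves() {
  getent hosts "$1" > /dev/null
}

# Prints every argument that is not currently a mount point, one per line.
missing_mounts() {
  local dir
  for dir in "$@"; do
    mountpoint -q "$dir" 2> /dev/null || echo "$dir"
  done
}

# Example usage on a datanode:
#   resolves namenode || echo "namenode does not resolve"
#   missing_mounts /data0 /data1 /data2 /data3 /data4
```

If `missing_mounts` prints nothing, all listed directories are mounted; any path it prints still needs its disk mounted before the datanode container is started.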
* There is a "systemd" directory. It contains the data0.mount .. data4.mount units that mount the block devices.

# Run on namenode instance:

    # ens4 is the interface name used on the tested instances; adjust it if yours differs
    ETH0=$(ip addr show ens4 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
    apt update
    apt -y install docker.io

#### ZOOKEEPER

    docker run --net=host --name zookeeper --restart always -d zookeeper

#### POSTGRES

    docker run --name postgresql --net=host \
      -e POSTGRES_PASSWORD=mypassword -e POSTGRES_USER=hiveuser -e POSTGRES_DB=metastore \
      --restart always \
      -v /var/lib/postgresql/data:/var/lib/postgresql/data \
      -d postgres:latest

#### NAMENODE

    docker run --net=host --name namenode \
      -v /var/lib/hadoop-hdfs/cache/hdfs/dfs/name:/var/lib/hadoop-hdfs/cache/hdfs/dfs/name \
      --restart always \
      -d yarivgraf/apache-hadoop-3.2.2:latest /run-namenode.sh
    docker exec namenode /provision-namenode.sh

#### METASTORE

    # One-off container that initializes the Metastore schema in PostgreSQL
    docker run --net=host --name provision \
      -d yarivgraf/apache-metastore-3.1.2:latest \
      /opt/hive/bin/schematool -initSchema -dbType postgres -userName hiveuser -passWord 'mypassword'
    docker run --net=host --name metastore --restart always \
      -d yarivgraf/apache-metastore-3.1.2:latest /run-metastore.sh

#### STATESTORED

    docker run --net=host --name statestored -e IP=$ETH0 \
      -v /opt/impala/logs:/opt/impala/logs \
      --restart always \
      -d yarivgraf/apache-impala-3.4.0:latest /entrypoint_statestored.sh

#### CATALOGD

    docker run --net=host --name catalogd -e IP=$ETH0 \
      -v /opt/impala/logs:/opt/impala/logs \
      --restart always \
      -d yarivgraf/apache-impala-3.4.0:latest /entrypoint_catalogd.sh

# Run on datanode instances (at least 3)

    apt update
    apt -y install docker.io

    # Internal namenode IP
    NAMENODE_IP=$(host namenode | awk '{ print $4 }')

#### DATANODE

    docker run --net=host --name datanode --restart always \
      -v /var/run/hadoop-hdfs:/var/run/hadoop-hdfs \
      -v /data0:/data0 -v /data1:/data1 -v /data2:/data2 -v /data3:/data3 -v /data4:/data4 \
      -d yarivgraf/apache-hadoop-3.2.2:latest /run-datanode.sh

#### IMPALAD

    docker run --net=host --name impalad -e IP=$NAMENODE_IP --restart always \
      -v /var/run/hadoop-hdfs:/var/run/hadoop-hdfs \
      -v /opt/impala/logs:/opt/impala/logs \
      -d yarivgraf/apache-impala-3.4.0:latest /entrypoint_impalad.sh

You can access the Hadoop namenode at:

http://external_namenode_ip:50070

To get the Impala shell (on datanodes):

    $ docker exec -it impalad impala-shell.sh

* You can also add Hue, but the hue.ini file should be external and mounted into the Docker container. Here is an example:
* Hue should be started after the whole Impala cluster is set up. Hue should run on the "namenode" instance.
* The attached hue.ini (hue/conf/hue.ini) works with Impala. The conf looks for the "worker1" instance (impalad). You can change the hue.ini file to look for another impalad instance name.
* Create a "hue" database (after running the postgresql container):

      $ docker exec postgresql psql -U hiveuser -d metastore -c "create database hue;"

* Run Apache Hue:

      $ docker run --net=host --name hue -v ~/hue.ini:/usr/share/hue/desktop/conf/hue.ini --restart always -d gethue/hue:latest

* You can access Hue at: http://external_namenode_ip:8888

## All Docker images can be found here:

https://hub.docker.com/u/yarivgraf

#### Due to the size of the Impala binaries, they are excluded from this repository.
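The namenode steps above are run back to back, so a later step may start before the service it depends on is ready (for example, the schematool provisioning container starting while PostgreSQL is still initializing). One possible refinement is a small retry helper. This is a sketch under stated assumptions: the `retry` function is hypothetical and not part of this repository, and the usage example assumes the `pg_isready` client tool is available inside the `postgres` image.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY_SECONDS between failed attempts.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
retry() {
  local attempts=$1
  local delay=$2
  shift 2
  local n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    # Only sleep if another attempt will follow.
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example usage on the namenode instance, before the schematool step
# (container, user and database names are taken from the commands above):
#   retry 30 2 docker exec postgresql pg_isready -h 127.0.0.1 -U hiveuser -d metastore
```

The same helper could wrap any other readiness check, such as waiting for the metastore container before starting statestored and catalogd; which checks are appropriate depends on how the images are configured.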