# PMem Shuffle for Apache Spark Guide

[1. PMem Shuffle introduction](#pmem-shuffle-introduction)

[2. Recommended HW environment](#recommended-hw-environment)

[3. Install and configure PMem](#install-and-configure-pmem)

[4. Configure and Validate RDMA](#configure-and-validate-rdma)

[5. Install dependencies for PMem Shuffle](#install-dependencies-for-pmem-shuffle)

[6. Install PMem Shuffle for Spark](#install-pmem-shuffle-for-spark)

[7. PMem Shuffle for Spark Testing](#pmem-shuffle-for-spark-testing)

[8. Trouble Shooting](#trouble-shooting)

[Reference](#reference)

PMem Shuffle for Spark (previously Spark-PMoF) depends on multiple native libraries such as libfabric, libcuckoo, and PMDK. This enabling guide covers the installation process as it stands today, but the install commands and dependency packages for the third-party libraries may vary depending on the OS version and distribution you are using. YARN, HDFS, and Spark installation and configuration are out of the scope of this document.

## Online Documentation

You can find all the PMem Shuffle documents on the [project web page](https://oap-project.github.io/pmem-shuffle/).

## 1. PMem Shuffle introduction

Intel Optane DC persistent memory is next-generation storage at memory speed. It closes the performance gap between DRAM and traditional NAND SSDs. Remote persistent memory extends PMem usage to new scenarios and enables new use cases and value propositions.

Spark shuffle is a high-cost operation: it issues a large number of small random disk IOs, serialization, and network data transfers, contributes significantly to job latency, and can become the bottleneck for workload performance. PMem Shuffle for Spark (previously Spark-PMoF) is a Persistent Memory over Fabrics (PMoF) plugin for Spark shuffle which leverages the RDMA network and remote persistent memory (for read) to provide an extremely high-performance, low-latency shuffle solution for Spark and to address performance issues of shuffle-intensive workloads.

PMem Shuffle brings the following benefits:

- Leverages high-performance persistent memory as shuffle media as well as spill media, increasing shuffle performance and reducing memory footprint.
- Uses PMDK libraries to avoid inefficient context switches and memory copies, with zero-copy remote access to persistent memory.
- Leverages RDMA for network offloading.

Figure 1 shows the high-level architecture of PMem Shuffle and how data flows between Spark and the shuffle devices in PMem Shuffle and in vanilla Spark. In this guide, we introduce how to deploy and use PMem Shuffle for Spark.

![architecture](./docs/image/RPMem_shuffle_architecture.png)

Figure 1: PMem Shuffle for Spark

## 2. Recommended HW environment
### 2.1 System Configuration

#### 2.1.1 HW and SW Configuration

A 3- or 4-node cluster is recommended for proof-of-concept tests, depending on your system configuration. If a 3-node cluster is used, the NameNode and Spark master node can be co-located with one of the Hadoop data nodes.

**Hardware:**

- Intel® Xeon™ Gold 6240 CPU @ 2.60GHz, 384GB memory (12x 32GB 2666 MT/s) or 192GB memory (12x 16GB 2666 MT/s)
- An RDMA-capable NIC, 40Gb+ preferred, e.g., 1x Intel X722 NIC or Mellanox ConnectX-4 40Gb NIC
- RDMA cables:
  - Mellanox MCP1600-C003 100GbE 3m 28AWG
- Shuffle devices:
  - 1x 1TB HDD for shuffle (baseline)
  - 4x 128GB persistent memory for shuffle
  - 4x 1TB NVMe for HDFS

**Switch:**

- An Arista 7060 CX2 (7060CX2-32S-F) 100Gb switch was used
- Please refer to section 4.1.2 for the switch configuration

**Software:**

- Hadoop 2.7
- Spark 3.0.0
- Fedora 29 with ww08.2019 BKC

### 2.2 Recommended RDMA NIC

PMem Shuffle uses [HPNL](https://github.com/Intel-bigdata/HPNL) for network communication, which leverages libfabric for efficient networking, so an RDMA-capable NIC is recommended. Libfabric supports the RoCE, iWARP, and IB protocols, so various RNICs with different protocols can be used.

### 2.3 Recommended PMem configuration

It is recommended to install 4 or more PMem DIMMs on the SUT, but you can adjust the number accordingly. In this enabling guide, 4x 128GB PMem modules were installed on the SUT as an example.

### 2.4 Recommended PMem BKC (optional)

This development guide is based on the ww08.2019 BKC (best known configuration). Please contact your HW vendor for the latest BKC, and refer to the reference guides at the end of this document if you do not have BKC access. BKC installation/enabling and FW installation are out of the scope of this guide.

## 3. Install and configure PMem (example)

1) Install *ipmctl* and *ndctl* according to your OS version.
2) Run *ipmctl show -dimm* to check whether the DIMMs can be recognized.
3) Run *ipmctl create -goal PersistentMemoryType=AppDirect* to create AppDirect mode.
4) Run *ndctl list -R*; you will see **region0** and **region1**.
5) Assume you have 4x PMem modules installed on one node.

   a. Run *ndctl create-namespace -m devdax -r region0 -s 120g*
   b. Run *ndctl create-namespace -m devdax -r region0 -s 120g*
   c. Run *ndctl create-namespace -m devdax -r region1 -s 120g*
   d. Run *ndctl create-namespace -m devdax -r region1 -s 120g*

   This creates four namespaces, /dev/dax0.0, /dev/dax0.1, /dev/dax1.0 and /dev/dax1.1, on that node; they will be used as the PMem Shuffle media. You can change your configuration (namespace number, size) accordingly.

6) Step 5 is required only when running this solution over RDMA. Otherwise, PMem can be initialized in fsdax mode.

   a. Run *ndctl create-namespace -m fsdax -r region0 -s 120g*
   b. Run *ndctl create-namespace -m fsdax -r region0 -s 120g*
   c. Run *ndctl create-namespace -m fsdax -r region1 -s 120g*
   d. Run *ndctl create-namespace -m fsdax -r region1 -s 120g*

   Four namespaces, /dev/pmem0, /dev/pmem0.1, /dev/pmem1 and /dev/pmem1.1, are created. Note that the namespace names might vary if other namespaces already exist; in general they follow the pattern /dev/pmem*. After a namespace is created in fsdax mode, it is ready for a file system. The Ext4 file system is used here.

   e. mkfs.ext4 /dev/pmem0
   f. mkfs.ext4 /dev/pmem0.1
   g. mkfs.ext4 /dev/pmem1
   h. mkfs.ext4 /dev/pmem1.1

   Create directories and mount the file systems to them. To get DAX functionality, mount the file systems with the dax option.

   i. mkdir /mnt/pmem0 && mount -o dax /dev/pmem0 /mnt/pmem0
   j. mkdir /mnt/pmem0.1 && mount -o dax /dev/pmem0.1 /mnt/pmem0.1
   k. mkdir /mnt/pmem1 && mount -o dax /dev/pmem1 /mnt/pmem1
   l. mkdir /mnt/pmem1.1 && mount -o dax /dev/pmem1.1 /mnt/pmem1.1

   Change the configuration accordingly.
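These mounts do not persist across reboots. As a minimal sketch (assuming the Ext4 file systems and mount points created above; adjust device names and paths to your own configuration), you might add /etc/fstab entries like the following so the DAX mounts are restored automatically:

```bash
# Hypothetical /etc/fstab entries for the fsdax namespaces created above.
/dev/pmem0    /mnt/pmem0    ext4  dax  0 0
/dev/pmem0.1  /mnt/pmem0.1  ext4  dax  0 0
/dev/pmem1    /mnt/pmem1    ext4  dax  0 0
/dev/pmem1.1  /mnt/pmem1.1  ext4  dax  0 0
```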
## 4. Configure and Validate RDMA

**Note:** This part is vendor-specific and might NOT apply to your environment; please check your switch and NIC manuals accordingly.

### 4.1 Configure and test iWARP RDMA

#### 4.1.1 Download rdma-core and install dependencies

The rdma-core package provides the **userspace libraries** needed to test RDMA connectivity with tools such as rping. Refer to the latest rdma-core documentation for updated installation guidelines (https://github.com/linux-rdma/rdma-core.git). You might also refer to HW-specific instructions or guides to enable your RDMA NICs. Taking Mellanox as an example, perform the steps below to enable it:

```bash
git clone https://github.com/linux-rdma/rdma-core.git
cd rdma-core
# On Fedora (change dnf to yum on CentOS):
dnf install cmake gcc libnl3-devel libudev-devel pkgconfig valgrind-devel ninja-build python3-devel python3-Cython python3-docutils pandoc
bash build.sh
# On CentOS 7:
yum install cmake gcc libnl3-devel libudev-devel make pkgconfig valgrind-devel
yum install epel-release
yum install cmake3 ninja-build pandoc
```

#### 4.1.2 Switch Configuration (optional)

This part is HW-specific, **please check your switch manual accordingly.** Connect the console port to a PC. The username is admin, with no password. Enter global configuration mode. The example below is based on an Arista 7060 CX2 100Gb switch and configures the 100Gb port to work at 40Gb to match the NIC speed. *It is NOT required if your NIC and cables already match.*

**Configure the switch speed to 40Gb/s**

```
switch# enable
switch# config
switch(config)# show interface status
```

**Configure the corresponding port to 40Gb/s to match the NIC speed**

```
switch(config)# interface Et(num_of_port)/1
switch(config)# speed forced 40gfull
```

RoCE might have performance issues without flow control, so PFC configuration is strongly suggested. You will need to check the RDMA NIC driver manual and switch manual to configure PFC. The example below is for ConnectX-4 NICs and an Arista 7060-CX2 switch; it puts the two connected ports in the same VLAN, configures them in trunk mode, and enables PFC.

**Configure the interfaces as trunk mode, add them to the VLAN and enable PFC**

```bash
switch(config)# vlan 1
switch(config-vlan-1)#
switch(config)# interface ethernet 12-16
switch(config-if-Et12-16)# switchport trunk allowed vlan 1
switch(config-if-et1)# priority-flow-control on
switch(config-if-et1)# priority-flow-control priority 3 no-drop
```

#### 4.1.3 Download and install drivers

#### A. Example: Mellanox Enabling RoCE V2 RDMA (Optional)

There are many dependency packages that need to be installed; please refer to your RDMA NIC's manuals to install them correctly.

```bash
yum install atk gcc-gfortran tcsh gtk2 tcl tk
```

Please install the NIC drivers accordingly.

```bash
# Download MLNX_OFED_LINUX-4.7-3.2.9.0-* from https://community.mellanox.com/s/article/howto-install-mlnx-ofed-driver
# e.g., wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-/MLNX_OFED_LINUX---.tgz
tar zxf MLNX_OFED_LINUX-4.7-3.2.9.0-*
cd MLNX_OFED_LINUX-4.7-3.2.9.0-
./mlnxofedinstall --add-kernel-support
# The process might interrupt and prompt you to install dependencies. Install them and try again.
# This process will take some time.
```
Restart the driver:

```bash
/etc/init.d/openibd restart
```

You might need to unload the modules if they are in use. Make sure that the `link_layer` field is "Ethernet", then use the `rdma link` command below to get the device name.

#### B. Enable PFC (Priority Flow Control) to guarantee stable performance (optional)

If you are using a Mellanox NIC, PFC is a must to guarantee stable performance. Fetch the RDMA info with the rdma command:

```bash
rdma link
0/1: i40iw0/1: state DOWN physical_state NOP
1/1: i40iw1/1: state ACTIVE physical_state NOP
2/1: mlx5_0/1: state DOWN physical_state DISABLED netdev ens803f0
3/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev ens803f1

lspci | grep Mellanox
86:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
86:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
```

Set PFC:

```bash
/etc/init.d/openibd restart
mlnx_qos -i ens803f1 --pfc 0,0,0,1,0,0,0,0
modprobe 8021q
vconfig add ens803f1 100
ifconfig ens803f1.100 $ip1/$mask up   # change to your own IP
ifconfig ens803f1 $ip2/$mask up       # change to your own IP
for i in {0..7}; do vconfig set_egress_map ens803f1.100 $i 3 ; done
tc_wrap.py -i ens803f1 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
```

Modify the IP address part based on your environment and execute the script.

#### 4.1.4 Check RDMA modules

Make sure the following modules are loaded:

```bash
modprobe -a ib_core i40iw iw_cm rdma_cm rdma_ucm ib_cm ib_uverbs
```

#### 4.1.5 Validate RDMA functionality

Check that your RDMA interfaces are listed on each server when you run the following command: **ibv_devices**

Check RDMA connectivity between the target interface and the client interface with rping:

1) Assign IPs to the RDMA interfaces on the target and the client.
2) On the target run: `rping -sdVa <Target IP>`
3) On the client run: `rping -cdVa <Target IP>`

Example on the server side:

```bash
rping -sda $ip1
created cm_id 0x17766d0
rdma_bind_addr successful
rdma_listen
accepting client connection request
cq_thread started.
recv completion
Received rkey 97a4f addr 17ce190 len 64 from peer
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x7fe9ec000c90 (child)
ESTABLISHED
Received rkey 96b40 addr 17ce1e0 len 64 from peer
server received sink adv
rdma write from lkey 143c0 laddr 1771190 len 64
rdma write completion

rping -sda $ip2
created cm_id 0x17766d0
rdma_bind_addr successful
rdma_listen
…
accepting client connection request
cq_thread started.
recv completion
Received rkey 97a4f addr 17ce190 len 64 from peer
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x7fe9ec000c90 (child)
ESTABLISHED
…
Received rkey 96b40 addr 17ce1e0 len 64 from peer
server received sink adv
rdma write from lkey 143c0 laddr 1771190 len 64
rdma write completion
…
```

On the client side, run `rping -cdVa <Target IP>`:

```bash
# On the client side, use the .100 IP, e.g. 172.168.0.209
rping -c -a 172.168.0.209 -v -C 4
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
```

Please refer to your NIC manual for detailed instructions on how to validate that RDMA works.

## 5. Install dependencies for PMem Shuffle

We provide a Conda package which automatically installs the dependencies needed by PMem Shuffle; refer to the [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information. If you have finished the [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md), you can find the compiled OAP jars in `$HOME/miniconda2/envs/oapenv/oap_jars/`; skip this section and jump to [6. Install PMem Shuffle for Spark](#install-pmem-shuffle-for-spark).

### 5.1 Install HPNL

HPNL is a fast, CPU-efficient network library designed for modern network technology. HPNL depends on libfabric, which is protocol-independent and supports TCP/IP, RoCE, IB, iWARP, etc. Please make sure libfabric is installed in your setup. Based on this [issue](https://github.com/ofiwg/libfabric/issues/5548), please make sure NOT to install libfabric 1.9.0. You might need to install automake/libtool first to resolve dependency issues.

```bash
git clone https://github.com/ofiwg/libfabric.git
cd libfabric
git checkout v1.6.0
./autogen.sh
./configure --disable-sockets --enable-verbs --disable-mlx
make -j && sudo make install
```

#### 5.1.1 Build and install HPNL

Assume *project_root_path* is the HPNL folder's path, *HPNL* here.

```bash
# Ubuntu
sudo apt-get install cmake libboost-dev libboost-system-dev
# Fedora
dnf install cmake boost-devel boost-system

git clone https://github.com/Intel-bigdata/HPNL.git
cd HPNL
git checkout origin/spark-pmof-test --track
git submodule update --init --recursive
mkdir build; cd build
cmake -DWITH_VERBS=ON ..
make -j && make install
cd ${project_root_path}/java/hpnl
mvn install
```

### 5.2 Install basic C library dependencies

```bash
yum install -y autoconf asciidoctor kmod-devel.x86_64 libudev-devel libuuid-devel json-c-devel jemalloc-devel
yum groupinstall -y "Development Tools"
```

### 5.3 Install ndctl

ndctl can also be installed with your OS package management tool, or built from source as shown below.
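On RHEL/CentOS-style distributions, a minimal package-manager sketch (exact package names and available versions vary by distribution and release):

```bash
# Hedged example: install ndctl and related tooling from distribution packages
# instead of building from source. Adjust package names for your OS.
yum install -y ndctl ndctl-devel daxctl
```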
```bash
git clone https://github.com/pmem/ndctl.git
cd ndctl
git checkout v63
./autogen.sh
./configure CFLAGS='-g -O2' --prefix=/usr --sysconfdir=/etc --libdir=/usr/lib64
make -j
make check
make install
```

### 5.4 Install PMDK

```bash
yum install -y pandoc
git clone https://github.com/pmem/pmdk.git
cd pmdk
git checkout tags/1.8
make -j && make install
export PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig/:$PKG_CONFIG_PATH
echo "export PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig/:$PKG_CONFIG_PATH" > /etc/profile.d/pmdk.sh
```

### 5.5 Install the RPMem extension

```bash
git clone https://github.com/efficient/libcuckoo
cd libcuckoo
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local -DBUILD_EXAMPLES=1 -DBUILD_TESTS=1 ..
make all && make install

# The branch name was not specified in the original guide; set ${oap_branch} accordingly.
git clone -b ${oap_branch} https://github.com/Intel-bigdata/OAP.git
cd OAP/oap-shuffle/RPMem-shuffle
mvn install -DskipTests
```

## 6. Install PMem Shuffle for Spark

### 6.1 Configure the RPMem extension for Spark shuffle

PMem Shuffle is designed as a plugin to Spark. Currently the plugin supports Spark 3.0.0 and works on various network fabrics, including sockets, RDMA and Omni-Path. Several configuration files need to be modified in order to run PMem Shuffle.

#### Prerequisite

Use the command below to remove the existing initialization of each PMem device. This is a **MUST** step, or RPMemShuffle won't be able to open the PMem devices.

```bash
pmempool rm ${device_name}
# example: pmempool rm /dev/dax0.0
```

If you installed the [OAP Conda package](https://anaconda.org/intel/oap), you can use the command below instead to remove the existing initialization of a PMem device.

```bash
export LD_LIBRARY_PATH=$HOME/miniconda2/envs/oapenv/lib/:$LD_LIBRARY_PATH
$HOME/miniconda2/envs/oapenv/bin/pmempool rm ${device_name}
```

**Refer to the Reference section for a detailed description of each parameter.**

#### Enable RPMemShuffle

```bash
spark.shuffle.manager org.apache.spark.shuffle.pmof.PmofShuffleManager
spark.driver.extraClassPath /$path/oap-shuffle/RPMem-shuffle/core/target/oap-rpmem-shuffle-java--jar-with-dependencies.jar
spark.executor.extraClassPath /$path/oap-shuffle/RPMem-shuffle/core/target/oap-rpmem-shuffle-java--jar-with-dependencies.jar
```

#### Switch On/Off PMem and RDMA

```bash
spark.shuffle.pmof.enable_rdma true
spark.shuffle.pmof.enable_pmem true
```

#### Add PMem information to the Spark config

Explanation:

- spark.shuffle.pmof.pmem_capacity: the capacity of one PMem device; this value is used when registering the PMem device with RDMA.
- spark.shuffle.pmof.pmem_list: a list of all local PMem devices. Make sure the executor number per physical node does not exceed the PMem device number, or one PMem device may be opened by two Spark executor processes, which leads to a PMem open failure.
- spark.shuffle.pmof.dev_core_set: a mapping of which core range is task-set to which PMem device; this is a performance optimization for better PMem NUMA access.
- spark.io.compression.codec: use "snappy" for shuffle and spill data compression; this is a **MUST** when PMem is enabled, due to a default LZ4 ByteBuffer incompatibility issue.
```bash
spark.shuffle.pmof.pmem_capacity ${total_size_of_one_device}
spark.shuffle.pmof.pmem_list ${device_name},${device_name},…
spark.shuffle.pmof.dev_core_set ${device_name}:${core_range};…
# example: /dev/dax0.0:0-17,36-53;/dev/dax0.2:0-17,36-53
spark.io.compression.codec snappy
```

#### Memory configuration suggestion

This suggestion applies to releases before OAP 0.8. In OAP 0.8 and later, the memory footprint of each core is reduced dramatically and the formula below no longer applies.

`spark.executor.memory` must be greater than shuffle_block_size \* numPartitions \* numCores \* 2 (to cover both shuffle and external sort). For example, the default HiBench TeraSort numPartitions is 200, and with 10 cores per executor the executor must have memory greater than 2MB (spark.shuffle.pmof.shuffle_block_size) \* 200 \* 10 \* 2 = 8GB.

A recommended configuration is below; adjust it based on your system configuration.

```bash
yarn.executor.num 4                             # same as the PMem namespace number
yarn.executor.cores 18                          # total core number divided by executor number
spark.executor.memory 15g                       # 2MB * numPartitions(200) * 18 * 2
spark.yarn.executor.memoryOverhead 5g           # ~30% of spark.executor.memory
spark.shuffle.pmof.shuffle_block_size 2096128   # 2MB - 1024 bytes
spark.shuffle.pmof.spill_throttle 2096128       # 2MB - 1024 bytes; spill_throttle sets the threshold at which
                                                # buffered data is spilled to persistent memory and must be
                                                # equal to shuffle_block_size
spark.driver.memory 10g
spark.yarn.driver.memoryOverhead 5g
```

**Configuration for the RDMA-enabled case**

- spark.shuffle.pmof.node: Spark node and RDMA IP mapping list
- spark.driver.rhost / spark.driver.rport: the Spark driver RDMA IP and port

```bash
spark.shuffle.pmof.server_buffer_nums 64
spark.shuffle.pmof.client_buffer_nums 64
spark.shuffle.pmof.map_serializer_buffer_size 262144
spark.shuffle.pmof.reduce_serializer_buffer_size 262144
spark.shuffle.pmof.chunk_size 262144
spark.shuffle.pmof.server_pool_size 3
spark.shuffle.pmof.client_pool_size 3
spark.shuffle.pmof.node $HOST1-$IP1,$HOST2-$IP2   # host-IP pairs, $hostname-$ip
spark.driver.rhost $IP                            # change to your driver host IP
spark.driver.rport 61000
```

**FSDAX**

Use `spark.shuffle.pmof.pmpool_size` to specify the size of the created shuffle pool file. The size should obey the rule `max(spark.shuffle.pmof.pmpool_size) < size_of_namespace * 0.9`, because some space in the fsdax namespace must be reserved for metadata.

**Misc**

The config `spark.sql.shuffle.partitions` must be set explicitly; it is suggested to keep the default value `200` unless you are sure what this value means.
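A minimal sketch that applies the two sizing rules above with shell arithmetic, using hypothetical example values (the variable names are illustrative only, not configuration keys):

```bash
#!/usr/bin/env bash
# Hedged sizing sketch: executor-memory rule and fsdax pmpool_size rule.
# All values are example assumptions; substitute your own.
shuffle_block_size=2096128          # spark.shuffle.pmof.shuffle_block_size, in bytes
num_partitions=200                  # spark.sql.shuffle.partitions
cores_per_executor=10               # spark.executor.cores
namespace_bytes=$((120 * 1024**3))  # size of one fsdax namespace (120GB example)

# Rule 1: spark.executor.memory > shuffle_block_size * numPartitions * numCores * 2
min_executor_mem=$((shuffle_block_size * num_partitions * cores_per_executor * 2))
echo "spark.executor.memory should exceed ${min_executor_mem} bytes (~$(((min_executor_mem + 1024**3 - 1) / 1024**3)) GB)"

# Rule 2: spark.shuffle.pmof.pmpool_size < size_of_namespace * 0.9
max_pmpool_size=$((namespace_bytes * 9 / 10))
echo "spark.shuffle.pmof.pmpool_size should stay below ${max_pmpool_size} bytes"
```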
## 7. PMem Shuffle for Spark Testing

The PMem Shuffle extension has been tested and validated with TeraSort and decision support workloads.

### 7.1 Decision support workloads

The decision support workload is a benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance.

#### 7.1.1 Download spark-sql-perf

Download [spark-sql-perf](https://github.com/databricks/spark-sql-perf) and follow its README to build the artifact with sbt.

#### 7.1.2 Download the kit

As per the instructions in the spark-sql-perf README, tpcds-kit is required; please download it from https://github.com/databricks/tpcds-kit and follow its README to set up the benchmark.

#### 7.1.3 Prepare data

As an example, generate parquet-format data to HDFS at 1TB data scale. The data storage path, data format and data scale are configurable. Check the script below as a sample.

```scala
import com.databricks.spark.sql.perf.tpcds.TPCDSTables
import org.apache.spark.sql._

// Set:
val rootDir: String = "hdfs://${ip}:9000/tpcds_1T" // root directory of location to create data in.
val databaseName: String = "tpcds_1T"              // name of database to create.
val scaleFactor: String = "1024"                   // scaleFactor defines the size of the dataset to generate (in GB).
val format: String = "parquet"                     // valid Spark format, e.g. "parquet".
val sqlContext = new SQLContext(sc)

// Run:
val tables = new TPCDSTables(sqlContext,
  dsdgenDir = "/mnt/spark-pmof/tool/tpcds-kit/tools", // location of dsdgen
  scaleFactor = scaleFactor,
  useDoubleForDecimal = false,                        // true to replace DecimalType with DoubleType
  useStringForDate = false)                           // true to replace DateType with StringType

tables.genData(
  location = rootDir,
  format = format,
  overwrite = true,                     // overwrite the data that is already there
  partitionTables = true,               // create the partitioned fact tables
  clusterByPartitionColumns = true,     // shuffle to get partitions coalesced into single files.
  filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
  tableFilter = "",                     // "" means generate all tables
  numPartitions = 400)                  // how many dsdgen partitions to run - number of input tasks.

// Create the specified database
sql(s"create database $databaseName")

// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)
```

#### 7.1.4 Run the benchmark

Launch decision support queries on the generated data; check *benchmark.scala* below as a sample. It runs query 64.

```scala
import com.databricks.spark.sql.perf.tpcds.TPCDS
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val tpcds = new TPCDS(sqlContext = sqlContext)

// Set:
val databaseName = "tpcds_1T"          // name of database with TPCDS data.
val resultLocation = "tpcds_1T_result" // place to write results
val iterations = 1                     // how many iterations of queries to run.
val query_filter = Seq("q64-v2.4")
val randomizeQueries = false

def queries = {
  val filtered_queries = query_filter match {
    case Seq() => tpcds.tpcds2_4Queries
    case _ => tpcds.tpcds2_4Queries.filter(q => query_filter.contains(q.name))
  }
  filtered_queries
}
val timeout = 24 * 60 * 60 // timeout, in seconds.

// Run:
sql(s"use $databaseName")
val experiment = tpcds.runExperiment(
  queries,
  iterations = iterations,
  resultLocation = resultLocation,
  forkThread = true)
experiment.waitForFinish(timeout)
```

#### 7.1.5 Check the result

Check the result under the *tpcds_1T_result* folder. Optionally, check the result in the Spark history server (start it with *\$SPARK_HOME/sbin/start-history-server.sh*).

### 7.2 TeraSort

TeraSort is a benchmark that measures the time taken to sort one terabyte of randomly distributed data on a given computer system.

#### 7.2.1 Download HiBench

This guide uses [HiBench](https://github.com/Intel-bigdata/HiBench) for the TeraSort tests. HiBench is a big data benchmark suite that contains a set of Hadoop, Spark and streaming workloads, including TeraSort.

#### 7.2.2 Build HiBench

Build HiBench as per the instructions in [build-hibench](https://github.com/Intel-bigdata/HiBench/blob/master/docs/build-hibench.md).

#### 7.2.3 Configuration

Modify *\$HiBench-HOME/conf/spark.conf* to specify the Spark home and other Spark configurations. It will override the configuration in *\$SPARK_HOME/conf/spark-defaults.conf* at run time.
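A minimal sketch of what such a *spark.conf* might contain for this setup. The `hibench.*` property names below follow HiBench's spark.conf template as an assumption; the PMem Shuffle settings are passed through as regular `spark.*` properties (reuse the values from section 6 and the Reference section):

```bash
# Hedged example: $HiBench-HOME/conf/spark.conf (adjust paths and sizes to your cluster)
hibench.spark.home     /opt/spark   # assumed Spark installation path
hibench.spark.master   yarn         # run the workloads on YARN

# Spark properties passed to the benchmark jobs
spark.executor.memory  15g
spark.driver.memory    10g
spark.shuffle.manager  org.apache.spark.shuffle.pmof.PmofShuffleManager
spark.shuffle.pmof.enable_pmem  true
```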
#### 7.2.4 Launch the benchmark

Prepare the data with *\$HiBench-HOME/bin/workloads/micro/terasort/prepare/prepare.sh*, then kick off the evaluation with *\$HiBench-HOME/bin/workloads/micro/terasort/spark/run.sh*: change directory to *\$HiBench-HOME/bin/workloads/micro/terasort/spark* and launch *run.sh*. You can add some PMem cleaning work to make sure every test iteration starts from empty shuffle devices. Take the *run.sh* below as a sample.

```bash
# *** Change the command below accordingly ***
ssh ${node} pmempool rm /dev/dax0.0

current_dir=`dirname "$0"`
current_dir=`cd "$current_dir"; pwd`
root_dir=${current_dir}/../../../../..
workload_config=${root_dir}/conf/workloads/micro/terasort.conf
. "${root_dir}/bin/functions/load_bench_config.sh"

enter_bench ScalaSparkTerasort ${workload_config} ${current_dir}
show_bannar start

rmr_hdfs $OUTPUT_HDFS || true

SIZE=`dir_size $INPUT_HDFS`
START_TIME=`timestamp`
run_spark_job com.intel.hibench.sparkbench.micro.ScalaTeraSort $INPUT_HDFS $OUTPUT_HDFS
END_TIME=`timestamp`

gen_report ${START_TIME} ${END_TIME} ${SIZE}
show_bannar finish
leave_bench
```

#### 7.2.5 Check the result

Check the result in the Spark history server to see the execution time and other Spark metrics such as shuffle spill status (start the history server with *\$SPARK_HOME/sbin/start-history-server.sh*).

## Reference

### RPMemShuffle Spark configuration

Before running a Spark workload, add the following contents to `spark-defaults.conf`.

```bash
spark.executor.instances 4          # same as the total PMem namespace number of your cluster
spark.executor.cores 18             # total core number divided by executor number
spark.executor.memory 70g           # 4~5GB * spark.executor.cores
spark.executor.memoryOverhead 15g   # ~30% of spark.executor.memory
spark.shuffle.pmof.shuffle_block_size 2096128   # 2MB - 1024 bytes
spark.shuffle.pmof.spill_throttle 2096128       # 2MB - 1024 bytes
spark.driver.memory 10g
spark.yarn.driver.memoryOverhead 5g
spark.shuffle.compress true
spark.io.compression.codec snappy
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/oap-rpmem-shuffle-java--jar-with-dependencies.jar
spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/oap-rpmem-shuffle-java--jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.pmof.PmofShuffleManager
spark.shuffle.pmof.enable_rdma true
spark.shuffle.pmof.enable_pmem true
spark.shuffle.pmof.pmem_capacity 126833655808   # should match the PMem namespace size
spark.shuffle.pmof.pmem_list /dev/dax0.0,/dev/dax0.1,/dev/dax1.0,/dev/dax1.1
spark.shuffle.pmof.dev_core_set dax0.0:0-71,dax0.1:0-71,dax1.0:0-71,dax1.1:0-71
spark.shuffle.pmof.server_buffer_nums 64
spark.shuffle.pmof.client_buffer_nums 64
spark.shuffle.pmof.map_serializer_buffer_size 262144
spark.shuffle.pmof.reduce_serializer_buffer_size 262144
spark.shuffle.pmof.chunk_size 262144
spark.shuffle.pmof.server_pool_size 3
spark.shuffle.pmof.client_pool_size 3
spark.shuffle.pmof.node $host1-$IP1,$host2-$IP2  # host-IP pairs, $hostname-$ip, separated with ","
spark.driver.rhost $IP                           # change to your driver host IP
spark.driver.rport 61000
```

## 8. Trouble Shooting

If a previous job failed for any reason, please empty the PMem spaces before another run, because the normal space-release operation may not be invoked for failed jobs. For devdax, use `pmempool rm {devdax-namespace}` to reset the entire namespace. For fsdax, use `rm -rf {mounted-pmem-folder}/shuffle_block*` to remove the corresponding shuffle pool files.
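A minimal cleanup sketch that combines both cases, assuming the example namespaces and mount points used earlier in this guide; adjust the lists to your own devices and run it on every node before retrying:

```bash
#!/usr/bin/env bash
# Hedged cleanup sketch: reset PMem shuffle space on one node after a failed job.
# Device names and mount points are the examples from section 3; change them as needed.

# devdax: reset each namespace listed in spark.shuffle.pmof.pmem_list
for dev in /dev/dax0.0 /dev/dax0.1 /dev/dax1.0 /dev/dax1.1; do
    pmempool rm "$dev"
done

# fsdax: remove the shuffle pool files under each mounted PMem folder
for mnt in /mnt/pmem0 /mnt/pmem0.1 /mnt/pmem1 /mnt/pmem1.1; do
    rm -rf "$mnt"/shuffle_block*
done
```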
### Reference guides (without BKC access)

If you do not have BKC access, please follow the official guides below:

1. General PMem support: PMem support
2. PMem population rule: Module DIMM Population for Intel® Optane™ DC Persistent Memory
3. OS support requirement: Operating System Support for Intel® Optane™ DC Persistent Memory
4. Quick Start Guide: Provision Intel® Optane™ DC Persistent Memory