# numa-spmv

**Repository Path**: ssslab/numa-spmv

## Basic Information

- **Project Name**: numa-spmv
- **Description**: This work presents a NUMA-Aware optimization technique for the SpMV operation on the Phytium 2000+ ARMv8-based 64-core processor.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-10-18
- **Last Updated**: 2022-01-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: C

## README

# NUMA-Aware SpMV

This work presents a NUMA-Aware optimization technique for the SpMV operation on the Phytium 2000+ ARMv8-based 64-core processor.

## Paper information
--------------------

Yu X., Ma H., Qu Z., Fang J., Liu W. (2021) NUMA-Aware Optimization of Sparse Matrix-Vector Multiplication on ARMv8-Based Many-Core Architectures. In: He X., Shao E., Tan G. (eds) Network and Parallel Computing. NPC 2020. Lecture Notes in Computer Science, vol 12639. Springer, Cham. https://doi.org/10.1007/978-3-030-79478-1_20

## Contact us
-------------

If you have any questions about running the code, please contact Xiaosong Yu.

E-mail: 2019215847@student.cup.edu.cn

## Introduction
---------------

The sparse matrix-vector multiplication (SpMV) operation multiplies a sparse matrix $A$ with a dense vector $x$ and produces a dense result vector $y$. It is one of the level-2 sparse basic linear algebra subprograms (BLAS) and one of the most frequently called kernels in scientific and engineering computations. Its performance typically has a great impact on sparse iterative solvers such as the conjugate gradient (CG) method and its variants. To represent the sparse matrix, many storage formats and corresponding SpMV algorithms have been proposed to save memory and execution time. Since SpMV has a very low ratio of floating-point operations to memory accesses, and its access patterns can be very irregular, it is a typical memory-bound and latency-bound kernel.

Many existing SpMV optimization efforts achieve performance improvements to various degrees, but they give little consideration to the NUMA (non-uniform memory access) characteristics of a wide range of modern processors, such as ARM CPUs. To obtain scale-out benefits on modern multi-core and many-core processors, a NUMA architecture is often an inevitable choice. Most modern x86 processors (e.g., the AMD EPYC series) and ARM processors (e.g., Phytium 2000+) use a NUMA architecture to build a processor with tens of cores. To further increase the number of cores in a single node, multiple (typically two, four, or eight) such processor modules are integrated onto a single motherboard and connected through high-speed buses. However, such a scalable design often brings stronger NUMA effects, i.e., noticeably lower bandwidth and higher latency when cross-NUMA accesses occur.

To improve SpMV performance on modern processors, in this work we develop a NUMA-Aware SpMV approach. We first reorder the input sparse matrix with hypergraph partitioning tools, then allocate a row block of $A$ and the corresponding part of $x$ on each NUMA node, and pin threads onto the hardware cores of the NUMA nodes to run the parallel SpMV operation. Because the reordering technique concentrates the non-zeros of $A$ into diagonal blocks and naturally creates affinity between the blocks and the vector $x$, the data locality of accesses to $x$ can be significantly improved.
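To make the execution step concrete, below is a minimal C sketch (illustrative only, not the repository's code): each NUMA node owns one reordered CSR row block of $A$, the block's arrays are allocated on that node's memory with libnuma, and the worker thread is pinned to a core of that node before running a plain CSR SpMV over the block. The structure name `csr_block_t`, its field layout, and the one-thread-per-block launch are assumptions made for the sketch (the actual code runs several threads per node), and for brevity each node here holds a full local copy of $x$ rather than only its corresponding part.

```c
/* Minimal sketch of the NUMA-Aware SpMV execution step. Assumes the
 * matrix has already been reordered with a hypergraph partitioner and
 * split into one CSR row block per NUMA node.
 * Build with: gcc -O2 numa_spmv_sketch.c -lnuma -lpthread
 */
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

typedef struct {          /* one reordered CSR row block owned by a NUMA node */
    int     nrows;        /* rows in this block                               */
    int    *row_ptr;      /* nrows + 1 row pointers                           */
    int    *col_idx;      /* column indices of the non-zeros                  */
    double *val;          /* non-zero values                                  */
    double *x;            /* node-local copy of x (simplification: full copy) */
    double *y;            /* output rows computed by this block               */
    int     node;         /* NUMA node that owns the block's memory           */
    int     core;         /* hardware core the worker thread is pinned to     */
} csr_block_t;

/* Place all arrays of one block on the memory of its NUMA node. */
void alloc_block_on_node(csr_block_t *b, int nnz, int ncols)
{
    b->row_ptr = numa_alloc_onnode((b->nrows + 1) * sizeof(int), b->node);
    b->col_idx = numa_alloc_onnode(nnz * sizeof(int),            b->node);
    b->val     = numa_alloc_onnode(nnz * sizeof(double),         b->node);
    b->x       = numa_alloc_onnode(ncols * sizeof(double),       b->node);
    b->y       = numa_alloc_onnode(b->nrows * sizeof(double),    b->node);
}

/* Worker: pin the calling thread to its core, then run CSR SpMV on the block. */
void *spmv_worker(void *arg)
{
    csr_block_t *b = (csr_block_t *)arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(b->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (int i = 0; i < b->nrows; i++) {
        double sum = 0.0;
        for (int j = b->row_ptr[i]; j < b->row_ptr[i + 1]; j++)
            sum += b->val[j] * b->x[b->col_idx[j]];
        b->y[i] = sum;
    }
    return NULL;
}

/* Launch one pinned worker per block and wait for all of them. */
void numa_aware_spmv(csr_block_t *blocks, int nblocks)
{
    pthread_t *tid = malloc(nblocks * sizeof(pthread_t));
    for (int k = 0; k < nblocks; k++)
        pthread_create(&tid[k], NULL, spmv_worker, &blocks[k]);
    for (int k = 0; k < nblocks; k++)
        pthread_join(tid[k], NULL);
    free(tid);
}
```

In the paper's scheme only the corresponding part of $x$ is placed on each node; because the hypergraph reordering concentrates the non-zeros into diagonal blocks, most of the `x[col_idx[j]]` reads in the inner loop then stay on the block's own node.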
We benchmark 15 sparse matrices from the SuiteSparse Matrix Collection on a 64-core ARMv8-based Phytium 2000+ processor. We set the number of hypergraph partitions to 2, 4, 8, 16, 32, and 64, set the number of threads to 8, 16, 32, and 64, and measure the performance of their combinations. The experimental results show that, compared to a classical OpenMP SpMV implementation (see the baseline sketch at the end of this README), our NUMA-Aware approach greatly improves SpMV performance, by 1.76x on average (up to 2.88x).

## Installation environment
---------------------------

The graph partitioning tool Metis needs to be installed. Libnuma and numactl also need to be installed.

## Execution of NUMA-Aware SpMV
-------------------------------

> make

> ./pthread_metis_numa matrix threads_number (Example: ./pthread_metis_numa M6.mtx 16)
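The classical OpenMP SpMV implementation used as the comparison baseline above is typically a row-parallel CSR kernel along the lines of the following sketch (an illustration of the common baseline, not the code shipped in this repository):

```c
/* Row-parallel CSR SpMV baseline: y = A * x with A stored as
 * row_ptr / col_idx / val. The kernel does not control where the
 * arrays are placed, so on a NUMA machine a thread may read x and
 * the matrix data from a remote node's memory.
 * Build with: gcc -O2 -fopenmp spmv_csr_omp.c
 */
void spmv_csr_omp(int nrows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```

The NUMA-Aware version addresses exactly this placement problem by binding each row block, its part of $x$, and the corresponding threads to one NUMA node.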