# showcase_rust_riccsd
**Repository Path**: ajz34/showcase_rust_riccsd
## Basic Information
- **Project Name**: showcase_rust_riccsd
- **Description**: Evaluates restricted RI-CCSD energy by pure Rust.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-01-21
- **Last Updated**: 2025-01-21
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Showcase of RI-CCSD with Rust
This program evaluates the restricted RI-CCSD energy in pure Rust.
> To non-chemists: RI-CCSD can be seen as a collection of dense 2- to 4-dimensional tensor computations. Most tasks in RI-CCSD can be converted to matrix multiplication, so it is mostly compute-bound.
>
> RI-CCSD is a 4-dimensional problem by nature; coding it with libraries that focus on 2-dimensional matrices is usually inconvenient, if not impossible.
>
> Interested readers may refer to this (incomplete) article list: [Q-Chem](https://dx.doi.org/10.1063/1.4820484), [Psi4 fnocc](https://dx.doi.org/10.1021/ct400250u), [FHI-aims](https://dx.doi.org/10.1021/acs.jctc.8b01294), [GAMESS US](https://dx.doi.org/10.1021/acs.jctc.1c00389), to name a few.
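To make the "mostly matrix multiplication" point concrete, below is a minimal sketch of the idea, not the showcase's code: once compound indices such as $(i,a)$ and $(j,b)$ are flattened into single matrix dimensions, a 4-index contraction like $W_{ia,jb} = \sum_P B_{ia,P} B_{jb,P}$ is a single GEMM. The sketch uses the `ndarray` crate purely to stay self-contained; the showcase itself builds on RSTSR.
```rust
// Illustration only (not from riccsd.rs): a 4-index contraction as a GEMM.
// W[(i,a),(j,b)] = sum_P B[(i,a),P] * B[(j,b),P]
use ndarray::Array2;

fn contract_b_bt(b: &Array2<f64>) -> Array2<f64> {
    // The compound indices (i,a) and (j,b) are already flattened into the
    // row dimension, so the whole contraction is one matmul.
    b.dot(&b.t()) // (nocc*nvir, naux) x (naux, nocc*nvir)
}

fn main() {
    let (nocc, nvir, naux): (usize, usize, usize) = (4, 10, 24); // toy sizes
    let b = Array2::<f64>::ones((nocc * nvir, naux));
    let w = contract_b_bt(&b);
    assert_eq!(w.shape(), &[nocc * nvir, nocc * nvir]);
}
```
In the actual method, $(i,a)$ runs over occupied-virtual orbital pairs and $P$ over the auxiliary (RI) basis.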
## Contents
~ 600 lines of code in [riccsd.rs](src/riccsd.rs), with comments showing how each step would be implemented in numpy with `np.einsum`.
This showcase uses [RSTSR](https://github.com/ajz34/rstsr) as the tensor library, at [commit 403e815](https://github.com/ajz34/rstsr/commit/403e815f9014f60346716b4fc754b78e0006db9f). This crate is under development and has not yet been published to crates.io.
This showcase hopes to show that, with some syntactic sugar from a tensor (n-dimensional array) library, Rust can handle problems whose FLOPs mostly come from matmul of large dense matrices, with
- an acceptable number of lines of code (maybe 1.5-4 times that of numpy if `np.einsum` is not allowed)
- acceptable efficiency
- rayon parallelism with an acceptable amount of unsafe mutable access (see the sketch at the end of this section)
In other words, Rust is a proper candidate for certain scientific computing tasks, balancing program efficiency, code readability, development efficiency, and memory reliability.
**This conclusion is not trivial**: existing tensor libraries in the Rust ecosystem do not seem fully prepared for this task, so this showcase project demonstrates the possibility.
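The rayon point above deserves a concrete picture. The following is a hypothetical sketch of the general pattern, not code from riccsd.rs: parallelizing over a leading index while each task writes into its own disjoint chunk of a shared buffer. This regular case needs no `unsafe` at all; the "acceptable unsafe mutables" of the showcase presumably cover write patterns that cannot be split this cleanly.
```rust
// Hypothetical sketch of the parallel-write pattern (not from riccsd.rs).
use rayon::prelude::*;

/// Fill a t2-like buffer of shape (nocc, nocc, nvir, nvir) in parallel over
/// the leading occupied index i; each rayon task owns the contiguous block
/// t2[i, :, :, :], so the mutable accesses are disjoint and safe.
fn assemble_blocks(nocc: usize, nvir: usize) -> Vec<f64> {
    let block_len = nocc * nvir * nvir;
    let mut t2 = vec![0.0_f64; nocc * block_len];
    t2.par_chunks_mut(block_len)
        .enumerate()
        .for_each(|(i, block)| {
            for (rest, x) in block.iter_mut().enumerate() {
                *x = (i * block_len + rest) as f64; // placeholder work
            }
        });
    t2
}

fn main() {
    let t2 = assemble_blocks(4, 10);
    assert_eq!(t2.len(), 4 * 4 * 10 * 10);
}
```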
## Efficiency demonstration
Computation device information:
- personal computer
- AMD Ryzen 7945HX, 16 physical cores
- only one NUMA node
- 2 x 32 GB memory (5600 MT/s)
System information:
- (H2O)10 cluster, PP5 structure from [10.1021/jp104865w](https://dx.doi.org/10.1021/jp104865w).
- basis: cc-pVDZ
- auxiliary basis: cc-pVDZ-ri (used both for SCF and CCSD, which is not recommended for real-world evaluations, but this project is only an efficiency benchmark)
- $n_\mathrm{occ} = 40$ (frozen core), $n_\mathrm{vir} = 190$, $n_\mathrm{aux} = 820$.
| | this showcase | Psi4 | PySCF |
|--|--|--|--|
| correlation energy (a.u.) | -2.1735512 | -2.1735494 | -2.1735499 |
| time per iteration (sec) | ~ 18.5 | ~ 19.0 | ~ 29.5 |
| version | - | 1.9.1 (conda-forge) | 2.7.0 (pypi) |
| math library | OpenBLAS (compiled) | Intel OneAPI (conda-forge) | OpenBLAS (pypi) |
| math library threading | pthread | TBB | serial |
| algorithms | DF | DF (fnocc) | Conv |
*DF* refers to density-fitting integrals and algorithms; *Conv* refers to conventional integrals.
Some important notes:
- Psi4 uses Intel OneAPI (MKL), which is not very efficient on AMD CPUs, so this is actually not a fair comparison. An estimated 20% efficiency boost could be expected with OpenBLAS.
- Psi4 has multiple CC engines: FNOCC is more efficient, while OCC has more functionality.
- Compared to conventional-integral algorithms, density fitting (RI-CCSD) actually increases FLOPs and only decreases the memory footprint for large systems (assuming I have implemented both the RI-CCSD and Conv-CCSD algorithms correctly).
## Details of Efficiency
- Time per iteration: ~ 18.5 sec
  - $O(n_\mathrm{occ}^3 n_\mathrm{vir}^3)$ term: ~ 6.0 sec
    - FLOPs estimation: slightly larger than $2 \times 4 \times n_\mathrm{occ}^3 n_\mathrm{vir}^3 = 3.27 \ \mathrm{T}$
    - ~ 540 GFLOP/sec, 48% of the CPU's maximum L1 bandwidth
  - $O(n_\mathrm{occ}^2 n_\mathrm{vir}^4)$ term (pp-Ladder): ~ 8.8 sec
    - FLOPs estimation: slightly larger than $2 \times (n_\mathrm{vir}^4 n_\mathrm{aux} + 0.5 \times n_\mathrm{occ}^2 n_\mathrm{vir}^4) = 3.98 \ \mathrm{T}$
    - ~ 450 GFLOP/sec, 40% of the CPU's maximum L1 bandwidth
We expect that 50% efficiency is achievable, but that requires more finely tuned code.
This project has not been optimized to lower the memory footprint: the code accepts duplicating some $O(n_\mathrm{occ}^2 n_\mathrm{vir}^2)$ and $O(n_\mathrm{vir}^2 n_\mathrm{aux})$ tensors. It also does not use advanced iteration drivers (DIIS), so more iterations than usual are expected.
## To reproduce
This project is only a demonstration of efficiency and code style; usability is not the first concern.
A binary is available for this showcase; refer to the [release page](https://github.com/ajz34/showcase_rust_riccsd/releases/tag/v0.1).
To use this binary, some preparation is required:
```bash
export RAYON_NUM_THREADS=16    # number of parallel threads
export RUST_MIN_STACK=16777216 # increase if a stack overflow occurs
export LD_LIBRARY_PATH=:$LD_LIBRARY_PATH # prepend the directory containing libopenblas.so (pthread build)
./showcase_rust_riccsd_glibc_2.17
```
This project requires the user to do the following (more details in the [env file](env.sh) or the [vscode settings](.vscode/settings.json)):
- Provide `libopenblas.so` (pthread scheme) in `$LD_LIBRARY_PATH`; due to how Rust's FFI works, an OpenMP-compiled OpenBLAS does not work;
- Provide the following `*.npy` files, in the pyscf convention (see [python_scripts](python_scripts) for details):
- mo_coeff.npy (C-contiguous, shape (nao, nmo))
- mo_energy.npy
- cderi.npy (in lower-triangular packed AO basis)
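As a side note on the packed cderi layout, here is a minimal sketch, assuming pyscf's `lib.pack_tril` ordering (row-wise lower triangle, one packed row of length nao*(nao+1)/2 per auxiliary index), of how a packed row can be unpacked into a full symmetric (nao, nao) matrix. This is illustration only, not the showcase's actual loading code.
```rust
// Illustration only: unpack one lower-triangular packed row (pyscf
// lib.pack_tril ordering assumed) into a full symmetric nao x nao matrix.
fn unpack_tril(packed: &[f64], nao: usize) -> Vec<f64> {
    assert_eq!(packed.len(), nao * (nao + 1) / 2);
    let mut full = vec![0.0_f64; nao * nao];
    let mut k = 0;
    for i in 0..nao {
        for j in 0..=i {
            full[i * nao + j] = packed[k];
            full[j * nao + i] = packed[k]; // symmetric counterpart
            k += 1;
        }
    }
    full
}

fn main() {
    // 2 x 2 example: packed = [a00, a10, a11]
    let full = unpack_tril(&[1.0, 2.0, 3.0], 2);
    assert_eq!(full, vec![1.0, 2.0, 2.0, 3.0]);
}
```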