# simtight

**Repository Path**: shi-feng-logic/simtight

## Basic Information

- **Project Name**: simtight
- **Description**: Synthesisable SIMT-style RISC-V GPGPU
- **Primary Language**: Verilog
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2025-04-18
- **Last Updated**: 2026-01-09

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# SIMTight

SIMTight is a synthesisable GPGPU core implementing the _Single Instruction Multiple Threads (SIMT)_ execution model, featuring:

* RISC-V instruction set (`rv32ima_zfinx_xcheri`)
* Low-area design with high IPC on classic GPGPU workloads
* Dynamic scalarisation (automatic detection of scalar behaviour in hardware without ISA/compiler mods)
* Parallel scalar/vector pipelines, exploiting scalarisation for increased throughput
* Register file and cache compression, exploiting scalarisation for reduced on-chip storage and energy
* Strong [CHERI](http://cheri-cpu.org) memory safety and isolation
* Runs [CUDA-like C++ library](doc/NoCL.md) and [benchmark suite](apps/) (in pure capability mode when CHERI enabled)

Further details about SIMTight can be found in the following documents.

* *Advanced Dynamic Scalarisation for RISC-V GPGPUs*, ICCD 2024 ([paper](https://www.repository.cam.ac.uk/handle/1810/373257), [slides](doc/iccd2024-slides.pdf))
* *CHERI-SIMT report: implementing capability memory protection in GPGPUs*, Technical Report ([report](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-997.html))

SIMTight was developed on the [CAPcelerate project](https://ctsrd-cheri.github.io/capcelerate-website/), part of the UKRI's Digital Security by Design programme.

## Evaluation SoC

The SIMTight evaluation SoC consists of a host CPU and a 32-lane 64-warp streaming multiprocessor sharing DRAM, both supporting the CHERI-RISC-V ISA.
A sample project is included for the [DE10-Pro](http://de10-pro.terasic.com) ([revD](de10-pro/) and [revE](de10-pro-e/)) FPGA development board.

## Dependencies

We'll need Verilator, a RISC-V compiler, and GHC 9.2.1 or later.

On Ubuntu 20.04 or 22.04, we can do:

```sh
$ sudo apt install verilator
$ sudo apt install gcc-riscv64-unknown-elf
$ sudo apt install libgmp-dev
```

For GHC 9.2.1 or later, [ghcup](https://www.haskell.org/ghcup/) can be used.

If you're having difficulty meeting the dependencies, please use our [docker container](docker/): simply type `make shell` after a recursive clone of this repo.

## Getting started

Recursively clone the repo:

```sh
$ git clone --recursive https://github.com/CTSRD-CHERI/SIMTight
```

Inside the repo, there are various things to try. For example, to build and run the SIMTight simulator:

```sh
$ cd sim
$ make
$ ./sim &
```

With the simulator running in the background, we can build and run the test suite:

```sh
$ cd apps/TestSuite
$ make test-cpu-sim     # Run on the CPU
$ make test-simt-sim    # Run on the SIMT core
```

Alternatively, we can run one of the SIMT kernels:

```sh
$ cd apps/Samples/Histogram
$ make RunSim
$ ./RunSim
```

To run all tests and benchmarks, we can use the [test](test/test.sh) script. This script will launch the simulator automatically, so we first make sure it's not already running.

```sh
$ killall sim
$ cd test
$ ./test.sh             # Run in simulation
```

To build an FPGA image for the [DE10-Pro revE](http://de10-pro.terasic.com) board (Quartus 21.3pro or later recommended):

```sh
$ cd de10-pro-e
$ make                  # Assumes quartus is in your PATH
$ make download-sof     # Assumes DE10-Pro revE is connected via USB
```

We can now run a SIMT kernel on FPGA:

```sh
$ cd apps/Samples/Histogram
$ make
$ ./Run
```

To run the test suite and all benchmarks on a DE10-Pro revE FPGA:

```sh
$ cd test
$ ./test.sh --fpga-e    # Assumes FPGA image built and FPGA connected via USB
```

Use the `--stats` option to generate performance stats.

## Enabling CHERI :cherries:

To enable CHERI, some additional preparation is required.
First, edit [inc/Config.h](inc/Config.h) and apply the following settings:

* `#define EnableCHERI 1`
* `#define EnableTaggedMem 1`
* `#define UseClang 1`

Second, install the CHERI-Clang compiler using our [script](cheri-tools/build-cheri.sh). Assuming all of [cheribuild's dependencies](https://github.com/CTSRD-CHERI/cheribuild#pre-build-setup) are met, we can simply do:

```sh
$ cd cheri-tools
$ ./build-cheri.sh
```

This will install the compiler into `$(pwd)/cheri/output/sdk/bin`, which we can then add to our `PATH`:

```sh
export PATH=$(pwd)/cheri/output/sdk/bin:$PATH
```

If you're having difficulty meeting any of [cheribuild's dependencies](https://github.com/CTSRD-CHERI/cheribuild#pre-build-setup), please use our [docker container](docker/).

We mustn't forget to `make clean` in the root of the SIMTight repo any time [inc/Config.h](inc/Config.h) is changed. At this point, all of the standard build instructions should work as before.

CHERI instructions for getting and setting bounds on capabilities are quite expensive in terms of logic area and typically not performance critical. Therefore, it can be useful to share the bounds getting/setting logic between vector lanes:

* `#define SIMTUseSharedBoundsUnit 1`

Various optimisations are enabled by this setting. It leads to a large reduction in area overhead, at almost no performance cost across the benchmark suite.

Another option that reduces the area overhead of CHERI is:

* `#define SIMTUseFixedPCC 1`

But beware, this setting removes some CHERI functionality. Specifically, it tells the SIMT core to ignore changes to the bounds and permissions of the PCC. Once the bounds and permissions of the PCC for each warp are set at kernel startup, they can never be changed.

Further details for reproducing results can be found in the [CHERI-SIMT report](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-997.html).
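Since a stale or partly edited `inc/Config.h` is easy to overlook, a quick check of the three required CHERI settings can help before rebuilding. The following is a minimal sketch and not part of the SIMTight repo: the helper names `parse_defines` and `missing_cheri_settings` are hypothetical, and the sketch assumes each setting appears as a plain `#define NAME VALUE` line.

```python
import re

# The three settings listed above as required for CHERI.
REQUIRED_CHERI = {"EnableCHERI": "1", "EnableTaggedMem": "1", "UseClang": "1"}

# Matches "#define NAME VALUE" on a single line (assumed Config.h layout).
DEFINE_RE = re.compile(r"^[ \t]*#[ \t]*define[ \t]+(\w+)[ \t]+(\S+)", re.MULTILINE)


def parse_defines(text):
    """Map each NAME to its VALUE; a later definition overrides an earlier one."""
    return {name: value for name, value in DEFINE_RE.findall(text)}


def missing_cheri_settings(text):
    """Return the required CHERI settings that are absent or not set to 1."""
    defines = parse_defines(text)
    return sorted(
        name for name, value in REQUIRED_CHERI.items() if defines.get(name) != value
    )
```

Run from the root of the repo, `missing_cheri_settings(open("inc/Config.h").read())` returns an empty list when all three settings are in place. A `make clean` is still needed after any change to the file.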
## Enabling scalarisation

Scalarisation is an optimisation that detects _uniform_ and _affine_ vectors and processes them more efficiently as scalars, reducing on-chip storage and increasing performance density. An _affine_ vector is one in which there is a constant stride between each element; a _uniform_ vector is an affine vector where the stride is zero, i.e. all elements are equal. SIMTight implements _dynamic scalarisation_ (i.e. in hardware, at runtime), and it can be enabled separately for the integer register file and the register file holding capability metadata. To enable scalarisation of both register files, edit [inc/Config.h](inc/Config.h) and apply the following settings:

* `#define SIMTEnableRegFileScalarisation 1`
* `#define SIMTEnableCapRegFileScalarisation 1`

These options alone only enable scalarisation of uniform vectors. To enable scalarisation of affine vectors, apply the following setting:

* `#define SIMTEnableAffineScalarisation 1`

Note that affine scalarisation only applies to the integer register file.

SIMTight exploits scalarisation to reduce register file storage requirements. Hence, it is desirable to set the number of physical registers to a value smaller than the number of architectural registers. In cases where scalarisation cannot prevent overflow of the physical register file, the hardware implements _dynamic register spilling_, where registers are evicted to and fetched from DRAM as required. In the default configuration, the size of the physical register files is equal to the number of architectural registers (so dynamic spilling is not required):

* `#define SIMTLogRegFileSize 11`
* `#define SIMTLogCapRegFileSize 11`

At the moment we have two spill policies: pick-first and least-recently-used. To enable the latter:

* `#define SIMTUseLRUSpill 1`

When CHERI is enabled, it's possible to share vector register memory between the integer and capability metadata register files:

* `#define SIMTUseSharedVecScratchpad 1`

In this case, both register file sizes must be set to the same value. This option causes a one-cycle pipeline bubble when loading a capability metadata vector from the register file.

SIMTight also supports an experimental _scalarised vector store buffer_ (also referred to as the compressed stack cache) to reduce the cost of compiler-inserted register spills (as opposed to hardware-inserted dynamic spills), at low hardware cost. It can be enabled as follows:

* `#define SIMTEnableSVStoreBuffer 1`

As well as reducing on-chip storage, scalarisation is also exploited to improve runtime performance: enabling a scalar pipeline in the SIMT core allows an entire warp to be executed on a single execution unit in a single cycle (when the instruction is detected as scalarisable), _and operates in parallel with the main vector pipeline_. For many workloads, this increases performance density significantly.

* `#define SIMTEnableScalarUnit 1`

To enable the initial value optimisation (also referred to as the null value optimisation) in the capability metadata register file:

* `#define SIMTCapRFUseInitValOpt 1`

This is a simple form of partial scalarisation allowing compact storage of vectors that can be partitioned into an arbitrary scalar value and the initial value (null capability metadata in this case) using a bit mask.
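The vector classes used in this section can be illustrated with a small software model. This is only a sketch of the definitions, not of the hardware: the function names are hypothetical, lanes are modelled as plain Python integers (no 32-bit wrap-around, no limit on stride range), and the last function assumes that `SIMTLogRegFileSize` counts vector registers across all warps, with 32 architectural registers per thread.

```python
def classify(vec):
    """Classify a non-empty list of per-lane values.

    Returns (kind, base, stride), where kind is "uniform", "affine" or
    "vector". For "vector" no compact form exists, so base and stride are None.
    """
    base = vec[0]
    stride = vec[1] - vec[0] if len(vec) > 1 else 0
    if all(v == base + i * stride for i, v in enumerate(vec)):
        return ("uniform" if stride == 0 else "affine"), base, stride
    return "vector", None, None


def expand(base, stride, lanes):
    """Rebuild the per-lane values of a uniform or affine vector."""
    return [base + i * stride for i in range(lanes)]


def split_init(vec, init=0):
    """Model of the initial value optimisation (partial scalarisation).

    If every lane holds either `init` or one common scalar, return
    (mask, scalar), where mask[i] is True for lanes holding the scalar.
    Otherwise return None.
    """
    others = {v for v in vec if v != init}
    if len(others) > 1:
        return None
    scalar = others.pop() if others else init
    return [v != init for v in vec], scalar


def log_regfile_size(warps, regs_per_thread):
    """log2 of the number of architectural vector registers (assumed layout)."""
    total = warps * regs_per_thread
    assert total > 0 and total & (total - 1) == 0, "expected a power of two"
    return total.bit_length() - 1
```

For example, a vector of thread indices scaled by 4 is affine with stride 4, so it can be held as a base and a stride rather than 32 separate elements. Under the stated assumption, `log_regfile_size(64, 32)` gives 11 for the 64-warp evaluation SoC, which matches the default `SIMTLogRegFileSize`.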

Supported by the Digital Security by Design (DSbD) Programme.