# CUDA-Fortran-2ed

**Repository Path**: mirrors_NVIDIA/CUDA-Fortran-2ed

## Basic Information

- **Project Name**: CUDA-Fortran-2ed
- **Description**: Source code from "CUDA Fortran for Scientists and Engineers, Second Edition"
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-10-21
- **Last Updated**: 2026-04-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Source Code from CUDA Fortran for Scientists and Engineers (2ed)

This repository contains the sourse code from the book
CUDA Fortran for Scientists and Engineers,
Best Practices for Efficient CUDA Fortran Programming, Second Edition,
arranged by chapter.

# Copyright and License

SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. \
SPDX-License-Identifier: Apache-2.0 

Licensed under the Apache License, Version 2.0 (the "License"); \
you may not use the files in this directory except in compliance with the License. \
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software \
distributed under the License is distributed on an "AS IS" BASIS, \
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. \
See the License for the specific language governing permissions and \
limitations under the License. 

## Part 1: CUDA Fortran Programming

### Chapter 1: Introduction
  * Section 1.3.1: `increment.f90` and `increment.cuf` demonstrate differences
  between Fortran and CUDA Fortran versions of a simple code

  * Section 1.3.2: `multiblock.cuf` demonstrates using multiple thread blocks

  * Section 1.3.3: `multidim.cuf` demonstrates how mutiple dimensions are
  accommodated in CUDA Fortran kernels

  * Section 1.3.4: `explicitInterface.cuf` demonstrates how explicit interfaces
  are used when device code is defined outside a `use`-d module

  * Section 1.3.5: `managed.cuf` and `managedImplicit.cuf` demonstrate use
  of managed memory 

  * Section 1.3.6: `multidimCUF.cuf`, `managedCUF.cuf`, and `managedCUF2.f90`
  demonstrate use of CUF kernels

  * Section 1.4.1: `deviceQuery.cuf` demonstrates how to determine device
  properties at runtime, and `pciBusID.cuf` demonstrates how to determine
  the PCI bus for a specified device

  * Section 1.5: `errorHandling.cuf`, `syncError.cuf`, and `asynError.cuf`
  demonstrate different aspects of error handling of device code

  * Section 1.7: `version.cuf` demonstrates how to determine the CUDA
  driver and CUDA Toolkit versions at runtime.

### Chapter 2: Correctness, Accuracy and Debugging

  * Section 2.1.1: `accuracy.cuf` demonstrates some accuracy issues with
  summations using a single accumulator

  * Section 2.1.2: `fma.cuf` demonstrates how to verify if a fused
  multiply-add (FMA) is used

  * Section 2.2.1: `print.cuf` shows how to print from device code

  * Section 2.2.2: `debug.cuf` is used for debugging with `cuda-gdb`

  * Section 2.2.3: `memcheck.cuf` and `initcheck.cuf` demonstrate how the
  `compute-sanitizer` can be used to check for out-of-bounds
  and initialization errors

### Chapter 3: Performance Measurements and Metrics

  * Section 3.1.2: `events.cuf` demonstrates how to use CUDA events to
  time kernel execution

  * Section 3.1.3: `multidim.cuf` is used to demonstrate profiling by the
  Nsight Systems command-line interface `nsys`

  * Section 3.1.4.1: `nvtxBasic.cuf` demonstrates use of the basic NVTX tooling
  interfaces

  * Section 3.1.4.2: `nvtxAdv.cuf` and `nvtxAdv2.cuf` demonstrate use of the
  advanced NVTX tooling interfaces

  * Section 3.1.4.3: `nvtxAuto.cuf` is used to show how NVTX ranges can be
  automatically generated without modification of source code (see Makefile)

  * Section 3.2: `limitingFactor.cuf` is used to show how kernels can be modified
  to determine performance limiting factors (instruction vs. memory)

  * Section 3.3.1: `peakBandwidth.cuf` uses device management API routines to
  determine the theoretical peak bandwidth

  * Section 3.3.2: `effectiveBandwidth.cuf` uses a simply copy kernel to calculate
  a representative achievable bandwidth
    
### Chapter 4: Synchronization

  * Section 4.1.2.1: `twoKernels.cuf` demonstrates synchronization characteristics
  of kernels run in different streams

  * Section 4.1.3: `pipeline.cuf` demonstrates overlapping data transfers and kernel
  execution

  * Section 4.1.4.2: `streamSync.cuf` demonstrates use of `cudaStreamSycnhronize()`

  * Section 4.1.4.3: `eventSync.cuf` demonstrates use of `cudaEventSycnhronize()`

  * Section 4.1.5.1: `defaultStream.cuf`, `defaultStreamVar.cuf`,
  and `defaultStreamVarExplicit.cuf` show how to set the default stream used
  for kernel launches, data transfers, and reduction operations

  * Section 4.1.5.2: `differentStreamTypes.cuf` and `concurrentKernels.cuf` demonstrate
  characteristics of non-blocking streams

  * Section 4.2.1: `sharedExample.cuf` demonstrates use of static and dynamic shared memory,
  ``sharedMultiple.cuf` shows how offsets are used when multiple dynamic shared memory arrays
  are declared as assumed size arrays

  * Section 4.2.2: `syncthreads.cuf` demonstrates use of `syncthreads_*()` variants

  * Section 4.2.3: `ballot.cuf` demonstrates use of the warp ballot functions

  * Section 4.2.3.1: `shfl.cuf` demonstrates use of the warp shuffle function `__shfl_xor()`

  * Section 4.2.4: `raceAndAtomic.cuf` and `raceAndAtomicShared.cuf` demonstrate
  how atomic operations can be used to avoid race conditions when modifying global
  and shared memory

  * Section 4.2.5: `threadfence.cuf` demonstrates how `threadfence()` is used to order
  memory accesses

  * Section 4.2.6: `cgReverse.cuf` is an cooperative group version of the
  `sharedExample.cuf` code from section 4.2.1

  * Section 4.2.6.1: `smooth.cuf` demonstrate use of grid synchronization via cooperative
  groups

  * Section 4.2.6.2: `swap.cuf` demonstrates how to use distributred shared memory via
  thread block clusters

  
### Chapter 5: Optimization

  * Section 5.1.1: `HDtransfer.cuf` shows performance of data transfers between host
  and device using pageable and pinned host memory, `sliceTransfer.cuf` shows (when
  profiled with `nsys`) that multiple transfers of array slices can be mapped to a single
  `cudaMemcpy2D()` call, and `async.cuf` demonstrates piplining of data transfers and
  kernel execution in different streams to achieve overlap

  * Section 5.2.2.1: `assumedShapeSize.cuf` shows (when compiled with `-gpu=ptxinfo`)
  how assumed-shape array declaration of kernel arguments results in large register
  useage relative to assume-size declarations

  * Section 5.2.2.2: `stide.cuf` and `offset.cuf` are used to determine the effective
  bandwidth of accessing global data with various strides and offsets

  * Section 5.2.3: `local.cuf` shows how to check for local memory usage

  * Section 5.2.4: `constant.cuf` and `constantAttribute.cuf` demonstrate use and
  verification of user-allocated constant memory

  * Section 5.2.5: `loads.cuf` demonstrates caching behavior of loads from global memory

  * Section 5.2.6.1: `maxSharedMemory.cuf` shows how to reserved the maximum amount of
  shared memory allowable

  * Section 5.2.6.2: `transpose.cuf` uses a progressive sequence of kernels to show
  the benefits of various shared-memory optimization strategies when performing a
  matrix transpose

  * Section 5.2.7: `spill.cuf` demonstrates the use of the `launch_bounds()` attribute

  * Section 5.3.1: `parallelism.cuf` demonstrates how the execution configuration and
  occupancy affect performance

  * Section 5.3.2.1: `parallelismPipeline.cuf` demonstrates asynchronous transfers between
  global and shared memory using the pipeline primitives interface

  * Section 5.3.2.2: `cufILP.cuf` demonstrates how to achieve instruction-level
  parallelism in CUF kernels

  * Section 5.4.1.4: `fma.cuf` is used to demonstrate how `-gpu=[no]fma` is used to
  contol use of fused multiply-add instructions

### Chapter 6: Porting Tips and Techniques

  * Section 6.1: `portingBase.f90` is a host code ported to CUDA using managed memory
  (`portingManaged.cuf`) and global memory (`portindDevice.cuf`)

  * Section 6.2: Condition inclusion of code using the predefined symbol `_CUDA`
  (`portingManaged_CUDA.F90`, `portingDevice_CUDA.F90`) and the `!@cuf` sentinel
  (`portingManagedSent.F90`, `portingDeviceSent.F90`)

  * Section 6.3.1-2: Porting of `laplace2D.f90` code via variable ranaming via
  `use` statements (`laplace2DUse.F90`) and via `associate` blocks
  (`portingAssociate.f90`, `laplace2DAssoc.f90`)

  * Section 6.4: The module `union_m.cuf` contains a C-like union for
  reduction of global memory footprint of work arrays
  
  * Section 6.5: The modules `compact_m.cuf` and the optimized `compactOpt_m.cuf` contain
  routines for array compaction
  
### Chapter 7: Interfacing with CUDA C Code and CUDA Libraries

  * Section 7.1: `callingC.cuf` shows how to interface CUDA Fortran with CUDA C routines
  in `c.cu`

  * Sections 7.2.1-2: `sgemmLegacy.cuf` and `sgemmNew.cuf` demonstrate how to interface with
  cuBLAS library using the legacy and new cuBLAS APIs

  * Section 7.2.3: `getrfBatched.cuf` shows how to interface with batched cuBLAS routines

  * Section 7.2.4: `gemmPerf.cuf` shows how to opt in to using the TF32 format and tensor cores
  for matrix mutiplication

  * Section 7.3: `cusparseMV.cuf` and `cusparseMV_ex.cuf` demonstrate use of the cuSPARSE library

  * Section 7.4: `potr.cuf` demonstrates use of the cuSOLVER library

  * Section 7.5: `matmulTC.cuf` and `matmulTranspose.cuf` demonstrate
  use of the tensor core library through an overloaded `matmul()` routine as well as
  through the cuBLAS interfaces through the use of the `cutensorEx` module

  * Section 7.5.1: `cutensorContraction.cuf` illustrates use of the low-level cuTENSOR
  interfaces

  * Section 7.6: `testSort.cuf` use interfaces to the Thrust C++ template library to
  sort an array

### Chapter 8: MultiGPU-Programming

  * Section 8.1: `minimal.cuf` shos how to select, and allocate global memory on,
  different devices at runtime

  * Section 8.1.1.1: `p2pAccess.cuf` shows how to check for peer-to-peer access
  between devices

  * Section 8.1.2: `directTransfer.cuf` show how to transfer data between global memory
  on different devices without staging through the host memory, `p2pBandwidth.cuf` measures
  the bandwidth of transfers between GPUs

  * Section 8.1.3: `transposeP2P.cuf` performs a distributed transpose using P2P transfers

  * Section 8.2.1: `mpiDevices.cuf` shows how MPI ranks are mapped to devices based on
  the compute mode, and `assignDevice.cuf` shows how to ensure each MPI rank maps to
  a different device regardless of the compute mode setting, through a routine in the
  `mpiDeviceUtil.cuf` module

  * Section 8.2.2: `transposeMPI.cuf` and `transposeCAMPI.cuf` are MPI and
  CUDA-aware MPI versions of the distributed transpose (similar to the P2P transpose
  performed in Section 8.1.3)


## Part 2: Case Studies

### Chapter 9: Monte Carlo Method

  * Section 9.1: `generate_randomnumbers.cuf` demonstrates use of the CURAND library to
  generate random numbers

  * Section 9.2: `compute_pi.cuf` computes pi using the Monte Carlo technique

  * Section 9.2.1: `ieee_accuracy.f90` is used to illustrate accuracy issues
  related to FMA

  * Section 9.3: `pi_performance.CUF` measures performance of the pi calculation
  using shared memory, shuffle, atomic locks, and cooperative group kernels

  * Section 9.3.1: `shflExample.cuf` demonstrates use of the warp shuffle instructions

  * Section 9.3.3: `testPiGridGroup.cuf` shows how to use the grid_group cooperative
  group to perform reductions

  * Section 9.4: `accuracy_sum.cuf` demonstrate issues encountered with accuracy
  of summations

  * Section 9.5: `montecarlo_european_option.cuf` uses Monte Carlo methods to
  price European options

### Chapter 10: Finite Difference Method

  * Section 10.1: `finiteDifference.cuf` calculates a numerical derivatives using a
  nine-point stencil

  * Section 10.1.2: `limitingFactor.cuf` uses modified derivative kernels to
  isolate the limiting factor

  * Section 10.1.4: `finiteDifferenceStr.cuf` calculated derivatives on
  non-uniform grid

  * Section 10.2: `laplace2D.cuf` is a finite difference solution ot the
  2D Laplace equation

### Chapter 11: Applications of the Fast Fourier Transform

  * Section 11.1: `fft_test_c2c.cuf` and `fft_test_r2c.cuf` demonstrate
  use of the CUFFT library

  * Section 11.2: `fft_derivative.cuf` demonstrates use of the CUFFT routines
  to calculate derivatives

  * Section 11.3: `exampleOverlapFFT.cuf` performs a convolution via FFTs

  * Section 11.4: `ns2d.cuf` is a vortex simulation using FFTs
 
### Chapter 12: Ray Tracing

  * Section 12.1: `ppmExample.f90` generates a simple PPM file, the format used
  for images in this chapter

  * Section 12.2: `rgb_m.F90` contains the RGB derived type and overloaded
  operations

  * Section 12.3: `ray.F90` uses the ray derived type in the first ray tracing
  code

  * Section 12.4: `sphere.F90` shows how intersections of rays with a sphere
  are calculated

  * Section 12.5: `normal.F90` calculates surface normals, and `twoSpheres.F90`
  accommodates mutiple objects

  * Section 12.6: `antialias.F90` shows how multiple rays per pixel are used
  in antialiasing

  * Section 12.7.1: `diffuse.F90` generates an image of a sphere with a Lambertian
  or diffuse surface

  * Section 12.7.2: `metal.F90` generates an image of a metalic and diffuse
  spheres

  * Section 12.7.3: `dielectric.F90` generates an image with glass, metal, and
  diffuse spheres

  * Section 12.8: `camera.F90` implements a positionable camera

  * Section 12.9: `defocusBlur.F90` implements focal length effects

  * Section 12.10: `cover.F90` generates a scene with many spheres

  * Section 12.11: `triangle.F90` implements triangular objects

  * Section 12.12: `lights.f90` implements lighted objects

  * Section 12.13: `texture.F90` implements a textured surface