# FastBlur

**Repository Path**: Whistlle/FastBlur

## Basic Information

- **Project Name**: FastBlur
- **Description**: Implement some fast blur algorithms on GPU and benchmark using Unity
- **Primary Language**: C#
- **License**: WTFPL
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2018-06-19
- **Last Updated**: 2020-12-21

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Final Project

## 1. Topic

The topic I selected is *An investigation of fast real-time GPU-based image blur algorithms*, published on [intel.com](https://software.intel.com/en-us/blogs/2014/07/15/an-investigation-of-fast-real-time-gpu-based-image-blur-algorithms).

## 2. Introduction & Motivations

Image blur filters are commonly used in computer graphics and image processing. Whether as an integral part of Depth of Field, HDR Bloom, or another post-process effect, blur filters are present in most 3D game engines, often in multiple places. The following images show the Bloom and Depth of Field effects. How these effects are built is not the focus of this topic, but both rely on blur filters:

![bloom](image/bloom.jpg)
![dof](image/dof.jpg)

My research topic is graphics rendering on mobile and web platforms, so I focus not only on the rendering effect but also on the performance of the algorithm on low-end hardware. When I implemented the blur filter with the naive solution on Android, the application ran at a low FPS, so I wanted to find a more optimal solution that fully uses the mobile GPU to increase performance. I eventually found this paper on the Intel blog.
This paper covers three algorithms:

- Classic convolution blur using a Gaussian distribution
- A generalization of a Kawase Bloom
- A compute-shader-based moving-average algorithm

In the following report I first introduce the three algorithms, then give my implementation and benchmark of each algorithm on different platforms. Finally, I draw conclusions and outline future work.

## 3. Blur Algorithms

### 3.1 2D Gaussian blur filter

The Gaussian blur filter is a widely used filter. The image is blurred using a Gaussian curve:

![curve](image/GauCurve.png)

#### 3.1.1 Naive Implementation

The simplest way to implement it is a convolution kernel: we build a 2D matrix of normalized, pre-calculated Gaussian function values and use them as weights, then take a weighted sum of all neighboring pixels to calculate each pixel's new (filtered) value. Here is a 7x7 convolution kernel:

![7x7kernel](image/7x7.png)

This is obviously expensive: for every pixel, the hardware samples the texture 7x7 = 49 times, so the per-pixel complexity is *O(n^2)* (where n is the kernel width).

#### 3.1.2 Separable Horizontal / Vertical

The convolution kernel can be expressed as the outer product of two vectors:

![separable](image/separable.png)

Filtering an M-by-N image with a P-by-Q filter kernel requires roughly MNPQ multiplies and adds. If the kernel is separable, we can filter the image in two passes:

- The first pass requires about MNP multiplies and adds.
- The second requires about MNQ multiplies and adds, for a total of MN(P + Q).

So the per-pixel complexity drops to *O(n)*.
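This separability can be verified numerically: a 2D Gaussian kernel is the outer product of two 1D Gaussians, so a horizontal pass followed by a vertical pass reproduces the full 2D convolution exactly. Here is a minimal Python sketch (function and variable names are mine, not from the project):

```python
import numpy as np

def gauss1d(size, sigma):
    # normalized 1D Gaussian weights centred on the kernel middle
    x = np.arange(size) - size // 2
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def conv2d(img, kernel):
    # brute-force 'same' 2D convolution with zero padding
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

k1 = gauss1d(7, 2.0)
k2 = np.outer(k1, k1)  # 7x7 kernel = outer product of two 1D kernels

img = np.random.rand(32, 32)
full = conv2d(img, k2)                                      # 49 taps per pixel
separable = conv2d(conv2d(img, k1[None, :]), k1[:, None])   # 7 + 7 taps per pixel
```

Both results agree to floating-point precision, while the separable version does 14 taps per pixel instead of 49.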
The computational advantage of separable over nonseparable convolution is therefore $PQ/(P+Q)$.

#### 3.1.3 Separable With GPU Sampling

We can take advantage of fixed-function GPU hardware, namely samplers, which can load two (in our case) neighboring pixel values and return an interpolated result based on the provided texture coordinates, all for approximately the cost of one texture read. This roughly halves the number of shader instructions (it halves the number of sampling instructions while slightly increasing the number of arithmetic instructions) and can increase performance by up to a factor of two (less if we are memory-bandwidth limited). Here is the image of 4 texture samples used to get a 7-sample separable filter, with arrows marking offsets:

![gpuSeparable](image/gpuSeparable.png)

### 3.2 "Kawase blur" filter

This algorithm was introduced in the GDC 2003 presentation "Frame Buffer Postprocessing Effects in DOUBLE-S.T.E.A.L (Wreckless)" by Masaki Kawase. It is a multi-pass filter where each pass uses the results of the previous one and applies a small amount of blur, accumulating enough to approximate a Gaussian distribution. In each pass, the new pixel value is the average of 16 samples in a rectangular pattern at a variable distance from the pixel center. The number of passes and the sampling distance from the center vary depending on the expected result. The following example is a 5-pass Kawase blur with 0, 1, 2, 2, 3 kernels, matching the output of a 35x35 Gaussian blur kernel. Marked in blue is 1 sample covering 4 pixels.

![kawase blur](image/Kawase.png)

Here is the flow chart of the algorithm, using the 5-pass Kawase blur above as an example:

![FlowChart](image/kawaseFlow.jpg)

This approach fully utilizes the GPU's *bilinear* sampler hardware to fetch four pixels with one sample call (4 samples / 16 pixel values per pass).
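To make the bilinear trick concrete, a single Kawase pass can be sketched on the CPU in Python. The key is that an offset of 0.5 + d places each sample on a pixel corner, so one bilinear fetch averages a 2x2 block (these helpers are illustrative, not the project's shader code):

```python
import numpy as np

def bilinear(img, y, x):
    # bilinear fetch with clamp-to-edge addressing; pixel centres at integer coords
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    fy, fx = y - y0, x - x0
    y0c, y1c = np.clip([y0, y0 + 1], 0, h - 1)
    x0c, x1c = np.clip([x0, x0 + 1], 0, w - 1)
    top = (1 - fx) * img[y0c, x0c] + fx * img[y0c, x1c]
    bot = (1 - fx) * img[y1c, x0c] + fx * img[y1c, x1c]
    return (1 - fy) * top + fy * bot

def kawase_pass(img, d):
    # one pass: average 4 corner samples; each bilinear fetch covers a 2x2 block,
    # so 4 fetches gather 16 pixel values per output pixel
    h, w = img.shape
    o = 0.5 + d
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = 0.25 * (bilinear(img, y - o, x - o) + bilinear(img, y - o, x + o) +
                                bilinear(img, y + o, x - o) + bilinear(img, y + o, x + o))
    return out

# chaining passes with growing distances (e.g. 0, 1, 2, 2, 3) widens the blur
blurred = kawase_pass(kawase_pass(np.random.rand(16, 16), 0), 1)
```

Running the five distances from the 35x35 example above in sequence mirrors the flow chart: each pass reads the previous pass's output.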
The number of passes needed to closely match a specific Gaussian kernel size grows less than linearly with the Gaussian kernel size.

### 3.3 Moving averages compute shader

The moving-averages blur is a simple, linear-time blur filter (O(1) per pixel, independent of kernel size), obtained by applying multiple passes of a "moving averages" box filter. Moving averages does one pass over each whole column (or row), iterating over its entire length:

1. For the first kernel-size elements, a running sum is built and the averages are written to the output.
2. For each new element, the value of the element no longer covered by the kernel is subtracted from the sum, and the value entering on the right is added.
3. The sum divided by the kernel size is then written to the output.

The following image shows a 'moving' kernel. It moves through each pixel and calculates the result:

![Moving Kernel](image/moving.png)

It is an interesting approach because the kernel size has *no effect* on the complexity: the total cost is *O(n)*, where n is the number of pixels in the image.

## 4. Implementation

I implemented the three algorithms in the Unity game engine, using Shader and Compute Shader technology, and built them for both Windows (desktop) and Android (mobile).
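Before turning to the Unity version, the three steps of the scan in section 3.3 can be prototyped on the CPU. A minimal Python sketch of one row (illustrative names, separate from the compute shader below):

```python
import numpy as np

def moving_average_row(row, kernel_size):
    # sliding-window box average: O(1) work per element regardless of kernel size;
    # out-of-range indices clamp to the border value, as in the compute shader
    half = kernel_size // 2
    n = len(row)
    clamped = lambda i: row[min(max(i, 0), n - 1)]

    # step 1: prime the running sum over the first window
    total = sum(clamped(i) for i in range(-half, half + 1))

    out = np.empty(n)
    for x in range(n):
        # step 3: write the averaged sum
        out[x] = total / kernel_size
        # step 2: slide the window one element to the right
        total -= clamped(x - half)
        total += clamped(x + half + 1)
    return out

row = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = moving_average_row(row, 3)  # run per row, then per column, for a 2D box blur
```

Applying this horizontally and then vertically gives one box-blur pass; repeated passes approximate a Gaussian.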
I implemented the blur shaders as post-processing effects. Here is the base class of the post-processing, including a default downsample pass:

```csharp
public class BlurEffectBase : MonoBehaviour
{
    public bool UseDownSample;
    [ShowIf("UseDownSample")] public int DownSampleScale;
    [ShowIf("UseDownSample")] public Material DownSampleMaterial;
    [ShowIf("UseDownSample")] public int DownSamplePass;

    protected void Downsample1OverScale(RenderTexture sourceRT, out RenderTexture downSampleRT)
    {
        downSampleRT = RenderTexture.GetTemporary(
            (sourceRT.width + DownSampleScale) / DownSampleScale,
            (sourceRT.height + DownSampleScale) / DownSampleScale);
        downSampleRT.enableRandomWrite = true;
        downSampleRT.Create();
        if (DownSampleMaterial != null)
        {
            Graphics.Blit(sourceRT, downSampleRT, DownSampleMaterial, DownSamplePass);
        }
        else
        {
            Graphics.Blit(sourceRT, downSampleRT);
        }
    }
}
```

### 4.1 Separable Gaussian blur Shader Generator

In order to generate the separable Gaussian blur shader code automatically, I wrote a shader code generator:

```csharp
List<float> GenerateSeparableGaussKernel(double sigma, int kernelSize)
{
    if ((kernelSize % 2) != 1)
    {
        Debug.Assert(false); // kernel size must be an odd number
        return new List<float>();
    }

    int halfKernelSize = kernelSize / 2;
    List<float> kernel = new List<float>(kernelSize);
    kernel.AddRange(Enumerable.Repeat(0f, kernelSize));

    const double cPI = 3.14159265358979323846;
    double mean = halfKernelSize;
    double sum = 0.0;
    for (int x = 0; x < kernelSize; ++x)
    {
        kernel[x] = (float) Math.Sqrt(
            Math.Exp(-0.5 * (Math.Pow((x - mean) / sigma, 2.0) + Math.Pow((mean) / sigma, 2.0)))
            / (2 * cPI * sigma * sigma));
        sum += kernel[x];
    }
    for (int x = 0; x < kernelSize; ++x)
        kernel[x] /= (float) sum;

    return kernel;
}

List<float> GetAppropriateSeparableGauss(int kernelSize)
{
    if ((kernelSize % 2) != 1)
    {
        Debug.Assert(false); // kernel size must be an odd number
        return new List<float>();
    }

    // Search for a sigma that covers the whole kernel size with sensible values
    // (might not be ideal for all cases quality-wise but is good enough for
    // performance testing)
    double epsilon = 2e-2f / kernelSize;
    double searchStep = 1.0;
    double sigma = 1.0;
    while (true)
    {
        List<float> kernelAttempt = GenerateSeparableGaussKernel(sigma, kernelSize);
        if (kernelAttempt[0] > epsilon)
        {
            if (searchStep > 0.02)
            {
                sigma -= searchStep;
                searchStep *= 0.1;
                sigma += searchStep;
                continue;
            }
            List<float> retVal = new List<float>();
            for (int i = 0; i < kernelSize; i++)
                retVal.Add(kernelAttempt[i]);
            return retVal;
        }

        sigma += searchStep;
        if (sigma > 1000.0)
        {
            Debug.Assert(false); // not tested; prevent an infinite loop
            break;
        }
    }
    return new List<float>();
}
```

The output code:

```glsl
// automatically generated by GenerateGaussFunctionCode in GaussianBlur.h
float3 GaussianBlur( sampler2D tex0, float2 centreUV, float2 halfPixelOffset, float2 pixelOffset )
{
    float3 colOut = float3( 0, 0, 0 );

    //////////////////////////////////////////////////////
    // Kernel width 47 x 47
    //
    const int stepCount = 12;
    //
    const float gWeights[stepCount] = {
        0.08053, 0.10183, 0.08965, 0.07339, 0.05586, 0.03953,
        0.02601, 0.01592, 0.00905, 0.00479, 0.00236, 0.00108
    };
    const float gOffsets[stepCount] = {
        0.66463, 2.48858, 4.47945, 6.47033, 8.46123, 10.45216,
        12.44312, 14.43412, 16.42516, 18.41625, 20.40740, 22.39860
    };
    //////////////////////////////////////////////////////

    for( int i = 0; i < stepCount; i++ )
    {
        float2 texCoordOffset = gOffsets[i] * pixelOffset;
        float3 col = tex2D( tex0, centreUV + texCoordOffset ).xyz
                   + tex2D( tex0, centreUV - texCoordOffset ).xyz;
        colOut += gWeights[i] * col;
    }
    return colOut;
}
```

### 4.2 Kawase

First I define a dictionary to map each kernel size to its Kawase pass distances:

```csharp
public Dictionary<int, int[]> _kawasePass = new Dictionary<int, int[]>()
{
    {7,   new[] {0, 0}},
    {15,  new[] {0, 1, 1}},
    {23,  new[] {0, 1, 1, 2}},
    {35,  new[] {0, 1, 2, 2, 3}},
    {63,  new[] {0, 1, 2, 3, 4, 4, 5}},
    {127, new[] {0, 1, 2, 3, 4, 5, 7, 8, 9, 10}}
};
```

Then a for loop does the main process, ping-ponging between two render textures:

```csharp
RenderTexture rt = RenderTexture.GetTemporary(downSampleRT.width, downSampleRT.height, 0, sourceTexture.format);
RenderTexture tmp;
for (int i = 0; i < _kawasePass[Test.KernelSizes].Length; ++i)
{
    BlurMaterial.SetFloat("_FilterPass", _kawasePass[Test.KernelSizes][i]);
    Graphics.Blit(downSampleRT, rt, BlurMaterial);
    tmp = downSampleRT;
    downSampleRT = rt;
    rt = tmp;
}
```

In the shader, the GPU processes every pixel with the following code, matching the flow chart in 3.2:

```glsl
half4 KawaseBlur(sampler2D tex, float2 uv, float2 texelSize, float filterPass)
{
    float2 o = texelSize.xy / 2 + filterPass * texelSize.xy;
    half4 s;
    s  = tex2D(tex, uv + float2(-o.x, -o.y));
    s += tex2D(tex, uv + float2(-o.x,  o.y));
    s += tex2D(tex, uv + float2( o.x,  o.y));
    s += tex2D(tex, uv + float2( o.x, -o.y));
    return s * (1.0 / 4.0);
}
```

### 4.3 Moving Average

Moving Average uses a compute shader. Compute shaders are programs that run on the graphics card, outside of the normal rendering pipeline. They can be used for massively parallel GPGPU algorithms, or to accelerate parts of game rendering.
Here is the driver code in C#:

```csharp
var cRTScreenSizeI = new int[] {sourceTexture.width, sourceTexture.height, rtW, rtH};
MovingAveragesCS.SetInts("cRTScreenSizeI", cRTScreenSizeI);
MovingAveragesCS.SetInt("cKernelSize", Test.KernelSizes);
MovingAveragesCS.SetInt("cKernelHalfDist", Test.KernelSizes / 2);
MovingAveragesCS.SetFloat("recKernelSize", 1.0f / Test.KernelSizes);

// Horizontal
{
    MovingAveragesCS.SetTexture(computeH, "RT0", downSampleRT);
    MovingAveragesCS.SetTexture(computeH, "RT1", rt1);
    MovingAveragesCS.Dispatch(computeH, (rtH + 32 - 1) / 32, 1, 1);

    MovingAveragesCS.SetTexture(computeH, "RT1", downSampleRT);
    MovingAveragesCS.SetTexture(computeH, "RT0", rt1);
    MovingAveragesCS.Dispatch(computeH, (rtH + 32 - 1) / 32, 1, 1);
}

// Vertical
{
    MovingAveragesCS.SetTexture(computeV, "RT0", downSampleRT);
    MovingAveragesCS.SetTexture(computeV, "RT1", rt1);
    MovingAveragesCS.Dispatch(computeV, (rtW + 32 - 1) / 32, 1, 1);

    MovingAveragesCS.SetTexture(computeV, "RT1", downSampleRT);
    MovingAveragesCS.SetTexture(computeV, "RT0", rt1);
    MovingAveragesCS.Dispatch(computeV, (rtW + 32 - 1) / 32, 1, 1);
}
```

The syntax in a compute shader differs from a common vertex/pixel shader. Here is the horizontal pass; for the vertical pass, just swap (y, x) in the code:

```glsl
[numthreads(32,1,1)]
void ComputeH (uint3 id : SV_DispatchThreadID)
{
    int y = id.x;

    // avoid processing rows that are outside the texture dimensions
    if (y >= cRTScreenSizeI.w)
        return;

    // prime the running sum with a clamped window around the first pixel
    float3 colourSum = RT0[int2(0, y)].xyz * float(cKernelHalfDist);
    for (int x = 0; x <= cKernelHalfDist; x++)
        colourSum += RT0[int2(x, y)].xyz;

    for (int x = 0; x < cRTScreenSizeI.z; x++)
    {
        RT1[int2(x, y)] = float4(colourSum * recKernelSize, 1.0);

        // move the window to the next pixel
        float3 leftBorder  = RT0[int2(max(x - cKernelHalfDist, 0), y)].xyz;
        float3 rightBorder = RT0[int2(min(x + cKernelHalfDist + 1, cRTScreenSizeI.z - 1), y)].xyz;
        colourSum -= leftBorder;
        colourSum += rightBorder;
    }
}
```

## 5. Result & Benchmark

I compared these algorithms to see both the effect and the performance:

![Compare](image/gauVSKaw.png)

The left image is the Gaussian blur and the right one is the Kawase blur, both using a 35x35 kernel. The Gaussian blur runs at only *3.5 fps*, while the Kawase blur runs at *101 fps*, very smoothly. The two pictures look almost the same, which shows that the Kawase blur filter can achieve a comparable effect at a much lower cost.

I wrote some code to benchmark each algorithm on different platforms. The approach is simple:

1. Switch through every blur effect algorithm
2. Calculate the average time over 20 frames

I tested these approaches on several platforms:

- a laptop with integrated graphics (Intel UHD 620)
- a laptop with discrete graphics (GTX 1050)
- a mobile phone (Android, Qualcomm Snapdragon 820)

The full project can be seen in this repository. I wrote some scripts to generate test items and automatically switch the effect under test. Here is the image of the benchmark items to test:

![benchmark](image/benchmarkItem.png)

Here is the final output of the specified benchmark items:

![result](image/result.png)

The text at the bottom-left is the resolution of the platform; the text on the right is the output log of each test item. Here are my benchmark results:

*Standard and Separable Gaussian Blur (with 766\*431 resolution):*

![std&sep](image/standard&sep.png)

*Kawase Blur:*

![kawase](image/kawaseTest.png)

*Moving Average:*

![movAvg](image/movAvg.png)

## 6. Conclusion & Future Work

In this final project, I studied and implemented three blur algorithms and tested each effect on different platforms. I will integrate this code into my project to optimize the performance of the post-process effects. Although the last two algorithms only approximate a Gaussian distribution, I have not yet given quantifiable data on how similar they are to the reference image (the Gaussian filter). I can use an IQA (Image Quality Assessment) approach to answer that question.
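As a first step toward such a measurement, a simple full-reference metric like PSNR can quantify the difference between each blur's output and the Gaussian reference. A minimal Python helper (my own sketch, not part of the project; SSIM would be a stronger IQA choice):

```python
import numpy as np

def psnr(reference, test, peak=1.0):
    # peak signal-to-noise ratio in dB; higher means closer to the reference
    ref = np.asarray(reference, dtype=float)
    out = np.asarray(test, dtype=float)
    mse = np.mean((ref - out) ** 2)
    if mse == 0.0:
        return float('inf')  # identical images
    return 10.0 * np.log10(peak * peak / mse)
```

Comparing, say, the 5-pass Kawase output against the 35x35 Gaussian output with this metric would turn the "looks almost the same" claim in section 5 into a number.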