diff --git a/src/lib/models/networks/DCNv2/.gitignore b/src/lib/models/networks/DCNv2/.gitignore
index b1e9421..5891c61 100644
--- a/src/lib/models/networks/DCNv2/.gitignore
+++ b/src/lib/models/networks/DCNv2/.gitignore
@@ -3,4 +3,9 @@
*.so
*.o
*pyc
-_ext
\ No newline at end of file
+_ext
+build
+DCNv2.egg-info
+dist
+vendor/
+
diff --git a/src/lib/models/networks/DCNv2/LICENSE b/src/lib/models/networks/DCNv2/LICENSE
index cd31b28..b2e3b52 100644
--- a/src/lib/models/networks/DCNv2/LICENSE
+++ b/src/lib/models/networks/DCNv2/LICENSE
@@ -26,4 +26,4 @@ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file
diff --git a/src/lib/models/networks/DCNv2/README.md b/src/lib/models/networks/DCNv2/README.md
index 0ddcf18..0c0ea79 100644
--- a/src/lib/models/networks/DCNv2/README.md
+++ b/src/lib/models/networks/DCNv2/README.md
@@ -1,60 +1,24 @@
-## Deformable Convolutional Networks V2 with Pytorch
-
-### Build
-```bash
- ./make.sh # build
- python test.py # run examples and gradient check
-```
-
-### An Example
-- deformable conv
-```python
- from dcn_v2 import DCN
- input = torch.randn(2, 64, 128, 128).cuda()
- # wrap all things (offset and mask) in DCN
- dcn = DCN(64, 64, kernel_size=(3,3), stride=1, padding=1, deformable_groups=2).cuda()
- output = dcn(input)
- print(output.shape)
-```
-- deformable roi pooling
-```python
- from dcn_v2 import DCNPooling
- input = torch.randn(2, 32, 64, 64).cuda()
- batch_inds = torch.randint(2, (20, 1)).cuda().float()
- x = torch.randint(256, (20, 1)).cuda().float()
- y = torch.randint(256, (20, 1)).cuda().float()
- w = torch.randint(64, (20, 1)).cuda().float()
- h = torch.randint(64, (20, 1)).cuda().float()
- rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)
-
- # mdformable pooling (V2)
- # wrap all things (offset and mask) in DCNPooling
- dpooling = DCNPooling(spatial_scale=1.0 / 4,
- pooled_size=7,
- output_dim=32,
- no_trans=False,
- group_size=1,
- trans_std=0.1).cuda()
-
- dout = dpooling(input, rois)
-```
-
-### Known Issues:
-
-- [x] Gradient check w.r.t offset (solved)
-- [ ] Backward is not reentrant (minor)
-
-This is an adaption of the official [Deformable-ConvNets](https://github.com/msracver/Deformable-ConvNets/tree/master/DCNv2_op).
-
-<s>I have ran the gradient check for many times with DOUBLE type. Every tensor **except offset** passes.
-However, when I set the offset to 0.5, it passes. I'm still wondering what cause this problem. Is it because some
-non-differential points? </s>
-
-Update: all gradient check passes with double precision.
-
-Another issue is that it raises `RuntimeError: Backward is not reentrant`. However, the error is very small (`<1e-7` for
-float `<1e-15` for double),
-so it may not be a serious problem (?)
-
-Please post an issue or PR if you have any comments.
-
\ No newline at end of file
+# DCNv2 latest
+
+
+
+DCN is used in many models and performs well, but support for this op outside research code (PyTorch releases, ONNX, TensorRT, etc.) is still limited. This repo makes DCNv2 build and run across PyTorch versions.
+
+
+
+Inference with the CenterNet-DLA model has been verified on PyTorch 1.7, so you can use it on RTX 30 series cards.
+
+
+
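+## Quick example
+
+A minimal usage sketch, adapted from the example in the previous README and from how `DeformConv` instantiates `DCN` in `pose_dla_dcn.py` (it assumes the extension has been built and a CUDA device is available):
+
+```python
+import torch
+
+from dcn_v2 import DCN
+
+input = torch.randn(2, 64, 128, 128).cuda()
+# DCN predicts the offset and mask internally from the input feature map
+dcn = DCN(64, 64, kernel_size=(3, 3), stride=1, padding=1, dilation=1, deformable_groups=1).cuda()
+output = dcn(input)
+print(output.shape)  # torch.Size([2, 64, 128, 128])
+```
+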
+## Updates
+
+- **2021.03.24**: PyTorch 1.8 is confirmed to work with the master branch; feel free to use it.
+- **2021.02.18**: Happy new year! PyTorch 1.7 is finally supported on the master branch! **Lower versions should also work in theory; if not, please file an issue.**
+- **2020.09.23**: The master branch now targets PyTorch 1.6 by default; for older versions you will need a separate branch.
+- **2020.08.25**: Check out the pytorch1.6 branch for PyTorch 1.6 support; you will hit an error like `THCudaBlas_Sgemv undefined` if you build the master branch with PyTorch 1.6. The master branch currently works with PyTorch 1.5 (see the build commands below).
+
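+To build the extension against your installed PyTorch, run the same commands as `make.sh`:
+
+```bash
+sudo rm *.so          # clean artifacts from a previous build
+sudo rm -r build/
+python3 setup.py build develop
+```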
+
+
+## Contact
+
+If you have any questions, please post them on this platform: http://t.manaai.cn
diff --git a/src/lib/models/networks/DCNv2/dcn_v2.py b/src/lib/models/networks/DCNv2/dcn_v2.py
index e1bb700..14a2813 100644
--- a/src/lib/models/networks/DCNv2/dcn_v2.py
+++ b/src/lib/models/networks/DCNv2/dcn_v2.py
@@ -1,27 +1,121 @@
#!/usr/bin/env python
-from __future__ import absolute_import
-from __future__ import print_function
-from __future__ import division
+from __future__ import absolute_import, division, print_function
+
-import torch
import math
+
+import torch
from torch import nn
+from torch.autograd import Function
+from torch.autograd.function import once_differentiable
from torch.nn.modules.utils import _pair
-from .dcn_v2_func import DCNv2Function
-from .dcn_v2_func import DCNv2PoolingFunction
+import _ext as _backend
-class DCNv2(nn.Module):
- def __init__(self, in_channels, out_channels,
- kernel_size, stride, padding, dilation=1, deformable_groups=1):
+class _DCNv2(Function):
+ @staticmethod
+ def forward(
+ ctx, input, weight, offset, bias, stride, padding, dilation, deformable_groups
+ ):
+ ctx.stride = _pair(stride)
+ ctx.padding = _pair(padding)
+ ctx.dilation = _pair(dilation)
+ ctx.kernel_size = _pair(weight.shape[2:4])
+ ctx.deformable_groups = deformable_groups
+ #import pdb;pdb.set_trace()
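+        # NOTE: the 27-channel tensor passed in as `offset` is laid out as
+        # [offset_x, offset_y, mask] (see DCN.forward below); the hardcoded
+        # 18:27 slice therefore assumes a 3x3 kernel with deformable_groups=1.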
+ mask = offset[:, 18:27, :, :]
+ output = _backend.dcn_v2_forward(
+ input,
+ weight,
+ bias,
+ offset,
+ mask,
+ ctx.kernel_size[0],
+ ctx.kernel_size[1],
+ ctx.stride[0],
+ ctx.stride[1],
+ ctx.padding[0],
+ ctx.padding[1],
+ ctx.dilation[0],
+ ctx.dilation[1],
+ ctx.deformable_groups,
+ )
+ ctx.save_for_backward(input, offset, mask, weight, bias)
+ return output
+
+ @staticmethod
+ @once_differentiable
+ def backward(ctx, grad_output):
+ input, offset, mask, weight, bias = ctx.saved_tensors
+ grad_input, grad_offset, grad_mask, grad_weight, grad_bias = _backend.dcn_v2_backward(
+ input,
+ weight,
+ bias,
+ offset,
+ mask,
+ grad_output,
+ ctx.kernel_size[0],
+ ctx.kernel_size[1],
+ ctx.stride[0],
+ ctx.stride[1],
+ ctx.padding[0],
+ ctx.padding[1],
+ ctx.dilation[0],
+ ctx.dilation[1],
+ ctx.deformable_groups,
+ )
+
+ return grad_input, grad_offset, grad_mask, grad_weight, grad_bias, None, None, None, None
+
+ @staticmethod
+ # 0730: add groups
+ def symbolic(
+ g, input, weight, offset, bias, strides, pads, dilation, deformable_groups
+ ):
+ from torch.nn.modules.utils import _pair
+
+ strides = _pair(strides)
+ pads = _pair(pads)
+ dilation = _pair(dilation)
+        # As of TensorRT 7, the DCN op is translated again by post-processing the
+        # ONNX file, so the export code here is kept similar to forward()
+ return g.op(
+ "DeformableConv2D",
+ input,
+ weight,
+ offset,
+ bias,
+ strides_i=strides,
+ pads_i=pads,
+ dilations_i=dilation,
+ groups_i=1,
+ #groups_i=groups
+ deformable_groups_i=deformable_groups,
+ )
+
+
+dcn_v2_conv = _DCNv2.apply
+
+
+class DCNv2(nn.Module):
+ def __init__(
+ self,
+ in_channels,
+ out_channels,
+ kernel_size,
+ stride,
+ padding,
+ dilation=1,
+ deformable_groups=1,
+ ):
super(DCNv2, self).__init__()
self.in_channels = in_channels
self.out_channels = out_channels
self.kernel_size = _pair(kernel_size)
- self.stride = stride
- self.padding = padding
- self.dilation = dilation
+ self.stride = _pair(stride)
+ self.padding = _pair(padding)
+ self.dilation = _pair(dilation)
self.deformable_groups = deformable_groups
self.weight = nn.Parameter(torch.Tensor(out_channels, in_channels, *self.kernel_size))
@@ -32,29 +126,52 @@ class DCNv2(nn.Module):
n = self.in_channels
for k in self.kernel_size:
n *= k
- stdv = 1. / math.sqrt(n)
+ stdv = 1.0 / math.sqrt(n)
self.weight.data.uniform_(-stdv, stdv)
self.bias.data.zero_()
def forward(self, input, offset, mask):
- func = DCNv2Function(self.stride, self.padding, self.dilation, self.deformable_groups)
- return func(input, offset, mask, self.weight, self.bias)
+ assert (
+ 2 * self.deformable_groups * self.kernel_size[0] * self.kernel_size[1]
+ == offset.shape[1]
+ )
+ assert self.deformable_groups * self.kernel_size[0] * self.kernel_size[1] == mask.shape[1]
+ return dcn_v2_conv(
+ input,
+ self.weight,
+ offset,
+ self.bias,
+ self.stride,
+ self.padding,
+ self.dilation,
+ self.deformable_groups,
+ )
class DCN(DCNv2):
+ def __init__(
+ self,
+ in_channels,
+ out_channels,
+ kernel_size,
+ stride,
+ padding,
+ dilation=1,
+ deformable_groups=1,
+ ):
+ super(DCN, self).__init__(
+ in_channels, out_channels, kernel_size, stride, padding, dilation, deformable_groups
+ )
- def __init__(self, in_channels, out_channels,
- kernel_size, stride, padding,
- dilation=1, deformable_groups=1):
- super(DCN, self).__init__(in_channels, out_channels,
- kernel_size, stride, padding, dilation, deformable_groups)
-
- self.conv_offset_mask = nn.Conv2d(self.in_channels,
- self.deformable_groups * 3 * self.kernel_size[0] * self.kernel_size[1],
- kernel_size=self.kernel_size,
- stride=(self.stride, self.stride),
- padding=(self.padding, self.padding),
- bias=True)
+ channels_ = self.deformable_groups * 3 * self.kernel_size[0] * self.kernel_size[1]
+ self.conv_offset_mask = nn.Conv2d(
+ self.in_channels,
+ channels_,
+ kernel_size=self.kernel_size,
+ stride=self.stride,
+ padding=self.padding,
+ bias=True,
+ )
self.init_offset()
def init_offset(self):
@@ -66,21 +183,104 @@ class DCN(DCNv2):
o1, o2, mask = torch.chunk(out, 3, dim=1)
offset = torch.cat((o1, o2), dim=1)
mask = torch.sigmoid(mask)
- func = DCNv2Function(self.stride, self.padding, self.dilation, self.deformable_groups)
- return func(input, offset, mask, self.weight, self.bias)
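+        # Regroup the predicted offsets: treat the 18 offset channels as
+        # interleaved (y, x) pairs, split them into an x block and a y block,
+        # and append the mask so the layout matches what _DCNv2.forward expects.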
+        offset_x = offset.reshape(
+            offset.shape[0], -1, 2, offset.shape[2].numpy(), offset.shape[3].numpy()
+        )[:, :, 1, ...].reshape(
+            offset.shape[0], offset.shape[1].numpy() // 2, offset.shape[2], offset.shape[3].numpy()
+        )
+        offset_y = offset.reshape(
+            offset.shape[0], -1, 2, offset.shape[2].numpy(), offset.shape[3].numpy()
+        )[:, :, 0, ...].reshape(
+            offset.shape[0], offset.shape[1].numpy() // 2, offset.shape[2], offset.shape[3].numpy()
+        )
+        offset = torch.cat((offset_x, offset_y, mask), 1)
+ return dcn_v2_conv(
+ input,
+ self.weight,
+ offset,
+ self.bias,
+ self.stride,
+ self.padding,
+ self.dilation,
+ self.deformable_groups,
+ )
-class DCNv2Pooling(nn.Module):
+class _DCNv2Pooling(Function):
+ @staticmethod
+ def forward(
+ ctx,
+ input,
+ rois,
+ offset,
+ spatial_scale,
+ pooled_size,
+ output_dim,
+ no_trans,
+ group_size=1,
+ part_size=None,
+ sample_per_part=4,
+ trans_std=0.0,
+ ):
+ ctx.spatial_scale = spatial_scale
+ ctx.no_trans = int(no_trans)
+ ctx.output_dim = output_dim
+ ctx.group_size = group_size
+ ctx.pooled_size = pooled_size
+ ctx.part_size = pooled_size if part_size is None else part_size
+ ctx.sample_per_part = sample_per_part
+ ctx.trans_std = trans_std
+
+ output, output_count = _backend.dcn_v2_psroi_pooling_forward(
+ input,
+ rois,
+ offset,
+ ctx.no_trans,
+ ctx.spatial_scale,
+ ctx.output_dim,
+ ctx.group_size,
+ ctx.pooled_size,
+ ctx.part_size,
+ ctx.sample_per_part,
+ ctx.trans_std,
+ )
+ ctx.save_for_backward(input, rois, offset, output_count)
+ return output
+
+ @staticmethod
+ @once_differentiable
+ def backward(ctx, grad_output):
+ input, rois, offset, output_count = ctx.saved_tensors
+ grad_input, grad_offset = _backend.dcn_v2_psroi_pooling_backward(
+ grad_output,
+ input,
+ rois,
+ offset,
+ output_count,
+ ctx.no_trans,
+ ctx.spatial_scale,
+ ctx.output_dim,
+ ctx.group_size,
+ ctx.pooled_size,
+ ctx.part_size,
+ ctx.sample_per_part,
+ ctx.trans_std,
+ )
+
+ return grad_input, None, grad_offset, None, None, None, None, None, None, None, None
- def __init__(self,
- spatial_scale,
- pooled_size,
- output_dim,
- no_trans,
- group_size=1,
- part_size=None,
- sample_per_part=4,
- trans_std=.0):
+
+dcn_v2_pooling = _DCNv2Pooling.apply
+
+
+class DCNv2Pooling(nn.Module):
+ def __init__(
+ self,
+ spatial_scale,
+ pooled_size,
+ output_dim,
+ no_trans,
+ group_size=1,
+ part_size=None,
+ sample_per_part=4,
+ trans_std=0.0,
+ ):
super(DCNv2Pooling, self).__init__()
self.spatial_scale = spatial_scale
self.pooled_size = pooled_size
@@ -90,82 +290,121 @@ class DCNv2Pooling(nn.Module):
self.part_size = pooled_size if part_size is None else part_size
self.sample_per_part = sample_per_part
self.trans_std = trans_std
- self.func = DCNv2PoolingFunction(self.spatial_scale,
- self.pooled_size,
- self.output_dim,
- self.no_trans,
- self.group_size,
- self.part_size,
- self.sample_per_part,
- self.trans_std)
-
- def forward(self, data, rois, offset):
+ def forward(self, input, rois, offset):
+ assert input.shape[1] == self.output_dim
if self.no_trans:
- offset = data.new()
- return self.func(data, rois, offset)
+ offset = input.new()
+ return dcn_v2_pooling(
+ input,
+ rois,
+ offset,
+ self.spatial_scale,
+ self.pooled_size,
+ self.output_dim,
+ self.no_trans,
+ self.group_size,
+ self.part_size,
+ self.sample_per_part,
+ self.trans_std,
+ )
-class DCNPooling(DCNv2Pooling):
- def __init__(self,
- spatial_scale,
- pooled_size,
- output_dim,
- no_trans,
- group_size=1,
- part_size=None,
- sample_per_part=4,
- trans_std=.0,
- deform_fc_dim=1024):
- super(DCNPooling, self).__init__(spatial_scale,
- pooled_size,
- output_dim,
- no_trans,
- group_size,
- part_size,
- sample_per_part,
- trans_std)
+class DCNPooling(DCNv2Pooling):
+ def __init__(
+ self,
+ spatial_scale,
+ pooled_size,
+ output_dim,
+ no_trans,
+ group_size=1,
+ part_size=None,
+ sample_per_part=4,
+ trans_std=0.0,
+ deform_fc_dim=1024,
+ ):
+ super(DCNPooling, self).__init__(
+ spatial_scale,
+ pooled_size,
+ output_dim,
+ no_trans,
+ group_size,
+ part_size,
+ sample_per_part,
+ trans_std,
+ )
self.deform_fc_dim = deform_fc_dim
if not no_trans:
- self.func_offset = DCNv2PoolingFunction(self.spatial_scale,
- self.pooled_size,
- self.output_dim,
- True,
- self.group_size,
- self.part_size,
- self.sample_per_part,
- self.trans_std)
- self.offset_fc = nn.Sequential(
- nn.Linear(self.pooled_size * self.pooled_size * self.output_dim, self.deform_fc_dim),
+ self.offset_mask_fc = nn.Sequential(
+ nn.Linear(
+ self.pooled_size * self.pooled_size * self.output_dim, self.deform_fc_dim
+ ),
nn.ReLU(inplace=True),
nn.Linear(self.deform_fc_dim, self.deform_fc_dim),
nn.ReLU(inplace=True),
- nn.Linear(self.deform_fc_dim, self.pooled_size * self.pooled_size * 2)
- )
- self.offset_fc[4].weight.data.zero_()
- self.offset_fc[4].bias.data.zero_()
- self.mask_fc = nn.Sequential(
- nn.Linear(self.pooled_size * self.pooled_size * self.output_dim, self.deform_fc_dim),
- nn.ReLU(inplace=True),
- nn.Linear(self.deform_fc_dim, self.pooled_size * self.pooled_size * 1),
- nn.Sigmoid()
+ nn.Linear(self.deform_fc_dim, self.pooled_size * self.pooled_size * 3),
)
- self.mask_fc[2].weight.data.zero_()
- self.mask_fc[2].bias.data.zero_()
+ self.offset_mask_fc[4].weight.data.zero_()
+ self.offset_mask_fc[4].bias.data.zero_()
- def forward(self, data, rois):
- if self.no_trans:
- offset = data.new()
- else:
+ def forward(self, input, rois):
+ offset = input.new()
+
+ if not self.no_trans:
+
+ # do roi_align first
n = rois.shape[0]
- offset = data.new()
- x = self.func_offset(data, rois, offset)
- offset = self.offset_fc(x.view(n, -1))
- offset = offset.view(n, 2, self.pooled_size, self.pooled_size)
- mask = self.mask_fc(x.view(n, -1))
- mask = mask.view(n, 1, self.pooled_size, self.pooled_size)
- feat = self.func(data, rois, offset) * mask
- return feat
- return self.func(data, rois, offset)
+ roi = dcn_v2_pooling(
+ input,
+ rois,
+ offset,
+ self.spatial_scale,
+ self.pooled_size,
+ self.output_dim,
+ True, # no trans
+ self.group_size,
+ self.part_size,
+ self.sample_per_part,
+ self.trans_std,
+ )
+
+ # build mask and offset
+ offset_mask = self.offset_mask_fc(roi.view(n, -1))
+ offset_mask = offset_mask.view(n, 3, self.pooled_size, self.pooled_size)
+ o1, o2, mask = torch.chunk(offset_mask, 3, dim=1)
+ offset = torch.cat((o1, o2), dim=1)
+ mask = torch.sigmoid(mask)
+
+ # do pooling with offset and mask
+ return (
+ dcn_v2_pooling(
+ input,
+ rois,
+ offset,
+ self.spatial_scale,
+ self.pooled_size,
+ self.output_dim,
+ self.no_trans,
+ self.group_size,
+ self.part_size,
+ self.sample_per_part,
+ self.trans_std,
+ )
+ * mask
+ )
+ # only roi_align
+ return dcn_v2_pooling(
+ input,
+ rois,
+ offset,
+ self.spatial_scale,
+ self.pooled_size,
+ self.output_dim,
+ self.no_trans,
+ self.group_size,
+ self.part_size,
+ self.sample_per_part,
+ self.trans_std,
+ )
diff --git a/src/lib/models/networks/DCNv2/make.sh b/src/lib/models/networks/DCNv2/make.sh
index d489f7c..0ed7e53 100755
--- a/src/lib/models/networks/DCNv2/make.sh
+++ b/src/lib/models/networks/DCNv2/make.sh
@@ -1,14 +1,4 @@
#!/usr/bin/env bash
-cd src/cuda
-
-# compile dcn
-nvcc -c -o dcn_v2_im2col_cuda.cu.o dcn_v2_im2col_cuda.cu -x cu -Xcompiler -fPIC
-nvcc -c -o dcn_v2_im2col_cuda_double.cu.o dcn_v2_im2col_cuda_double.cu -x cu -Xcompiler -fPIC
-
-# compile dcn-roi-pooling
-nvcc -c -o dcn_v2_psroi_pooling_cuda.cu.o dcn_v2_psroi_pooling_cuda.cu -x cu -Xcompiler -fPIC
-nvcc -c -o dcn_v2_psroi_pooling_cuda_double.cu.o dcn_v2_psroi_pooling_cuda_double.cu -x cu -Xcompiler -fPIC
-
-cd -
-python build.py
-python build_double.py
+sudo rm *.so
+sudo rm -r build/
+python3 setup.py build develop
diff --git a/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.cu b/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.cu
index ab22b1b..4140eac 100644
--- a/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.cu
+++ b/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.cu
@@ -3,6 +3,13 @@
#include <algorithm>
#include <cstring>
+#include <ATen/ATen.h>
+#include <ATen/cuda/CUDAContext.h>
+
+#include <THC/THC.h>
+#include <THC/THCAtomics.cuh>
+#include <THC/THCDeviceUtils.cuh>
+
#define CUDA_KERNEL_LOOP(i, n) \
for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
i < (n); \
@@ -15,7 +22,7 @@ inline int GET_BLOCKS(const int N)
}
-__device__ float dmcn_im2col_bilinear(const float *bottom_data, const int data_width,
+__device__ float dmcn_im2col_bilinear_cuda(const float *bottom_data, const int data_width,
const int height, const int width, float h, float w)
{
int h_low = floor(h);
@@ -46,7 +53,7 @@ __device__ float dmcn_im2col_bilinear(const float *bottom_data, const int data_w
return val;
}
-__device__ float dmcn_get_gradient_weight(float argmax_h, float argmax_w,
+__device__ float dmcn_get_gradient_weight_cuda(float argmax_h, float argmax_w,
const int h, const int w, const int height, const int width)
{
if (argmax_h <= -1 || argmax_h >= height || argmax_w <= -1 || argmax_w >= width)
@@ -72,7 +79,7 @@ __device__ float dmcn_get_gradient_weight(float argmax_h, float argmax_w,
return weight;
}
-__device__ float dmcn_get_coordinate_weight(float argmax_h, float argmax_w,
+__device__ float dmcn_get_coordinate_weight_cuda(float argmax_h, float argmax_w,
const int height, const int width, const float *im_data,
const int data_width, const int bp_dir)
{
@@ -126,13 +133,20 @@ __global__ void modulated_deformable_im2col_gpu_kernel(const int n,
const int height_col, const int width_col,
float *data_col)
{
+ // launch channels * batch_size * height_col * width_col cores
CUDA_KERNEL_LOOP(index, n)
{
+ // NOTE(CharlesShang): different from Dai Jifeng's MXNet implementation, col_buffer is of shape (c*kw*kh, N, oh, ow)
+ // here columns is of shape (N, c*kw*kh, oh * ow), need to adapt axis
+
// index index of output matrix
const int w_col = index % width_col;
const int h_col = (index / width_col) % height_col;
- const int b_col = (index / width_col / height_col) % batch_size;
- const int c_im = (index / width_col / height_col) / batch_size;
+ // const int b_col = (index / width_col / height_col) % batch_size;
+ const int b_col = (index / width_col / height_col / num_channels) % batch_size;
+ // const int c_im = (index / width_col / height_col) / batch_size;
+ const int c_im = (index / width_col / height_col) % num_channels;
+ // const int c_col = c_im * kernel_h * kernel_w;
const int c_col = c_im * kernel_h * kernel_w;
// compute deformable group index
@@ -141,7 +155,8 @@ __global__ void modulated_deformable_im2col_gpu_kernel(const int n,
const int h_in = h_col * stride_h - pad_h;
const int w_in = w_col * stride_w - pad_w;
- float *data_col_ptr = data_col + ((c_col * batch_size + b_col) * height_col + h_col) * width_col + w_col;
+ // float *data_col_ptr = data_col + ((c_col * batch_size + b_col) * height_col + h_col) * width_col + w_col;
+ float *data_col_ptr = data_col + ((b_col * num_channels * kernel_w * kernel_h + c_col) * height_col + h_col) * width_col + w_col;
//const float* data_im_ptr = data_im + ((b_col * num_channels + c_im) * height + h_in) * width + w_in;
const float *data_im_ptr = data_im + (b_col * num_channels + c_im) * height * width;
const float *data_offset_ptr = data_offset + (b_col * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;
@@ -168,12 +183,12 @@ __global__ void modulated_deformable_im2col_gpu_kernel(const int n,
//const float map_w = j * dilation_w + offset_w;
//const int cur_height = height - h_in;
//const int cur_width = width - w_in;
- //val = dmcn_im2col_bilinear(data_im_ptr, width, cur_height, cur_width, map_h, map_w);
- val = dmcn_im2col_bilinear(data_im_ptr, width, height, width, h_im, w_im);
+ //val = dmcn_im2col_bilinear_cuda(data_im_ptr, width, cur_height, cur_width, map_h, map_w);
+ val = dmcn_im2col_bilinear_cuda(data_im_ptr, width, height, width, h_im, w_im);
}
*data_col_ptr = val * mask;
- data_col_ptr += batch_size * height_col * width_col;
- //data_col_ptr += height_col * width_col;
+ // data_col_ptr += batch_size * height_col * width_col;
+ data_col_ptr += height_col * width_col;
}
}
}
@@ -230,7 +245,7 @@ __global__ void modulated_deformable_col2im_gpu_kernel(const int n,
abs(cur_inv_w_data - (cur_w + dx)) < 1)
{
int cur_bottom_grad_pos = ((b * channels + c) * height + cur_h + dy) * width + cur_w + dx;
- float weight = dmcn_get_gradient_weight(cur_inv_h_data, cur_inv_w_data, cur_h + dy, cur_w + dx, height, width);
+ float weight = dmcn_get_gradient_weight_cuda(cur_inv_h_data, cur_inv_w_data, cur_h + dy, cur_w + dx, height, width);
atomicAdd(grad_im + cur_bottom_grad_pos, weight * cur_top_grad);
}
}
@@ -295,9 +310,9 @@ __global__ void modulated_deformable_col2im_coord_gpu_kernel(const int n,
}
else
{
- mval += data_col_ptr[col_pos] * dmcn_im2col_bilinear(data_im_ptr + cnt * height * width, width, height, width, inv_h, inv_w);
+ mval += data_col_ptr[col_pos] * dmcn_im2col_bilinear_cuda(data_im_ptr + cnt * height * width, width, height, width, inv_h, inv_w);
}
- const float weight = dmcn_get_coordinate_weight(
+ const float weight = dmcn_get_coordinate_weight_cuda(
inv_h, inv_w,
height, width, data_im_ptr + cnt * height * width, width, bp_dir);
val += weight * data_col_ptr[col_pos] * mask;
@@ -314,7 +329,7 @@ __global__ void modulated_deformable_col2im_coord_gpu_kernel(const int n,
void modulated_deformable_im2col_cuda(cudaStream_t stream,
const float* data_im, const float* data_offset, const float* data_mask,
const int batch_size, const int channels, const int height_im, const int width_im,
- const int height_col, const int width_col, const int kernel_h, const int kenerl_w,
+ const int height_col, const int width_col, const int kernel_h, const int kernel_w,
const int pad_h, const int pad_w, const int stride_h, const int stride_w,
const int dilation_h, const int dilation_w,
const int deformable_group, float* data_col) {
@@ -324,7 +339,7 @@ void modulated_deformable_im2col_cuda(cudaStream_t stream,
modulated_deformable_im2col_gpu_kernel
<<<GET_BLOCKS(num_kernels), CUDA_NUM_THREADS,
0, stream>>>(
- num_kernels, data_im, data_offset, data_mask, height_im, width_im, kernel_h, kenerl_w,
+ num_kernels, data_im, data_offset, data_mask, height_im, width_im, kernel_h, kernel_w,
pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w, channel_per_deformable_group,
batch_size, channels, deformable_group, height_col, width_col, data_col);
diff --git a/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.h b/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.h
index 3457e96..c856831 100644
--- a/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.h
+++ b/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_im2col_cuda.h
@@ -1,3 +1,4 @@
+
/*!
******************* BEGIN Caffe Copyright Notice and Disclaimer ****************
*
diff --git a/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_psroi_pooling_cuda.cu b/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_psroi_pooling_cuda.cu
index 295657c..0190217 100644
--- a/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_psroi_pooling_cuda.cu
+++ b/src/lib/models/networks/DCNv2/src/cuda/dcn_v2_psroi_pooling_cuda.cu
@@ -6,10 +6,18 @@
* \author Yi Li, Guodong Zhang, Jifeng Dai
*/
/***************** Adapted by Charles Shang *********************/
-#include "dcn_v2_psroi_pooling_cuda.h"
+
#include <cstdio>
#include <algorithm>
#include <cstring>
+#include <iostream>
+
+#include <ATen/ATen.h>
+#include <ATen/cuda/CUDAContext.h>
+
+#include <THC/THC.h>
+#include <THC/THCAtomics.cuh>
+#include <THC/THCDeviceUtils.cuh>
#define CUDA_KERNEL_LOOP(i, n) \
for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
@@ -22,10 +30,11 @@ inline int GET_BLOCKS(const int N)
return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS;
}
-__device__ float bilinear_interp(
- const float *data,
- const float x,
- const float y,
+template <typename T>
+__device__ T bilinear_interp_cuda(
+ const T *data,
+ const T x,
+ const T y,
const int width,
const int height)
{
@@ -33,34 +42,38 @@ __device__ float bilinear_interp(
int x2 = ceil(x);
int y1 = floor(y);
int y2 = ceil(y);
- float dist_x = (float)(x - x1);
- float dist_y = (float)(y - y1);
- float value11 = data[y1 * width + x1];
- float value12 = data[y2 * width + x1];
- float value21 = data[y1 * width + x2];
- float value22 = data[y2 * width + x2];
- float value = (1 - dist_x) * (1 - dist_y) * value11 + (1 - dist_x) * dist_y * value12 + dist_x * (1 - dist_y) * value21 + dist_x * dist_y * value22;
+ T dist_x = static_cast<T>(x - x1);
+ T dist_y = static_cast<T>(y - y1);
+ T value11 = data[y1 * width + x1];
+ T value12 = data[y2 * width + x1];
+ T value21 = data[y1 * width + x2];
+ T value22 = data[y2 * width + x2];
+ T value = (1 - dist_x) * (1 - dist_y) * value11 +
+ (1 - dist_x) * dist_y * value12 +
+ dist_x * (1 - dist_y) * value21 +
+ dist_x * dist_y * value22;
return value;
}
-__global__ void DeformablePSROIPoolForwardKernel(
+template <typename T>
+__global__ void DeformablePSROIPoolForwardKernelCuda(
const int count,
- const float *bottom_data,
- const float spatial_scale,
+ const T *bottom_data,
+ const T spatial_scale,
const int channels,
const int height, const int width,
const int pooled_height, const int pooled_width,
- const float *bottom_rois, const float *bottom_trans,
+ const T *bottom_rois, const T *bottom_trans,
const int no_trans,
- const float trans_std,
+ const T trans_std,
const int sample_per_part,
const int output_dim,
const int group_size,
const int part_size,
const int num_classes,
const int channels_each_class,
- float *top_data,
- float *top_count)
+ T *top_data,
+ T *top_count)
{
CUDA_KERNEL_LOOP(index, count)
{
@@ -71,49 +84,49 @@ __global__ void DeformablePSROIPoolForwardKernel(
int n = index / pooled_width / pooled_height / output_dim;
// [start, end) interval for spatial sampling
- const float *offset_bottom_rois = bottom_rois + n * 5;
+ const T *offset_bottom_rois = bottom_rois + n * 5;
int roi_batch_ind = offset_bottom_rois[0];
- float roi_start_w = (float)(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
- float roi_start_h = (float)(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
- float roi_end_w = (float)(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
- float roi_end_h = (float)(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
+ T roi_start_w = static_cast<T>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
+ T roi_start_h = static_cast<T>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
+ T roi_end_w = static_cast<T>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
+ T roi_end_h = static_cast<T>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
// Force too small ROIs to be 1x1
- float roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
- float roi_height = max(roi_end_h - roi_start_h, 0.1);
+ T roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
+ T roi_height = max(roi_end_h - roi_start_h, 0.1);
// Compute w and h at bottom
- float bin_size_h = roi_height / (float)(pooled_height);
- float bin_size_w = roi_width / (float)(pooled_width);
+ T bin_size_h = roi_height / static_cast<T>(pooled_height);
+ T bin_size_w = roi_width / static_cast<T>(pooled_width);
- float sub_bin_size_h = bin_size_h / (float)(sample_per_part);
- float sub_bin_size_w = bin_size_w / (float)(sample_per_part);
+ T sub_bin_size_h = bin_size_h / static_cast<T>(sample_per_part);
+ T sub_bin_size_w = bin_size_w / static_cast<T>(sample_per_part);
- int part_h = floor((float)(ph) / pooled_height * part_size);
- int part_w = floor((float)(pw) / pooled_width * part_size);
+ int part_h = floor(static_cast<T>(ph) / pooled_height * part_size);
+ int part_w = floor(static_cast<T>(pw) / pooled_width * part_size);
int class_id = ctop / channels_each_class;
- float trans_x = no_trans ? (float)(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;
- float trans_y = no_trans ? (float)(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;
+ T trans_x = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;
+ T trans_y = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;
- float wstart = (float)(pw)*bin_size_w + roi_start_w;
+ T wstart = static_cast<T>(pw) * bin_size_w + roi_start_w;
wstart += trans_x * roi_width;
- float hstart = (float)(ph)*bin_size_h + roi_start_h;
+ T hstart = static_cast<T>(ph) * bin_size_h + roi_start_h;
hstart += trans_y * roi_height;
- float sum = 0;
+ T sum = 0;
int count = 0;
- int gw = floor((float)(pw)*group_size / pooled_width);
- int gh = floor((float)(ph)*group_size / pooled_height);
+ int gw = floor(static_cast<T>(pw) * group_size / pooled_width);
+ int gh = floor(static_cast<T>(ph) * group_size / pooled_height);
gw = min(max(gw, 0), group_size - 1);
gh = min(max(gh, 0), group_size - 1);
- const float *offset_bottom_data = bottom_data + (roi_batch_ind * channels) * height * width;
+ const T *offset_bottom_data = bottom_data + (roi_batch_ind * channels) * height * width;
for (int ih = 0; ih < sample_per_part; ih++)
{
for (int iw = 0; iw < sample_per_part; iw++)
{
- float w = wstart + iw * sub_bin_size_w;
- float h = hstart + ih * sub_bin_size_h;
+ T w = wstart + iw * sub_bin_size_w;
+ T h = hstart + ih * sub_bin_size_h;
// bilinear interpolation
if (w < -0.5 || w > width - 0.5 || h < -0.5 || h > height - 0.5)
{
@@ -122,32 +135,33 @@ __global__ void DeformablePSROIPoolForwardKernel(
w = min(max(w, 0.), width - 1.);
h = min(max(h, 0.), height - 1.);
int c = (ctop * group_size + gh) * group_size + gw;
- float val = bilinear_interp(offset_bottom_data + c * height * width, w, h, width, height);
+ T val = bilinear_interp_cuda(offset_bottom_data + c * height * width, w, h, width, height);
sum += val;
count++;
}
}
- top_data[index] = count == 0 ? (float)(0) : sum / count;
+ top_data[index] = count == 0 ? static_cast<T>(0) : sum / count;
top_count[index] = count;
}
}
-__global__ void DeformablePSROIPoolBackwardAccKernel(
+template <typename T>
+__global__ void DeformablePSROIPoolBackwardAccKernelCuda(
const int count,
- const float *top_diff,
- const float *top_count,
+ const T *top_diff,
+ const T *top_count,
const int num_rois,
- const float spatial_scale,
+ const T spatial_scale,
const int channels,
const int height, const int width,
const int pooled_height, const int pooled_width,
const int output_dim,
- float *bottom_data_diff, float *bottom_trans_diff,
- const float *bottom_data,
- const float *bottom_rois,
- const float *bottom_trans,
+ T *bottom_data_diff, T *bottom_trans_diff,
+ const T *bottom_data,
+ const T *bottom_rois,
+ const T *bottom_trans,
const int no_trans,
- const float trans_std,
+ const T trans_std,
const int sample_per_part,
const int group_size,
const int part_size,
@@ -163,44 +177,44 @@ __global__ void DeformablePSROIPoolBackwardAccKernel(
int n = index / pooled_width / pooled_height / output_dim;
// [start, end) interval for spatial sampling
- const float *offset_bottom_rois = bottom_rois + n * 5;
+ const T *offset_bottom_rois = bottom_rois + n * 5;
int roi_batch_ind = offset_bottom_rois[0];
- float roi_start_w = (float)(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
- float roi_start_h = (float)(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
- float roi_end_w = (float)(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
- float roi_end_h = (float)(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
+ T roi_start_w = static_cast<T>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
+ T roi_start_h = static_cast<T>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
+ T roi_end_w = static_cast<T>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
+ T roi_end_h = static_cast<T>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
// Force too small ROIs to be 1x1
- float roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
- float roi_height = max(roi_end_h - roi_start_h, 0.1);
+ T roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
+ T roi_height = max(roi_end_h - roi_start_h, 0.1);
// Compute w and h at bottom
- float bin_size_h = roi_height / (float)(pooled_height);
- float bin_size_w = roi_width / (float)(pooled_width);
+ T bin_size_h = roi_height / static_cast<T>(pooled_height);
+ T bin_size_w = roi_width / static_cast<T>(pooled_width);
- float sub_bin_size_h = bin_size_h / (float)(sample_per_part);
- float sub_bin_size_w = bin_size_w / (float)(sample_per_part);
+ T sub_bin_size_h = bin_size_h / static_cast<T>(sample_per_part);
+ T sub_bin_size_w = bin_size_w / static_cast<T>(sample_per_part);
- int part_h = floor((float)(ph) / pooled_height * part_size);
- int part_w = floor((float)(pw) / pooled_width * part_size);
+ int part_h = floor(static_cast<T>(ph) / pooled_height * part_size);
+ int part_w = floor(static_cast<T>(pw) / pooled_width * part_size);
int class_id = ctop / channels_each_class;
- float trans_x = no_trans ? (float)(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;
- float trans_y = no_trans ? (float)(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;
+ T trans_x = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;
+ T trans_y = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;
- float wstart = (float)(pw)*bin_size_w + roi_start_w;
+ T wstart = static_cast<T>(pw) * bin_size_w + roi_start_w;
wstart += trans_x * roi_width;
- float hstart = (float)(ph)*bin_size_h + roi_start_h;
+ T hstart = static_cast<T>(ph) * bin_size_h + roi_start_h;
hstart += trans_y * roi_height;
if (top_count[index] <= 0)
{
continue;
}
- float diff_val = top_diff[index] / top_count[index];
- const float *offset_bottom_data = bottom_data + roi_batch_ind * channels * height * width;
- float *offset_bottom_data_diff = bottom_data_diff + roi_batch_ind * channels * height * width;
- int gw = floor((float)(pw)*group_size / pooled_width);
- int gh = floor((float)(ph)*group_size / pooled_height);
+ T diff_val = top_diff[index] / top_count[index];
+ const T *offset_bottom_data = bottom_data + roi_batch_ind * channels * height * width;
+ T *offset_bottom_data_diff = bottom_data_diff + roi_batch_ind * channels * height * width;
+ int gw = floor(static_cast<T>(pw) * group_size / pooled_width);
+ int gh = floor(static_cast<T>(ph) * group_size / pooled_height);
gw = min(max(gw, 0), group_size - 1);
gh = min(max(gh, 0), group_size - 1);
@@ -208,8 +222,8 @@ __global__ void DeformablePSROIPoolBackwardAccKernel(
{
for (int iw = 0; iw < sample_per_part; iw++)
{
- float w = wstart + iw * sub_bin_size_w;
- float h = hstart + ih * sub_bin_size_h;
+ T w = wstart + iw * sub_bin_size_w;
+ T h = hstart + ih * sub_bin_size_h;
// bilinear interpolation
if (w < -0.5 || w > width - 0.5 || h < -0.5 || h > height - 0.5)
{
@@ -223,11 +237,11 @@ __global__ void DeformablePSROIPoolBackwardAccKernel(
int x1 = ceil(w);
int y0 = floor(h);
int y1 = ceil(h);
- float dist_x = w - x0, dist_y = h - y0;
- float q00 = (1 - dist_x) * (1 - dist_y);
- float q01 = (1 - dist_x) * dist_y;
- float q10 = dist_x * (1 - dist_y);
- float q11 = dist_x * dist_y;
+ T dist_x = w - x0, dist_y = h - y0;
+ T q00 = (1 - dist_x) * (1 - dist_y);
+ T q01 = (1 - dist_x) * dist_y;
+ T q10 = dist_x * (1 - dist_y);
+ T q11 = dist_x * dist_y;
int bottom_index_base = c * height * width;
atomicAdd(offset_bottom_data_diff + bottom_index_base + y0 * width + x0, q00 * diff_val);
atomicAdd(offset_bottom_data_diff + bottom_index_base + y1 * width + x0, q01 * diff_val);
@@ -238,13 +252,13 @@ __global__ void DeformablePSROIPoolBackwardAccKernel(
{
continue;
}
- float U00 = offset_bottom_data[bottom_index_base + y0 * width + x0];
- float U01 = offset_bottom_data[bottom_index_base + y1 * width + x0];
- float U10 = offset_bottom_data[bottom_index_base + y0 * width + x1];
- float U11 = offset_bottom_data[bottom_index_base + y1 * width + x1];
- float diff_x = (U11 * dist_y + U10 * (1 - dist_y) - U01 * dist_y - U00 * (1 - dist_y)) * trans_std * diff_val;
+ T U00 = offset_bottom_data[bottom_index_base + y0 * width + x0];
+ T U01 = offset_bottom_data[bottom_index_base + y1 * width + x0];
+ T U10 = offset_bottom_data[bottom_index_base + y0 * width + x1];
+ T U11 = offset_bottom_data[bottom_index_base + y1 * width + x1];
+ T diff_x = (U11 * dist_y + U10 * (1 - dist_y) - U01 * dist_y - U00 * (1 - dist_y)) * trans_std * diff_val;
diff_x *= roi_width;
- float diff_y = (U11 * dist_x + U01 * (1 - dist_x) - U10 * dist_x - U00 * (1 - dist_x)) * trans_std * diff_val;
+ T diff_y = (U11 * dist_x + U01 * (1 - dist_x) - U10 * dist_x - U00 * (1 - dist_x)) * trans_std * diff_val;
diff_y *= roi_height;
atomicAdd(bottom_trans_diff + (((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w, diff_x);
@@ -254,100 +268,152 @@ __global__ void DeformablePSROIPoolBackwardAccKernel(
}
}
-void DeformablePSROIPoolForward(cudaStream_t stream,
- const float *data,
- const float *bbox,
- const float *trans,
- float *out,
- float *top_count,
- const int batch,
- const int channels,
- const int height,
- const int width,
- const int num_bbox,
- const int channels_trans,
- const int no_trans,
- const float spatial_scale,
- const int output_dim,
- const int group_size,
- const int pooled_size,
- const int part_size,
- const int sample_per_part,
- const float trans_std)
+std::tuple<at::Tensor, at::Tensor>
+dcn_v2_psroi_pooling_cuda_forward(const at::Tensor &input,
+ const at::Tensor &bbox,
+ const at::Tensor &trans,
+ const int no_trans,
+ const float spatial_scale,
+ const int output_dim,
+ const int group_size,
+ const int pooled_size,
+ const int part_size,
+ const int sample_per_part,
+ const float trans_std)
{
+ AT_ASSERTM(input.is_cuda(), "input must be a CUDA tensor");
+ AT_ASSERTM(bbox.is_cuda(), "rois must be a CUDA tensor");
+ AT_ASSERTM(trans.is_cuda(), "trans must be a CUDA tensor");
+
+ const int batch = input.size(0);
+ const int channels = input.size(1);
+ const int height = input.size(2);
+ const int width = input.size(3);
+ const int channels_trans = no_trans ? 2 : trans.size(1);
+ const int num_bbox = bbox.size(0);
- const float *bottom_data = data;
- const float *bottom_rois = bbox;
- const float *bottom_trans = no_trans ? NULL : trans;
- float *top_data = out;
- float *top_count_data = top_count;
+ AT_ASSERTM(channels == output_dim, "input channels and output channels must equal");
+ auto pooled_height = pooled_size;
+ auto pooled_width = pooled_size;
+
+ auto out = at::empty({num_bbox, output_dim, pooled_height, pooled_width}, input.options());
+ long out_size = num_bbox * output_dim * pooled_height * pooled_width;
+ auto top_count = at::zeros({num_bbox, output_dim, pooled_height, pooled_width}, input.options());
- const int pooled_height = pooled_size;
- const int pooled_width = pooled_size;
- const int count = num_bbox * output_dim * pooled_height * pooled_width;
const int num_classes = no_trans ? 1 : channels_trans / 2;
const int channels_each_class = no_trans ? output_dim : output_dim / num_classes;
- DeformablePSROIPoolForwardKernel<<<GET_BLOCKS(count), CUDA_NUM_THREADS, 0, stream>>>(
- count, bottom_data, spatial_scale, channels, height, width, pooled_height, pooled_width,
- bottom_rois, bottom_trans, no_trans, trans_std, sample_per_part, output_dim,
- group_size, part_size, num_classes, channels_each_class, top_data, top_count_data);
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
- cudaError_t err = cudaGetLastError();
- if (err != cudaSuccess)
+ if (out.numel() == 0)
{
- printf("error in DeformablePSROIPoolForward: %s\n", cudaGetErrorString(err));
+ THCudaCheck(cudaGetLastError());
+ return std::make_tuple(out, top_count);
}
+
+ dim3 grid(std::min(THCCeilDiv(out_size, 512L), 4096L));
+ dim3 block(512);
+
+ AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "dcn_v2_psroi_pooling_cuda_forward", [&] {
+ DeformablePSROIPoolForwardKernelCuda<scalar_t><<<grid, block, 0, stream>>>(
+ out_size,
+ input.contiguous().data_ptr<scalar_t>(),
+ spatial_scale,
+ channels,
+ height, width,
+ pooled_height,
+ pooled_width,
+ bbox.contiguous().data_ptr<scalar_t>(),
+ trans.contiguous().data_ptr<scalar_t>(),
+ no_trans,
+ trans_std,
+ sample_per_part,
+ output_dim,
+ group_size,
+ part_size,
+ num_classes,
+ channels_each_class,
+ out.data_ptr<scalar_t>(),
+ top_count.data_ptr<scalar_t>());
+ });
+ THCudaCheck(cudaGetLastError());
+ return std::make_tuple(out, top_count);
}
-void DeformablePSROIPoolBackwardAcc(cudaStream_t stream,
- const float *out_grad,
- const float *data,
- const float *bbox,
- const float *trans,
- const float *top_count,
- float *in_grad,
- float *trans_grad,
- const int batch,
- const int channels,
- const int height,
- const int width,
- const int num_bbox,
- const int channels_trans,
- const int no_trans,
- const float spatial_scale,
- const int output_dim,
- const int group_size,
- const int pooled_size,
- const int part_size,
- const int sample_per_part,
- const float trans_std)
+std::tuple<at::Tensor, at::Tensor>
+dcn_v2_psroi_pooling_cuda_backward(const at::Tensor &out_grad,
+ const at::Tensor &input,
+ const at::Tensor &bbox,
+ const at::Tensor &trans,
+ const at::Tensor &top_count,
+ const int no_trans,
+ const float spatial_scale,
+ const int output_dim,
+ const int group_size,
+ const int pooled_size,
+ const int part_size,
+ const int sample_per_part,
+ const float trans_std)
{
- // LOG(INFO) << "DeformablePSROIPoolBackward";
- const float *top_diff = out_grad;
- const float *bottom_data = data;
- const float *bottom_rois = bbox;
- const float *bottom_trans = no_trans ? NULL : trans;
- float *bottom_data_diff = in_grad;
- float *bottom_trans_diff = no_trans ? NULL : trans_grad;
- const float *top_count_data = top_count;
-
- const int num_rois = num_bbox;
- const int pooled_height = pooled_size;
- const int pooled_width = pooled_size;
- const int count = num_bbox * output_dim * pooled_height * pooled_width;
+ AT_ASSERTM(out_grad.is_cuda(), "out_grad must be a CUDA tensor");
+ AT_ASSERTM(input.is_cuda(), "input must be a CUDA tensor");
+ AT_ASSERTM(bbox.is_cuda(), "bbox must be a CUDA tensor");
+ AT_ASSERTM(trans.is_cuda(), "trans must be a CUDA tensor");
+ AT_ASSERTM(top_count.is_cuda(), "top_count must be a CUDA tensor");
+
+ const int batch = input.size(0);
+ const int channels = input.size(1);
+ const int height = input.size(2);
+ const int width = input.size(3);
+ const int channels_trans = no_trans ? 2 : trans.size(1);
+ const int num_bbox = bbox.size(0);
+
+ AT_ASSERTM(channels == output_dim, "input channels and output channels must equal");
+ auto pooled_height = pooled_size;
+ auto pooled_width = pooled_size;
+ long out_size = num_bbox * output_dim * pooled_height * pooled_width;
const int num_classes = no_trans ? 1 : channels_trans / 2;
const int channels_each_class = no_trans ? output_dim : output_dim / num_classes;
- DeformablePSROIPoolBackwardAccKernel<<<GET_BLOCKS(count), CUDA_NUM_THREADS, 0, stream>>>(
- count, top_diff, top_count_data, num_rois, spatial_scale, channels, height, width,
- pooled_height, pooled_width, output_dim, bottom_data_diff, bottom_trans_diff,
- bottom_data, bottom_rois, bottom_trans, no_trans, trans_std, sample_per_part,
- group_size, part_size, num_classes, channels_each_class);
+ auto input_grad = at::zeros({batch, channels, height, width}, out_grad.options());
+ auto trans_grad = at::zeros_like(trans);
- cudaError_t err = cudaGetLastError();
- if (err != cudaSuccess)
+ if (input_grad.numel() == 0)
{
- printf("error in DeformablePSROIPoolForward: %s\n", cudaGetErrorString(err));
+ THCudaCheck(cudaGetLastError());
+ return std::make_tuple(input_grad, trans_grad);
}
-}
\ No newline at end of file
+
+ dim3 grid(std::min(THCCeilDiv(out_size, 512L), 4096L));
+ dim3 block(512);
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+
+ AT_DISPATCH_FLOATING_TYPES(out_grad.scalar_type(), "dcn_v2_psroi_pooling_cuda_backward", [&] {
+ DeformablePSROIPoolBackwardAccKernelCuda<scalar_t><<<grid, block, 0, stream>>>(
+ out_size,
+ out_grad.contiguous().data_ptr<scalar_t>(),
+ top_count.contiguous().data_ptr<scalar_t>(),
+ num_bbox,
+ spatial_scale,
+ channels,
+ height,
+ width,
+ pooled_height,
+ pooled_width,
+ output_dim,
+ input_grad.contiguous().data_ptr<scalar_t>(),
+ trans_grad.contiguous().data_ptr<scalar_t>(),
+ input.contiguous().data_ptr<scalar_t>(),
+ bbox.contiguous().data_ptr<scalar_t>(),
+ trans.contiguous().data_ptr<scalar_t>(),
+ no_trans,
+ trans_std,
+ sample_per_part,
+ group_size,
+ part_size,
+ num_classes,
+ channels_each_class);
+ });
+ THCudaCheck(cudaGetLastError());
+ return std::make_tuple(input_grad, trans_grad);
+}
diff --git a/src/lib/models/networks/DCNv2/src/dcn_v2.h b/src/lib/models/networks/DCNv2/src/dcn_v2.h
index 1a97ff0..de670bf 100644
--- a/src/lib/models/networks/DCNv2/src/dcn_v2.h
+++ b/src/lib/models/networks/DCNv2/src/dcn_v2.h
@@ -1,20 +1,190 @@
-void dcn_v2_forward(THFloatTensor *input, THFloatTensor *weight,
- THFloatTensor *bias, THFloatTensor *ones,
- THFloatTensor *offset, THFloatTensor *mask,
- THFloatTensor *output, THFloatTensor *columns,
- const int pad_h, const int pad_w,
- const int stride_h, const int stride_w,
- const int dilation_h, const int dilation_w,
- const int deformable_group);
-void dcn_v2_backward(THFloatTensor *input, THFloatTensor *weight,
- THFloatTensor *bias, THFloatTensor *ones,
- THFloatTensor *offset, THFloatTensor *mask,
- THFloatTensor *output, THFloatTensor *columns,
- THFloatTensor *grad_input, THFloatTensor *grad_weight,
- THFloatTensor *grad_bias, THFloatTensor *grad_offset,
- THFloatTensor *grad_mask, THFloatTensor *grad_output,
- int kernel_h, int kernel_w,
- int stride_h, int stride_w,
- int pad_h, int pad_w,
- int dilation_h, int dilation_w,
- int deformable_group);
\ No newline at end of file
+#pragma once
+
+#include "cpu/vision.h"
+
+#ifdef WITH_CUDA
+#include "cuda/vision.h"
+#endif
+
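+// Each entry point below dispatches to the CUDA implementation when the input
+// tensor lives on the GPU (and the extension was built with WITH_CUDA), and to
+// the CPU implementation from cpu/vision.h otherwise.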
+at::Tensor
+dcn_v2_forward(const at::Tensor &input,
+ const at::Tensor &weight,
+ const at::Tensor &bias,
+ const at::Tensor &offset,
+ const at::Tensor &mask,
+ const int kernel_h,
+ const int kernel_w,
+ const int stride_h,
+ const int stride_w,
+ const int pad_h,
+ const int pad_w,
+ const int dilation_h,
+ const int dilation_w,
+ const int deformable_group)
+{
+ if (input.type().is_cuda())
+ {
+#ifdef WITH_CUDA
+ return dcn_v2_cuda_forward(input, weight, bias, offset, mask,
+ kernel_h, kernel_w,
+ stride_h, stride_w,
+ pad_h, pad_w,
+ dilation_h, dilation_w,
+ deformable_group);
+#else
+ AT_ERROR("Not compiled with GPU support");
+#endif
+ }
+ else{
+ return dcn_v2_cpu_forward(input, weight, bias, offset, mask,
+ kernel_h, kernel_w,
+ stride_h, stride_w,
+ pad_h, pad_w,
+ dilation_h, dilation_w,
+ deformable_group);
+ }
+}
+
+std::vector<at::Tensor>
+dcn_v2_backward(const at::Tensor &input,
+ const at::Tensor &weight,
+ const at::Tensor &bias,
+ const at::Tensor &offset,
+ const at::Tensor &mask,
+ const at::Tensor &grad_output,
+ int kernel_h, int kernel_w,
+ int stride_h, int stride_w,
+ int pad_h, int pad_w,
+ int dilation_h, int dilation_w,
+ int deformable_group)
+{
+ if (input.type().is_cuda())
+ {
+#ifdef WITH_CUDA
+ return dcn_v2_cuda_backward(input,
+ weight,
+ bias,
+ offset,
+ mask,
+ grad_output,
+ kernel_h, kernel_w,
+ stride_h, stride_w,
+ pad_h, pad_w,
+ dilation_h, dilation_w,
+ deformable_group);
+#else
+ AT_ERROR("Not compiled with GPU support");
+#endif
+ }
+ else{
+ return dcn_v2_cpu_backward(input,
+ weight,
+ bias,
+ offset,
+ mask,
+ grad_output,
+ kernel_h, kernel_w,
+ stride_h, stride_w,
+ pad_h, pad_w,
+ dilation_h, dilation_w,
+ deformable_group);
+ }
+}
+
+std::tuple<at::Tensor, at::Tensor>
+dcn_v2_psroi_pooling_forward(const at::Tensor &input,
+ const at::Tensor &bbox,
+ const at::Tensor &trans,
+ const int no_trans,
+ const float spatial_scale,
+ const int output_dim,
+ const int group_size,
+ const int pooled_size,
+ const int part_size,
+ const int sample_per_part,
+ const float trans_std)
+{
+ if (input.type().is_cuda())
+ {
+#ifdef WITH_CUDA
+ return dcn_v2_psroi_pooling_cuda_forward(input,
+ bbox,
+ trans,
+ no_trans,
+ spatial_scale,
+ output_dim,
+ group_size,
+ pooled_size,
+ part_size,
+ sample_per_part,
+ trans_std);
+#else
+ AT_ERROR("Not compiled with GPU support");
+#endif
+ }
+ else{
+ return dcn_v2_psroi_pooling_cpu_forward(input,
+ bbox,
+ trans,
+ no_trans,
+ spatial_scale,
+ output_dim,
+ group_size,
+ pooled_size,
+ part_size,
+ sample_per_part,
+ trans_std);
+ }
+}
+
+std::tuple<at::Tensor, at::Tensor>
+dcn_v2_psroi_pooling_backward(const at::Tensor &out_grad,
+ const at::Tensor &input,
+ const at::Tensor &bbox,
+ const at::Tensor &trans,
+ const at::Tensor &top_count,
+ const int no_trans,
+ const float spatial_scale,
+ const int output_dim,
+ const int group_size,
+ const int pooled_size,
+ const int part_size,
+ const int sample_per_part,
+ const float trans_std)
+{
+ if (input.type().is_cuda())
+ {
+#ifdef WITH_CUDA
+ return dcn_v2_psroi_pooling_cuda_backward(out_grad,
+ input,
+ bbox,
+ trans,
+ top_count,
+ no_trans,
+ spatial_scale,
+ output_dim,
+ group_size,
+ pooled_size,
+ part_size,
+ sample_per_part,
+ trans_std);
+#else
+ AT_ERROR("Not compiled with GPU support");
+#endif
+ }
+ else{
+ return dcn_v2_psroi_pooling_cpu_backward(out_grad,
+ input,
+ bbox,
+ trans,
+ top_count,
+ no_trans,
+ spatial_scale,
+ output_dim,
+ group_size,
+ pooled_size,
+ part_size,
+ sample_per_part,
+ trans_std);
+ }
+}
\ No newline at end of file
diff --git a/src/lib/models/networks/pose_dla_dcn.py b/src/lib/models/networks/pose_dla_dcn.py
index 7cb6869..8b7b15c 100644
--- a/src/lib/models/networks/pose_dla_dcn.py
+++ b/src/lib/models/networks/pose_dla_dcn.py
@@ -352,7 +352,12 @@ class DeformConv(nn.Module):
self.conv = DCN(chi, cho, kernel_size=(3,3), stride=1, padding=1, dilation=1, deformable_groups=1)
def forward(self, x):
+ #import pdb;pdb.set_trace()
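+        # force the feature map to a float32 CUDA tensor before the custom DCN op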
+ x = x.cuda().float()
+ x = x.to(torch.float32).cuda()
+ #print(x.type())
x = self.conv(x)
+ x = x.to(torch.float32).cuda()
x = self.actf(x)
return x
@@ -363,7 +368,7 @@ class IDAUp(nn.Module):
super(IDAUp, self).__init__()
for i in range(1, len(channels)):
c = channels[i]
- f = int(up_f[i])
+ f = int(up_f[i])
proj = DeformConv(c, o)
node = DeformConv(o, o)
@@ -477,6 +482,7 @@ class DLASeg(nn.Module):
self.ida_up(y, 0, len(y))
z = {}
+ print(self.heads)
for head in self.heads:
z[head] = self.__getattr__(head)(y[-1])
return [z]