# tinytensor

**Repository Path**: zhengankun/tinytensor

## Basic Information

- **Project Name**: tinytensor
- **Description**: A tensor library implemented in C++
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 3
- **Forks**: 1
- **Created**: 2025-07-05
- **Last Updated**: 2026-04-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TinyTensor

A lightweight, NumPy-like tensor library for C++ and Python with **CUDA GPU support**, **stream management**, and **zero-copy tensor views**.

## Features

- **NumPy-like API**: `tt.array()`, `tt.zeros()`, `tt.ones()`, broadcasting, reductions, `@` operator
- **CUDA Support**: Device-to-device transfers, kernel dispatch, stream management
- **CuPy-like Stream API**: `Stream()`, `with stream:`, `stream.record()`, `Event`, `stream.wait_event()`
- **Device Context Manager**: `with tt.Device(0):` for multi-GPU programming
- **Tensor Views**: Zero-copy `slice()`, `as_strided()`, `reshape()`
- **Custom dtypes**: Extensible type system with auto-registered kernels

## Quick Start

### Python

```python
import tinytensor as tt
import numpy as np

# Create tensors
a = tt.array([[1.0, 2.0], [3.0, 4.0]], dtype=tt.float32)
b = tt.array([[5.0, 6.0], [7.0, 8.0]], dtype=tt.float32)

# Arithmetic
c = a + b * 2.0

# Matrix multiply (@ operator)
a2d = tt.array(np.random.rand(3, 4).astype(np.float32))
b2d = tt.array(np.random.rand(4, 2).astype(np.float32))
c2d = a2d @ b2d  # (3, 2)

# Batch matrix multiply
a3d = tt.array(np.random.rand(2, 3, 4).astype(np.float32))
b3d = tt.array(np.random.rand(2, 4, 5).astype(np.float32))
c3d = a3d @ b3d  # (2, 3, 5)

# Reductions
total = a.sum()
row_sum = a.sum(axis=1)

# GPU computing
if tt.is_cuda_available():
    with tt.Device(0):
        a_gpu = tt.array(np.random.rand(1000, 1000).astype(np.float32))
        b_gpu = tt.array(np.random.rand(1000, 1000).astype(np.float32))
        c_gpu = a_gpu @ b_gpu
        c_cpu = c_gpu.to('cpu').to_numpy()
```
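
The shape comments in the matrix-multiply examples above follow the usual NumPy rule for `@`. As a minimal sketch, assuming TinyTensor applies that rule to 2-D operands and to N-D operands with identical batch dimensions (the only cases this README shows), the result shape can be worked out in plain Python. `matmul_result_shape` is a hypothetical helper written for illustration; it is not part of the `tinytensor` package.

```python
def matmul_result_shape(a_shape, b_shape):
    """Return the shape of `a @ b` under NumPy-style matmul rules.

    Hypothetical helper, not a tinytensor API. It covers only the cases
    shown in the Quick Start: operands with ndim >= 2 whose leading
    (batch) dimensions are identical.
    """
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("this sketch only covers operands with ndim >= 2")

    # Split each shape into leading batch dims and the trailing matrix dims.
    *a_batch, m, k = a_shape
    *b_batch, k2, n = b_shape

    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    if a_batch != b_batch:
        raise ValueError(f"batch dimensions differ: {a_batch} vs {b_batch}")

    # Batch dims are carried through; (m, k) @ (k, n) -> (m, n).
    return (*a_batch, m, n)


if __name__ == "__main__":
    print(matmul_result_shape((3, 4), (4, 2)))        # (3, 2)
    print(matmul_result_shape((2, 3, 4), (2, 4, 5)))  # (2, 3, 5)
```

The two calls mirror `a2d @ b2d` and `a3d @ b3d` from the Quick Start. Whether TinyTensor also broadcasts mismatched batch dimensions is not stated here, so the sketch rejects them rather than guessing.
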
### C++

```cpp
#include "tinytensor.h"

using namespace tinytensor;

int main() {
    Tensor a = Tensor::Zeros(kFloat32, Shape{3, 4});
    Tensor b = Tensor::Ones(kFloat32, Shape{4, 2});

    // Broadcasting: the scalar is applied to every element
    Tensor c = a + 1.0f;

    // Batch matmul: (2, 3, 4) x (2, 4, 5) -> (2, 3, 5)
    Tensor x(kFloat32, Shape{2, 3, 4});
    Tensor y(kFloat32, Shape{2, 4, 5});
    Tensor z = DISPATCH_BINARY(bmm, x, y);

    // GPU: allocate on device 0, fill, then copy back to the host
    if (IsCudaAvailable()) {
        Tensor a_gpu(kFloat32, Shape{100, 100}, Device::CUDA(0));
        a_gpu.fill(1.0f);
        Tensor a_cpu = a_gpu.to(Device::CPU());
    }

    return 0;
}
```

## Installation

### From source

```bash
mkdir build && cd build
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DTINYTENSOR_BUILD_PYTHON=ON \
    -DTINYTENSOR_USE_CUDA=ON \
    -DPython_EXECUTABLE="$(which python3)"
make -j"$(nproc)"
```

### pip install

```bash
pip install .
```

## API Reference

### Tensor Creation

| Function | Description |
|----------|-------------|
| `tt.array(data, dtype)` | Create from lists or NumPy arrays |
| `tt.zeros(shape, dtype)` | Zero-filled tensor |
| `tt.ones(shape, dtype)` | One-filled tensor |
| `tt.empty(shape, dtype)` | Uninitialized tensor |
| `tt.arange(start, stop, step)` | Range tensor |
| `tt.linspace(start, stop, num)` | Linearly spaced values |
| `tt.eye(n, m)` | Identity matrix |

### Tensor Properties

| Property | Description |
|----------|-------------|
| `t.shape` | Shape as tuple |
| `t.ndim` | Number of dimensions |
| `t.size` | Total number of elements |
| `t.dtype` | Data type constant |
| `t.device` | Device string (e.g. `"cpu"`, `"cuda:0"`) |

### Operations

| Operation | Description |
|-----------|-------------|
| `a + b`, `a - b`, `a * b`, `a / b` | Element-wise arithmetic |
| `a @ b` | Matrix multiply / dot / batch matmul |
| `a.sum(axis)`, `a.mean(axis)` | Reductions |
| `a.min()`, `a.max()` | Min/max |
| `tt.exp(a)` | Exponential |
| `a == b`, `a < b`, `a > b` | Comparisons |
| `a.reshape(shape)` | Reshape |
| `a.flatten()` | Flatten to 1D |
| `a.transpose(axes)` | Transpose |
| `tt.concatenate([a, b], axis)` | Concatenate |
| `tt.stack([a, b], axis)` | Stack |

### Device Management

| Function | Description |
|----------|-------------|
| `tt.is_cuda_available()` | Check CUDA availability |
| `tt.get_cuda_device_count()` | Number of GPUs |
| `tt.set_current_device('cuda:0')` | Set default device |
| `tt.get_current_device()` | Get current device string |

### Stream Management (CuPy-like)

| API | Description |
|-----|-------------|
| `tt.Stream()` | Create stream on current device |
| `tt.Stream(non_blocking=True)` | Non-blocking stream |
| `with stream:` | Context manager: use the stream for operations in the block |
| `stream.use()` | **(deprecated)** Set as current stream |
| `stream.synchronize()` | Wait for stream completion |
| `stream.done` | Check if stream is idle |
| `stream.record()` | Record an `Event` on this stream |
| `stream.wait_event(event)` | Wait for event before executing |
| `stream.wait(other)` | Wait for another stream |
| `tt.null_stream()` | Get default stream singleton |
| `tt.synchronize()` | Synchronize current stream |

### Events

| API | Description |
|-----|-------------|
| `tt.Event()` | Create event on current device |
| `event.record()` | Record on current stream |
| `event.wait(stream)` | Make stream wait for event |
| `event.done` | Check if event has been recorded |
| `event.synchronize()` | Block CPU until event is recorded |

### Multi-device Example

```python
import numpy as np
import tinytensor as tt

# Requires at least two CUDA devices.
stream0 = None
stream1 = None

# Create one stream per device
with tt.Device(0):
    stream0 = tt.Stream()
with tt.Device(1):
    stream1 = tt.Stream()

# Launch work on device 0 using its stream
with tt.Device(0), stream0:
    data0 = tt.array(np.ones((100, 100), dtype=np.float32))
    result0 = data0 @ data0

# Launch work on device 1 using its stream
with tt.Device(1), stream1:
    data1 = tt.array(np.ones((100, 100), dtype=np.float32))
    result1 = data1 @ data1

# Wait for both streams to finish
stream0.synchronize()
stream1.synchronize()
```

## Supported dtypes

| Constant | Python | NumPy dtype |
|----------|--------|-------------|
| `tt.float32` | float32 | `np.float32` |
| `tt.float64` | float64 | `np.float64` |
| `tt.int32` | int32 | `np.int32` |
| `tt.int64` | int64 | `np.int64` |
| `tt.bool_` | bool | `np.bool_` |
| `tt.float16` | float16 | `np.float16` |

## CUDA Kernel Registration

All element-wise operations (add, sub, mul, div, comparisons) and reductions (sum, mean, min, max) are registered via `DEFINE_AND_REGISTER_BINARY`/`REDUCE` macros, which auto-register both CPU and CUDA kernels. The CUDA variant copies data to CPU, computes, then copies back. Custom CUDA kernels can be added in `src/view_cuda.cu`.

## Project Structure

```
tinytensor/
├── include/
│   ├── tensor.h            # Tensor class (shape/strides as NdArray)
│   ├── tensor_view.h       # TensorView (zero-copy)
│   ├── device.h            # Device, StreamHandle, Event
│   ├── kernel_registry.h   # (op, dtype, device) kernel dispatch
│   └── common.h            # Types, Shape, Strides aliases
├── src/
│   ├── core.cpp            # Tensor constructors, methods
│   ├── ops.cpp             # Element-wise ops + bmm + @ operator
│   ├── device.cpp          # HostMalloc, DeviceMalloc, streams
│   ├── view_cuda.cu        # CUDA strided copy, contiguous kernels
│   └── shape_utils.cpp     # Shape/Strides <-> Tensor conversion
├── python/
│   ├── tensor.cpp          # pybind11 bindings
│   └── tinytensor/         # Python package
├── pytest/
│   └── test_cuda_stream.py # 37 tests (device, stream, bmm, numpy)
└── test/
    ├── test_basic.cpp      # 16 C++ unit tests
    ├── test_extension.cpp  # 17 custom dtype tests
    └── test_cuda.cpp       # 18 CUDA tests
```

## Testing

```bash
# C++ tests
./build/bin/tinytensor_tests
./build/bin/extension_tests
./build/bin/cuda_tests

# Python tests (CUDA required for stream/bmm tests)
PYTHONPATH=build/python python3 -m pytest pytest/test_cuda_stream.py -v
PYTHONPATH=build/python python3 -m pytest pytest/test_numpy_compat.py -v
```

## License

MIT License