# tinytensor

**Repository Path**: zhengankun/tinytensor

## Basic Information

- **Project Name**: tinytensor
- **Description**: A tensor library implemented in C++
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 3
- **Forks**: 1
- **Created**: 2025-07-05
- **Last Updated**: 2026-04-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TinyTensor

A lightweight, NumPy-like tensor library for C++ and Python with **CUDA GPU support**, **stream management**, and **zero-copy tensor views**.

## Features

- **NumPy-like API**: `tt.array()`, `tt.zeros()`, `tt.ones()`, broadcasting, reductions, `@` operator
- **CUDA Support**: Device-to-device transfers, kernel dispatch, stream management
- **CuPy-like Stream API**: `Stream()`, `with stream:`, `stream.record()`, `Event`, `stream.wait_event()`
- **Device Context Manager**: `with tt.Device(0):` for multi-GPU programming
- **Tensor Views**: Zero-copy `slice()`, `as_strided()`, `reshape()`
- **Custom dtypes**: Extensible type system with auto-registered kernels
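
To illustrate what "zero-copy" means for the view operations listed above, here is a minimal pure-Python sketch of a strided view over a shared flat buffer. The `StridedView` class and `contiguous_strides` helper are hypothetical names used only for illustration; they are not TinyTensor's `TensorView` implementation.

```python
def contiguous_strides(shape):
    """Row-major strides in elements, e.g. (3, 4) -> (4, 1)."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


class StridedView:
    """Hypothetical zero-copy view: shape, strides and offset over shared storage."""

    def __init__(self, buffer, shape, strides=None, offset=0):
        self.buffer = buffer  # shared storage, never copied
        self.shape = tuple(shape)
        self.strides = tuple(strides) if strides is not None else contiguous_strides(shape)
        self.offset = offset

    def __getitem__(self, index):
        # Map a multi-dimensional index to a position in the flat buffer.
        flat = self.offset + sum(i * s for i, s in zip(index, self.strides))
        return self.buffer[flat]

    def slice_rows(self, start, stop):
        """View rows [start, stop) by moving the offset only."""
        shape = (stop - start,) + self.shape[1:]
        return StridedView(self.buffer, shape, self.strides,
                           self.offset + start * self.strides[0])

    def transpose(self):
        """Reverse the axes by reversing shape and strides."""
        return StridedView(self.buffer, self.shape[::-1], self.strides[::-1], self.offset)


data = list(range(12))
base = StridedView(data, (3, 4))
tail = base.slice_rows(1, 3)  # rows 1 and 2, no copy
data[4] = 99                  # a write to the storage is visible through the view
print(tail[0, 0], base.transpose()[1, 2])
```

Because a view only changes the shape, strides, and offset, slicing and transposing cost the same regardless of tensor size.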

## Quick Start

### Python

```python
import tinytensor as tt
import numpy as np

# Create tensors
a = tt.array([[1.0, 2.0], [3.0, 4.0]], dtype=tt.float32)
b = tt.array([[5.0, 6.0], [7.0, 8.0]], dtype=tt.float32)

# Arithmetic
c = a + b * 2.0

# Matrix multiply (@ operator)
a2d = tt.array(np.random.rand(3, 4).astype(np.float32))
b2d = tt.array(np.random.rand(4, 2).astype(np.float32))
c2d = a2d @ b2d  # (3, 2)

# Batch matrix multiply
a3d = tt.array(np.random.rand(2, 3, 4).astype(np.float32))
b3d = tt.array(np.random.rand(2, 4, 5).astype(np.float32))
c3d = a3d @ b3d  # (2, 3, 5)

# Reductions
total = a.sum()
row_sum = a.sum(axis=1)

# GPU computing
if tt.is_cuda_available():
    with tt.Device(0):
        a_gpu = tt.array(np.random.rand(1000, 1000).astype(np.float32))
        b_gpu = tt.array(np.random.rand(1000, 1000).astype(np.float32))
        c_gpu = a_gpu @ b_gpu
        c_cpu = c_gpu.to('cpu').to_numpy()
```
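
The shape comments on the `@` examples above follow the usual batch matmul rule: matrices are paired along the leading axis and each pair is multiplied independently. The following pure-Python sketch on nested lists shows that rule. The helper names are hypothetical, and this is a reference for the semantics only, not how TinyTensor computes the product.

```python
def matmul_2d(a, b):
    """Multiply an (m, k) nested list by a (k, n) nested list."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def batch_matmul(a, b):
    """Apply matmul_2d to each pair of matrices along the leading batch axis."""
    if len(a) != len(b):
        raise ValueError("batch sizes do not match")
    return [matmul_2d(x, y) for x, y in zip(a, b)]


def shape_of(nested):
    """Shape of a rectangular nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


a3d = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]  # shape (2, 3, 4)
b3d = [[[2.0] * 5 for _ in range(4)] for _ in range(2)]  # shape (2, 4, 5)
c3d = batch_matmul(a3d, b3d)
print(shape_of(c3d))  # (2, 3, 5)
```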

### C++

```cpp
#include "tinytensor.h"
using namespace tinytensor;

int main() {
    // Zero- and one-filled tensors
    Tensor a = Tensor::Zeros(kFloat32, Shape{3, 4});
    Tensor b = Tensor::Ones(kFloat32, Shape{4, 2});

    // Broadcasting: the scalar is applied to every element
    Tensor c = a + 1.0f;

    // Batch matmul: (2, 3, 4) x (2, 4, 5) -> (2, 3, 5)
    Tensor x(kFloat32, Shape{2, 3, 4});
    Tensor y(kFloat32, Shape{2, 4, 5});
    Tensor z = DISPATCH_BINARY(bmm, x, y);

    // GPU: allocate on the device, fill, then copy back to the host
    if (IsCudaAvailable()) {
        Tensor a_gpu(kFloat32, Shape{100, 100}, Device::CUDA(0));
        a_gpu.fill(1.0f);
        Tensor a_cpu = a_gpu.to(Device::CPU());
    }
    return 0;
}
```

## Installation

### From source

```bash
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DTINYTENSOR_BUILD_PYTHON=ON -DTINYTENSOR_USE_CUDA=ON -DPython_EXECUTABLE="$(which python3)"
make -j$(nproc)
```

### pip install

```bash
pip install .
```

## API Reference

### Tensor Creation

| Function | Description |
|----------|-------------|
| `tt.array(data, dtype)` | Create from lists or numpy arrays |
| `tt.zeros(shape, dtype)` | Zero-filled tensor |
| `tt.ones(shape, dtype)` | One-filled tensor |
| `tt.empty(shape, dtype)` | Uninitialized tensor |
| `tt.arange(start, stop, step)` | Range tensor |
| `tt.linspace(start, stop, num)` | Linearly spaced values |
| `tt.eye(n, m)` | Identity matrix |

### Tensor Properties

| Property | Description |
|----------|-------------|
| `t.shape` | Shape as tuple |
| `t.ndim` | Number of dimensions |
| `t.size` | Total number of elements |
| `t.dtype` | Data type constant |
| `t.device` | Device string (e.g. `"cpu"`, `"cuda:0"`) |

### Operations

| Operation | Description |
|-----------|-------------|
| `a + b`, `a - b`, `a * b`, `a / b` | Element-wise arithmetic |
| `a @ b` | Matrix multiply / dot / batch matmul |
| `a.sum(axis)`, `a.mean(axis)` | Reductions |
| `a.min()`, `a.max()` | Min/max |
| `tt.exp(a)` | Exponential |
| `a == b`, `a < b`, `a > b` | Comparisons |
| `a.reshape(shape)` | Reshape |
| `a.flatten()` | Flatten to 1D |
| `a.transpose(axes)` | Transpose |
| `tt.concatenate([a, b], axis)` | Concatenate |
| `tt.stack([a, b], axis)` | Stack |
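
The element-wise operations above support broadcasting. Assuming TinyTensor follows NumPy's broadcasting rule (the README describes the API as NumPy-like but does not spell the rule out), the result shape can be computed as in this sketch. `broadcast_shape` is a hypothetical helper, not part of the library.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Result shape of broadcasting shapes a and b under NumPy's rule."""
    result = []
    # Align the shapes at their trailing axes; missing leading axes count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(result))


print(broadcast_shape((3, 4), ()))         # tensor with a scalar
print(broadcast_shape((2, 1, 4), (3, 1)))  # size-1 axes are stretched
```

Two axes are compatible when they are equal or when one of them is 1; any other pair raises an error.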

### Device Management

| Function | Description |
|----------|-------------|
| `tt.is_cuda_available()` | Check CUDA availability |
| `tt.get_cuda_device_count()` | Number of GPUs |
| `tt.set_current_device('cuda:0')` | Set default device |
| `tt.get_current_device()` | Get current device string |

### Stream Management (CuPy-like)

| API | Description |
|-----|-------------|
| `tt.Stream()` | Create stream on current device |
| `tt.Stream(non_blocking=True)` | Non-blocking stream |
| `with stream:` | Context manager — use stream for operations in block |
| `stream.use()` | **(deprecated)** Set as current stream |
| `stream.synchronize()` | Wait for stream completion |
| `stream.done` | Check if stream is idle |
| `stream.record()` | Record an `Event` on this stream |
| `stream.wait_event(event)` | Wait for event before executing |
| `stream.wait(other)` | Wait for another stream |
| `tt.null_stream()` | Get default stream singleton |
| `tt.synchronize()` | Synchronize current stream |

### Events

| API | Description |
|-----|-------------|
| `tt.Event()` | Create event on current device |
| `event.record()` | Record on current stream |
| `event.wait(stream)` | Make stream wait for event |
| `event.done` | Check if event has been recorded |
| `event.synchronize()` | Block CPU until event is recorded |
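
The stream and event calls in the two tables above can be combined to order work across streams. The sketch below makes a consumer stream wait for a producer stream. It uses only calls listed in this README, and assumes that `stream.record()` returns the `Event` it recorded, as CuPy's API does. The function `two_stream_pipeline` is a hypothetical wrapper that returns `None` when TinyTensor, NumPy, or CUDA is unavailable.

```python
import importlib.util


def two_stream_pipeline(n=256):
    """Run a producer/consumer pair of streams; return None if it cannot run here."""
    # Skip cleanly when TinyTensor or NumPy is not installed.
    if any(importlib.util.find_spec(m) is None for m in ("tinytensor", "numpy")):
        return None
    import numpy as np
    import tinytensor as tt

    if not tt.is_cuda_available():
        return None

    with tt.Device(0):
        producer = tt.Stream(non_blocking=True)
        consumer = tt.Stream(non_blocking=True)

        with producer:
            a = tt.array(np.ones((n, n), dtype=np.float32))
            b = a @ a
            # Assumption: record() returns the Event it recorded, as in CuPy.
            ready = producer.record()

        # The consumer stream must not start until the matmul has finished.
        consumer.wait_event(ready)
        with consumer:
            c = b + 1.0

        # Block the host until the consumer stream is idle, then copy back.
        consumer.synchronize()
        return c.to('cpu').to_numpy()


result = two_stream_pipeline(8)
```

Waiting on an event orders the two streams without blocking the host; only the final `synchronize()` stalls the CPU.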

### Multi-device Example

```python
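import numpy as np
import tinytensor as tt

# Requires at least two CUDA devices. Each stream is created on the
# device that is current when tt.Stream() is called.
with tt.Device(0):
    stream0 = tt.Stream()

with tt.Device(1):
    stream1 = tt.Stream()

# Queue work on device 0 using stream0
with tt.Device(0), stream0:
    data0 = tt.array(np.ones((100, 100), dtype=np.float32))
    result0 = data0 @ data0

# Queue work on device 1 using stream1
with tt.Device(1), stream1:
    data1 = tt.array(np.ones((100, 100), dtype=np.float32))
    result1 = data1 @ data1

# Wait for both streams to finish
stream0.synchronize()
stream1.synchronize()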
```

## Supported dtypes

| Constant | Python | NumPy dtype |
|----------|--------|-------------|
| `tt.float32` | float32 | `np.float32` |
| `tt.float64` | float64 | `np.float64` |
| `tt.int32` | int32 | `np.int32` |
| `tt.int64` | int64 | `np.int64` |
| `tt.bool_` | bool | `np.bool_` |
| `tt.float16` | float16 | `np.float16` |

## CUDA Kernel Registration

All element-wise operations (add, sub, mul, div, comparisons) and reductions (sum, mean, min, max) are registered through the `DEFINE_AND_REGISTER_BINARY`/`REDUCE` macros, which register both a CPU kernel and a CUDA kernel for each operation. The CUDA variant registered this way is a fallback: it copies the data to the CPU, computes there, then copies the result back to the device. Custom CUDA kernels can be added in `src/view_cuda.cu`.
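
The registration scheme described above can be pictured as a table keyed by `(op, dtype, device)`, the same key named for `kernel_registry.h` in the project structure. The following Python sketch is a conceptual model only: `KERNELS`, `register_binary`, and `dispatch` are hypothetical names, and the "device copy" is simulated with plain list copies rather than real CUDA transfers.

```python
KERNELS = {}  # (op, dtype, device) -> callable


def register_binary(op, dtype, fn):
    """Register a CPU kernel plus a CUDA fallback that reuses it (hypothetical helper)."""
    KERNELS[(op, dtype, "cpu")] = fn

    def cuda_fallback(a, b):
        # Stand-in for: copy device -> host, compute on the CPU, copy host -> device.
        host_a, host_b = list(a), list(b)
        return fn(host_a, host_b)

    KERNELS[(op, dtype, "cuda")] = cuda_fallback


def dispatch(op, dtype, device, *args):
    """Look up the kernel for this exact key and call it."""
    try:
        kernel = KERNELS[(op, dtype, device)]
    except KeyError:
        raise NotImplementedError(f"no kernel for {(op, dtype, device)}") from None
    return kernel(*args)


register_binary("add", "float32", lambda a, b: [x + y for x, y in zip(a, b)])
print(dispatch("add", "float32", "cuda", [1.0, 2.0], [3.0, 4.0]))
```

In this model, a native GPU kernel would simply overwrite the `"cuda"` entry for its key, leaving the CPU entry and all callers unchanged.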

## Project Structure

```
tinytensor/
├── include/
│   ├── tensor.h          # Tensor class (shape/strides as NdArray)
│   ├── tensor_view.h     # TensorView (zero-copy)
│   ├── device.h          # Device, StreamHandle, Event
│   ├── kernel_registry.h # (op, dtype, device) kernel dispatch
│   └── common.h          # Types, Shape, Strides aliases
├── src/
│   ├── core.cpp          # Tensor constructors, methods
│   ├── ops.cpp           # Element-wise ops + bmm + @ operator
│   ├── device.cpp        # HostMalloc, DeviceMalloc, streams
│   ├── view_cuda.cu      # CUDA strided copy, contiguous kernels
│   └── shape_utils.cpp   # Shape/Strides <-> Tensor conversion
├── python/
│   ├── tensor.cpp        # pybind11 bindings
│   └── tinytensor/       # Python package
├── pytest/
│   └── test_cuda_stream.py  # 37 tests (device, stream, bmm, numpy)
└── test/
    ├── test_basic.cpp    # 16 C++ unit tests
    ├── test_extension.cpp # 17 custom dtype tests
    └── test_cuda.cpp     # 18 CUDA tests
```

## Testing

```bash
# C++ tests
./build/bin/tinytensor_tests
./build/bin/extension_tests
./build/bin/cuda_tests

# Python tests (CUDA required for stream/bmm tests)
PYTHONPATH=build/python python3 -m pytest pytest/test_cuda_stream.py -v
PYTHONPATH=build/python python3 -m pytest pytest/test_numpy_compat.py -v
```

## License

MIT License