# OmniStream

**Repository Path**: gitstr/OmniStream

## Basic Information

- **Project Name**: OmniStream
- **License**: MIT
- **Default Branch**: main

## Statistics

- **Created**: 2026-06-02
- **Last Updated**: 2026-06-02

## README

# OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Official implementation of **OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams**.

[*Yibin Yan*\*](https://go2heart.github.io/), [*Jilan Xu*\*](https://jazzcharles.github.io/), [*Shangzhe Di*](https://dszdsz.cn/), [*Haoning Wu*](https://haoningwu3639.github.io/), [*Weidi Xie*](https://weidixie.github.io/)

(\*: equal contribution)

Website | [arXiv](https://arxiv.org/abs/2603.12265) | [Hugging Face](https://huggingface.co/StreamFormer/OmniStream)

## TODO

- [ ] Release pre-training code.
- [ ] Release our VLM and VLA code.

## Quick Start

### Installation

```bash
git clone https://github.com/Go2Heart/OmniStream.git
cd OmniStream
conda create -n omnistream python=3.10 -y
conda activate omnistream
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.56.1
```

### Pre-trained Model Usage

Our pre-trained model is available on [🤗 Hugging Face](https://huggingface.co/StreamFormer/OmniStream).

#### Inference Usage

```python
import numpy as np
import torch
from transformers import AutoImageProcessor

# Defined in this repository, so run the script from the repository root.
from model import OmnistreamMultiFrameTransformer

processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")
model.eval()

# Random stand-in for a clip of 16 frames.
fake_pixel = np.random.randn(16, 512, 512, 3)  # BxT, H, W, C
# Hugging Face image processors return channels-first tensors by default: BxT, C, H, W
fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")
# Add the batch dimension: (BxT, ...) -> (B, T, ...) with B = 1
fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()

with torch.no_grad():
    output = model(**fake_input, return_dict=True)

print(output.keys())
print(output["last_hidden_state"].shape)  # last layer's hidden states
print(output["hidden_states"][-1].shape)  # last layer's hidden states
print(output["pooler_output"].shape)  # cls token
print(output["patch_start_idx"])  # index of the first patch of each frame (1x[cls] + 4x[reg])
```

## Citations

If you find our work useful, please cite:

```bibtex
@article{yan2026omnistream,
  title={OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams},
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal={arXiv preprint arXiv:2603.12265},
  year={2026},
  url={https://arxiv.org/abs/2603.12265}
}
```
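
## Appendix: Handling Inputs and Outputs (Sketch)

The inference example in Quick Start adds a batch dimension with `unsqueeze(0)`, which fixes `B = 1`, and it reports a `patch_start_idx` that marks where patch tokens begin after the `[cls]` and register tokens. The sketch below shows one way to group several clips into a `(B, T, ...)` batch and to separate special tokens from patch tokens. It is not taken from the official code. The helper names `group_frames` and `split_tokens` are hypothetical, NumPy arrays stand in for the model's tensors, and the `(B, T, N, D)` hidden-state layout is an assumption that should be checked against the shapes printed by the real model.

```python
import numpy as np


def group_frames(frames: np.ndarray, num_frames: int) -> np.ndarray:
    """Reshape a flat stack of frames (B*T, ...) into clips (B, T, ...).

    Hypothetical helper: generalizes the `unsqueeze(0)` step of the
    inference example, which only covers the B = 1 case.
    """
    total = frames.shape[0]
    if total % num_frames != 0:
        raise ValueError(
            f"{total} frames cannot be split into clips of {num_frames} frames"
        )
    return frames.reshape(total // num_frames, num_frames, *frames.shape[1:])


def split_tokens(hidden: np.ndarray, patch_start_idx: int):
    """Split per-frame tokens into (special tokens, patch tokens).

    ASSUMPTION: `hidden` has shape (..., N, D), where N counts the tokens
    of a single frame. Verify this against the real model output.
    """
    special = hidden[..., :patch_start_idx, :]  # [cls] + register tokens
    patches = hidden[..., patch_start_idx:, :]  # patch tokens
    return special, patches


# 16 dummy frames grouped into 2 clips of 8 frames each.
frames = np.zeros((16, 3, 8, 8))
clips = group_frames(frames, num_frames=8)
print(clips.shape)  # (2, 8, 3, 8, 8)

# Dummy hidden states: 5 special tokens (1 [cls] + 4 [reg]) and 4 patch tokens.
hidden = np.zeros((2, 8, 5 + 4, 16))
special, patches = split_tokens(hidden, patch_start_idx=5)
print(special.shape, patches.shape)  # (2, 8, 5, 16) (2, 8, 4, 16)
```

The same slicing and reshaping work on `torch.Tensor` objects. If the real output flattens frames and tokens into a single axis, reshape it to one axis per frame before calling `split_tokens`.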