
# RLDX-1

[[Paper]](https://arxiv.org/abs/2605.03269) [[Project Page]](https://rlwrld.ai/rldx-1) [[Models]](https://huggingface.co/collections/RLWRLD/rldx-1)

---

RLDX-1 is a Vision-Language-Action (VLA) model for human-like dexterous manipulation. Beyond the *versatile intelligence* inherited from pre-trained VLM backbones, RLDX-1 adds three **functional capabilities** (motion awareness, long-term memory, and physical sensing) through a unified **Multi-Stream Action Transformer (MSAT)** architecture, a synthetic-augmented training pipeline, and a real-time inference stack.

---

## Highlights

- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and action each get a dedicated stream coupled by joint self-attention, an extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations plus a motion module capture temporal dynamics; intermediate VLM layers compress video tokens to keep the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features with the current ones for history-grounded decisions beyond a short multi-frame window.
- **Physical sensing.** Tactile and torque signals enter as a dedicated physics stream; the decoder is jointly trained to predict future physical signals.
- **Three-stage training.** Pre-training (generalization) → mid-training (functionality) → post-training (task adaptation), with synthetic data augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture + custom fused kernels bring the all-modality model to **43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz)**.

---

## Performance

### Simulation Benchmarks

Success rates (%) of RLDX-1 fine-tuned on each benchmark's training set, compared to recent frontier VLA baselines.

| Method | LIBERO (Avg) | LIBERO-Plus | SIMPLER Google-VM | SIMPLER Google-VA | SIMPLER WidowX | RoboCasa Kitchen | GR-1 Tabletop | RoboCasa365 (Avg) |
|---|---|---|---|---|---|---|---|---|
| π0-FAST | 85.5 | 64.2 | 61.9 | 59.0 | 48.3 | 63.6 | — | 21.7 |
| π0 | 94.1 | 54.6 | 58.8 | 54.8 | 27.1 | 62.5 | 13.6 | 14.8 |
| π0.5 | 96.9 | 86.5 | 72.7 | 68.4 | 46.9 | 62.1 | 15.4 | 16.9 |
| GR00T N1.5 | 86.5 | 66.3 | 52.4 | 43.7 | 62.0 | 65.7 | 48.0 | 20.0 |
| GR00T N1.6 | 96.7 | 72.6 | 76.1 | 57.1 | 57.1 | 66.2 | 47.6 | 26.9 |
| **RLDX-1 (ours)** | **97.8** | **86.7** | **81.5** | **77.4** | **71.9** | **70.6** | **58.7** | **32.1** |

The first five columns cover the established LIBERO / SIMPLER family; the last three (RoboCasa Kitchen, GR-1 Tabletop, RoboCasa365) are long-horizon, humanoid, and compositional benchmarks.
Per-benchmark checkpoints, embodiment tags, and reproduction commands are listed under [Reproducing Benchmark Results](#reproducing-benchmark-results).

---

## Installation

**Requirements**: Python 3.10, CUDA 12.x, [uv](https://github.com/astral-sh/uv) v0.8.4+

```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

Verify installation:

```bash
uv run python -c "import rldx; print(rldx.__version__)"
```

For simulator setup, dev tooling, and full troubleshooting, see [`docs/installation.md`](docs/installation.md).

---

## Documentation

Hands-on guides live under [`docs/`](docs/):

| Guide | What it covers |
|---|---|
| [`installation.md`](docs/installation.md) | Environment setup, simulator venvs, dev tooling, common pitfalls |
| [`architecture.md`](docs/architecture.md) | Five-stage walkthrough of the RLDX-1 model and its config flags |
| [`training.md`](docs/training.md) | `launch_train.py` recipes (fine-tune / mid-train), LoRA, training-time RTC, dataset layout |
| [`embodiment_tags.md`](docs/embodiment_tags.md) | What `EmbodimentTag` is and how to pick one for a custom robot |
| [`evaluation.md`](docs/evaluation.md) | RoboCasa / LIBERO / SIMPLER / GR-1 eval, server + rollout split, results aggregation |
| [`inference_server.md`](docs/inference_server.md) | `run_rldx_server.py` CLI, wire protocol, RTC modes, `--compile` levels, simulator + real-robot deployment |

---

## Pretrained & Midtrained Checkpoints

| Checkpoint | Description | Params | HuggingFace |
|-----------|-------------|--------|-------------|
| `RLDX-1-PT` | Pre-trained (video) | 6.9B | [RLWRLD/RLDX-1-PT](https://huggingface.co/RLWRLD/RLDX-1-PT) |
| `RLDX-1-MT-DROID` | Mid-trained on DROID with all add-ons | 8.1B | [RLWRLD/RLDX-1-MT-DROID](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) |
| `RLDX-1-MT-ALLEX` | Mid-trained on ALLEX with all add-ons | 8.1B | [RLWRLD/RLDX-1-MT-ALLEX](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) |

---

## Data Preparation

RLDX-1 uses [LeRobot](https://github.com/huggingface/lerobot) v2.1 format datasets. To convert your data:

```bash
# Convert a single dataset
bash run_scripts/data/convert_lerobot_single.sh /path/to/your/data

# Convert multiple datasets
bash run_scripts/data/convert_lerobot_multiple.sh /path/to/data/root
```

Each dataset must carry a `meta/modality.json` that slices the flat state / action vectors into named joint groups and remaps video columns to modality keys. Schema and a worked example are in [`docs/training.md`](docs/training.md#dataset-layout-metamodalityjson).

### Custom Embodiment Config

Define your robot's modality configuration:

```python
# my_modality_config.py
from rldx.data.types import ModalityConfig

MODALITY_CONFIGS = {
    "my_robot": {
        "image": ModalityConfig(...),
        "state": ModalityConfig(...),
        "action": ModalityConfig(...),
    }
}
```

Pass it via `--modality-config-path my_modality_config.py` during training, together with an `EmbodimentTag` that selects the per-robot MLP head slot (default: `GENERAL_EMBODIMENT`; see [`docs/embodiment_tags.md`](docs/embodiment_tags.md) for the picker). The `EmbodimentTag` design and per-embodiment MLP head structure follow the convention introduced by [NVIDIA GR00T N1.7](https://github.com/NVIDIA/Isaac-GR00T/tree/n1.7-release).

---

## Fine-tuning

This section covers how to fine-tune RLDX-1 from a pre-trained checkpoint (`RLWRLD/RLDX-1-PT`) on your own LeRobot v2.1 dataset. The training entry point is a single CLI (`rldx/experiment/launch_train.py`) where flags toggle the optional functional capabilities described in [Highlights](#highlights):

- `--video-length N`: temporal frames per observation (motion awareness)
- `--use-memory`: temporal memory module (long-term memory)
- `--use-motion`: motion module inside the VLM backbone
- `--use-physics --physics-keys ...`: tactile / torque streams (physical sensing)

LoRA, training-time RTC, and the full flag list are documented in [`docs/training.md`](docs/training.md).
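
As a minimal sketch of how the capability toggles above compose, the helper below assembles only those optional flags into an argv fragment. `capability_flags` is a hypothetical function written for this illustration and is not part of the `rldx` package; it covers only the four flags listed above, so a real command still needs the remaining arguments (model path, dataset path, and any physics dimensions) from the documented recipes.

```python
from typing import List, Optional, Sequence


def capability_flags(
    video_length: int = 4,
    use_memory: bool = False,
    use_motion: bool = False,
    physics_keys: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build the optional-capability part of a launch_train.py command line.

    Hypothetical helper: it only strings together flags named in this README.
    """
    if video_length < 1:
        raise ValueError("video_length must be at least 1")

    flags = ["--video-length", str(video_length)]
    if use_memory:
        flags.append("--use-memory")
    if use_motion:
        flags.append("--use-motion")
    if physics_keys:
        # Physical sensing needs both the switch and the list of signal keys.
        flags += ["--use-physics", "--physics-keys", *physics_keys]
    return flags


if __name__ == "__main__":
    # Memory plus tactile/torque sensing, motion module left off.
    print(" ".join(capability_flags(use_memory=True, physics_keys=["tactile", "torque"])))
```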
Below are the canonical recipes.

### Single dataset, no add-ons

```bash
uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-PT \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag GENERAL_EMBODIMENT \
    --video-length 4 \
    --n-cog-tokens 64 \
    --global-batch-size 64 \
    --learning-rate 1e-4 \
    --max-steps 60000 \
    --save-steps 5000 \
    --output-dir ./outputs/my_finetune
```

### With all add-ons (memory + motion + physics)

Recommended for embodiments where memory, motion awareness, or contact sensing matter. To enable a *single* add-on instead of all three, keep just the corresponding `--use-*` flag(s) and drop the rest.

```bash
uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-PT \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag GENERAL_EMBODIMENT \
    --video-length 4 \
    --use-memory --memory-length 4 --concat-memory \
    --use-motion --motion-insert-layer 9 \
    --use-physics --physics-keys tactile torque --physics-dims 30 7 \
    --new-param-warmup-steps 2000 \
    --n-cog-tokens 64 \
    --global-batch-size 64 \
    --max-steps 60000 \
    --output-dir ./outputs/my_finetune_all
```

### Key Training Flags

| Flag | Description | Default |
|------|-------------|---------|
| `--video-length` | Number of video frames (video token compression is always on; set to `1` for single-frame) | `4` |
| `--video-stride` | Stride between frames in action-step units | `2` |
| `--use-memory` | Enable temporal memory module | `False` |
| `--memory-length` | Memory context window (timesteps) | `4` |
| `--use-motion` | Enable motion module | `False` |
| `--use-physics` | Enable physics signal conditioning | `False` |
| `--n-cog-tokens` | Number of cognition tokens | `64` |
| `--global-batch-size` | Total batch size across GPUs | `64` |
| `--new-param-warmup-steps` | Warmup steps for newly added modules | `0` |

### LoRA fine-tuning

For memory-constrained fine-tunes you can replace full-parameter tuning of the action model (MSAT) and/or the backbone VLM
with PEFT LoRA adapters:

```bash
--action-model-use-lora --action-model-lora-rank 16 --action-model-lora-alpha 32
--backbone-use-lora --backbone-lora-rank 16 --backbone-lora-alpha 32 --backbone-lora-num-layers -1
```

`--action-model-use-lora` overrides `--tune-diffusion-model`; `--backbone-use-lora` overrides `--tune-top-llm-layers`. Full flag list and target-module defaults are in [`docs/training.md`](docs/training.md#lora-fine-tuning).

### Training-time Real-Time Chunking

If you intend to serve the checkpoint with `--rtc-inference-mode trained` (faster, fullgraph-compatible), enable RTC during training:

```bash
--rtc-training-max-delay 4
```

The training-time RTC formulation follows [Black et al. (Training-Time Action Conditioning for Efficient Real-Time Chunking)](https://arxiv.org/abs/2512.05964); the inference-side counterpart is [Black et al. (Real-Time Execution of Action Chunking Flow Policies)](https://arxiv.org/abs/2506.07339). See [`docs/training.md`](docs/training.md#real-time-chunking-training-time) and [`docs/inference_server.md`](docs/inference_server.md#real-time-chunking-rtc) for usage details.

---

## Inference

RLDX-1 ships two inference paths sharing the same model + processor:

- **In-process**: load `RLDXPolicy` and call `get_action(obs)` directly from Python. Best for evaluation scripts and notebook prototyping.
- **ZeroMQ server**: `rldx/eval/run_rldx_server.py` for real-robot deployment, with two orthogonal optimizations layered on top of the base path:
  - *Graph capture + kernel fusion* (`--compile {submodule, fullgraph}`): static-graph CUDA-graph capture and custom fused operators bring the all-modality model to **43.7 ms / step on RTX 5090** (1.63× speedup over PyTorch eager, >22 Hz).
  - *Real-Time Chunking* (`--rtc-inference-mode {guided, trained}`): chunk-boundary stitching for smooth action handoff between consecutive chunks.

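
The latency figures quoted above can be cross-checked with a little arithmetic. The sketch below converts 43.7 ms / step into a control rate and derives the PyTorch eager baseline implied by the 1.63× speedup; that baseline is inferred from the two quoted numbers and is not a measurement reported in this README.

```python
def steps_per_second(latency_ms: float) -> float:
    """Convert a per-step latency in milliseconds to a rate in Hz."""
    return 1000.0 / latency_ms


compiled_ms = 43.7  # quoted latency per step on RTX 5090
speedup = 1.63      # quoted speedup over PyTorch eager

# Implied eager latency: compiled latency multiplied by the speedup factor.
eager_ms = compiled_ms * speedup

compiled_hz = steps_per_second(compiled_ms)  # about 22.9 Hz, consistent with ">22 Hz"
eager_hz = steps_per_second(eager_ms)        # about 14.0 Hz

if __name__ == "__main__":
    print(f"compiled: {compiled_ms:.1f} ms/step = {compiled_hz:.1f} Hz")
    print(f"eager (implied): {eager_ms:.1f} ms/step = {eager_hz:.1f} Hz")
```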
### Quick Start

```python
import torch
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-ROBOCASA",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

# Single-step inference.
# `observation` is not defined in this snippet; supply it from your environment or robot interface.
action = policy.get_action(observation)
```

### Serving (ZeroMQ)

For real-time robot deployment:

```bash
# Start the policy server
uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-FT-ROBOCASA \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000
```

### Real-time inference (graph capture + RTC)

The server brings the all-modality model to **43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz)** through two orthogonal knobs:

**`--compile {none, submodule, fullgraph}`**: graph capture + kernel fusion.

- `submodule`: compiles each learnable sub-module. Preserves autograd. ~30 s warmup.
- `fullgraph`: CUDA-graph capture and operator fusion over the full VLA forward. Lowest steady-state latency, ~90–210 s warmup.
- Tuned for RTX 5090 (Blackwell, sm_120). On other GPU architectures use `--compile submodule` for the intended result.

**`--rtc-inference-mode {none, guided, trained}`**: Real-Time Chunking for chunk-boundary stitching.

- `guided`: works with any flow-matching checkpoint.
- `trained`: requires a checkpoint trained with `--rtc-training-max-delay > 0`. Pairs with `--compile fullgraph`.
- Implementation follows [Black et al. (Real-Time Execution of Action Chunking Flow Policies)](https://arxiv.org/abs/2506.07339). The `trained` mode uses the integration from [Black et al. (Training-Time Action Conditioning for Efficient Real-Time Chunking)](https://arxiv.org/abs/2512.05964).

The full flag list, the `compile × RTC` compatibility matrix, and a walkthrough of the trade-offs are in [`docs/inference_server.md`](docs/inference_server.md#real-time-chunking-rtc).

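
The guidance above can be read as a small decision rule. The sketch below encodes one conservative reading of it: use `trained` RTC only when the checkpoint was trained with a positive `--rtc-training-max-delay`, and use `fullgraph` only on sm_120 hardware together with `trained` mode. `serving_flags` is a hypothetical helper, not part of the `rldx` package, and the pairing rule is an assumption drawn from this section; the compatibility matrix in `docs/inference_server.md` remains the authority.

```python
from typing import List, Tuple


def serving_flags(compute_capability: Tuple[int, int], rtc_training_max_delay: int) -> List[str]:
    """Pick --compile and --rtc-inference-mode values (hypothetical helper).

    compute_capability: (major, minor) of the GPU; RTX 5090 (sm_120) is (12, 0).
    rtc_training_max_delay: value used at training time, 0 if RTC training was off.
    """
    # `trained` requires a checkpoint trained with a positive max delay;
    # `guided` works with any flow-matching checkpoint.
    rtc_mode = "trained" if rtc_training_max_delay > 0 else "guided"

    # Assumption: fall back to `submodule` unless we are on sm_120 *and* in
    # `trained` mode, the only pairing this section explicitly recommends.
    on_sm_120 = tuple(compute_capability) == (12, 0)
    compile_level = "fullgraph" if (on_sm_120 and rtc_mode == "trained") else "submodule"

    return ["--compile", compile_level, "--rtc-inference-mode", rtc_mode]


if __name__ == "__main__":
    print(" ".join(serving_flags((12, 0), rtc_training_max_delay=4)))
    print(" ".join(serving_flags((8, 9), rtc_training_max_delay=0)))
```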

---

## Reproducing Benchmark Results

Each benchmark has a self-contained eval README; this table maps each result row in [Performance](#simulation-benchmarks) to the fine-tuned checkpoint we used, the embodiment tag the server expects, and the runnable guide.

| Benchmark | Fine-tuned Checkpoint | Embodiment Tag | Eval Guide |
|---|---|---|---|
| LIBERO | [RLWRLD/RLDX-1-FT-LIBERO](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | `GENERAL_EMBODIMENT` | [`run_scripts/eval/libero/README.md`](run_scripts/eval/libero/README.md) |
| LIBERO-Plus | [RLWRLD/RLDX-1-FT-LIBERO](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | `GENERAL_EMBODIMENT` | [`run_scripts/eval/libero_plus/README.md`](run_scripts/eval/libero_plus/README.md) |
| SimplerEnv Google | [RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | `OXE_FRACTAL` | [`run_scripts/eval/simpler/README.md`](run_scripts/eval/simpler/README.md) |
| SimplerEnv WidowX | [RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | `OXE_BRIDGE_ORIG` | [`run_scripts/eval/simpler/README.md`](run_scripts/eval/simpler/README.md) |
| GR-1 Tabletop | [RLWRLD/RLDX-1-FT-GR1](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | `GENERAL_EMBODIMENT` | [`run_scripts/eval/gr1_tabletop/README.md`](run_scripts/eval/gr1_tabletop/README.md) |
| RoboCasa Kitchen (24 tasks) | [RLWRLD/RLDX-1-FT-ROBOCASA](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | `GENERAL_EMBODIMENT` | [`run_scripts/eval/robocasa_kitchen/README.md`](run_scripts/eval/robocasa_kitchen/README.md) |
| RoboCasa365 | [RLWRLD/RLDX-1-FT-RC365](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | `GENERAL_EMBODIMENT` | [`run_scripts/eval/robocasa_365/README.md`](run_scripts/eval/robocasa_365/README.md) |

Shared mechanics (server + rollout split, common flags, troubleshooting) are documented in [`docs/evaluation.md`](docs/evaluation.md).

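
To avoid pairing a checkpoint with the wrong embodiment tag, the table above can be kept as a lookup. The sketch below copies the checkpoint and tag columns into a dictionary and builds the server command shown in the Serving section. `EVAL_SETUPS` and `server_command` are hypothetical names introduced for this illustration, and the per-benchmark eval READMEs may call for additional flags that are not modeled here.

```python
from typing import Dict, List, Tuple

# Benchmark -> (fine-tuned checkpoint, embodiment tag), copied from the table above.
EVAL_SETUPS: Dict[str, Tuple[str, str]] = {
    "LIBERO": ("RLWRLD/RLDX-1-FT-LIBERO", "GENERAL_EMBODIMENT"),
    "LIBERO-Plus": ("RLWRLD/RLDX-1-FT-LIBERO", "GENERAL_EMBODIMENT"),
    "SimplerEnv Google": ("RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE", "OXE_FRACTAL"),
    "SimplerEnv WidowX": ("RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX", "OXE_BRIDGE_ORIG"),
    "GR-1 Tabletop": ("RLWRLD/RLDX-1-FT-GR1", "GENERAL_EMBODIMENT"),
    "RoboCasa Kitchen": ("RLWRLD/RLDX-1-FT-ROBOCASA", "GENERAL_EMBODIMENT"),
    "RoboCasa365": ("RLWRLD/RLDX-1-FT-RC365", "GENERAL_EMBODIMENT"),
}


def server_command(benchmark: str, host: str = "0.0.0.0", port: int = 20000) -> List[str]:
    """Return the policy-server argv for a benchmark (hypothetical helper)."""
    if benchmark not in EVAL_SETUPS:
        raise KeyError(f"unknown benchmark {benchmark!r}; choose from {sorted(EVAL_SETUPS)}")
    model_path, embodiment_tag = EVAL_SETUPS[benchmark]
    return [
        "uv", "run", "python", "rldx/eval/run_rldx_server.py",
        "--model-path", model_path,
        "--embodiment-tag", embodiment_tag,
        "--host", host,
        "--port", str(port),
    ]


if __name__ == "__main__":
    print(" ".join(server_command("SimplerEnv WidowX")))
```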

---

## Project Structure

```
rldx/
├── configs/          # Model, data, and training configurations
├── data/             # Dataset loaders, processors, and statistics
├── experiment/       # Training entry points and utilities
├── eval/             # Evaluation scripts and sim environments
├── inference/        # Inference engine: GraphSafe substrate, fused Triton kernels, RTC dispatch
├── model/
│   ├── core/         # Core model (RLDX-1, processor, setup)
│   ├── modules/
│   │   ├── backbone/       # RLDX-1-VLM backbone (with video token compression)
│   │   ├── action_model/   # MSAT diffusion action model + physics head
│   │   ├── memory.py       # Temporal memory transformer
│   │   ├── norms.py        # Shared normalization primitives
│   │   └── embodiment_conditioned_mlp.py
│   ├── pipeline.py   # Training/inference pipeline glue
│   └── registry.py   # Embodiment + variant registry
├── policy/           # Inference policy wrappers
└── utils/            # Distributed training utilities
```

---

## Citation

```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Dongyoung Kim and Huiwon Jang and Myungkyu Koo and Suhyeok Jang and Taeyoung Kim and others},
  year={2026},
  journal={arXiv preprint arXiv:2605.03269},
  eprint={2605.03269},
  archivePrefix={arXiv}
}
```

---

## Acknowledgments

RLDX-1 builds upon the following open-source projects:

- [NVIDIA GR00T N1.7](https://github.com/NVIDIA/Isaac-GR00T/tree/n1.7-release): Training Codebase
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL): Vision-language backbone
- [FLUX](https://github.com/black-forest-labs/flux): MMDiT architecture

## License

- **Code**: released under the [Apache License 2.0](LICENSE.md). The codebase is built on the [NVIDIA Isaac GR00T N1.7](https://github.com/NVIDIA/Isaac-GR00T/tree/n1.7-release) framework; third-party attributions and per-file provenance headers are preserved in the source tree.
- **Model weights**: distributed on Hugging Face under the [RLWRLD Model License v1.0](https://huggingface.co/RLWRLD/RLDX-1-PT/blob/main/LICENSE.md) (a non-commercial license with attribution and share-alike terms). By using any `RLWRLD/RLDX-1-*` checkpoint you agree to those terms.

## Contributions

We currently do not accept external pull requests on this repository. If you encounter a bug or a broken reproduction step, or have a question about RLDX-1, please **open an issue** at [github.com/RLWRLD/RLDX-1/issues](https://github.com/RLWRLD/RLDX-1/issues) and we will follow up there.