# tabasco
**Repository Path**: ahlih_admin/tabasco
## Basic Information
- **Project Name**: tabasco
- **License**: MIT
- **Default Branch**: main
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-16
- **Last Updated**: 2025-07-16
## README
# TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality

Carlos Vonessen\*, Charles Harris\*, Miruna Cretu\*, Pietro Liò

GenBio Workshop @ ICML 2025 (\* = core contributor)
[Paper](https://arxiv.org/pdf/2507.00899) · [Announcement thread](https://x.com/carlos_vonessen/status/1940671990647726539)
## Main Contributions
* State-of-the-art performance on PoseBusters ([link](https://paperswithcode.com/sota/unconditional-molecule-generation-on-geom))
* 10x speed-up at sampling time (see Table 1)
* More parameter-efficient (see Figure 1)
* Standard non-equivariant Transformer
* Lean and extensible implementation
## Getting Started
**Introduction to Repo:** This repository is based on the [lightning hydra template](https://github.com/ashleve/lightning-hydra-template), which includes an introduction to Hydra with PyTorch and general usage instructions.
**Downloading datasets:** The processed datasets are available for [GEOM-Drugs](https://huggingface.co/datasets/carlosinator/tabasco-geom-drugs) and [QM9](https://huggingface.co/datasets/carlosinator/tabasco-qm9). Move all splits to `src/data` without renaming them. The first run of `src/train.py` builds the LMDB dataset; this happens only once and can take about an hour.
**Checkpoints:** We currently provide checkpoints for two models trained on GEOM-Drugs: [TABASCO-mild (3.7M)](https://huggingface.co/carlosinator/tabasco-geom-mild) and [TABASCO-hot (15M)](https://huggingface.co/carlosinator/tabasco-geom-hot). More to follow!
### Installation
```bash
conda env create -f environment.yaml
conda activate tabasco
```
### Training
The training configs are available under `configs/experiment` and override the defaults in the other `configs/*` folders. To train the `TABASCO-hot` model from the paper, run:
```bash
python src/train.py experiment=hot_geom trainer=gpu
```
**Multi-GPU Training** is available via `torchrun`, and trainer parameters are customizable in `configs/trainer`. Depending on your setup, you may need to pass additional command-line arguments to `torchrun`. For example, to train on two GPUs on one node with DDP (assuming a suitable `ddp.yaml` config), run:
```bash
torchrun --nproc_per_node=2 --nnodes=1 src/train.py experiment=hot_geom trainer=ddp
```
### Sampling
We provide two scripts for sampling from a model checkpoint, along with some convenient parameters to modify. Unconditional sampling is run with:
```bash
python src/sample.py \
--num_mols 1000 --num_steps 100 \
--checkpoint path/to/model.ckpt \
--output_path path/to/output/folder
```
**Boosting Physical Plausibility**: This script samples molecules with boosted physical quality (Section 3.5). Here `guidance` sets the step size of each gradient step, `step-switch` the sampling step at which to switch to UFF bound guidance, and `to-center` whether to regress to the interval center.
```bash
python src/sample_uff_bounds.py \
--guidance 0.01 --step-switch 90 --to-center False \
--ckpt path/to/model.ckpt --output-dir path/to/output/folder
```
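The bound-guidance idea can be pictured with a toy NumPy sketch: whenever an interatomic distance violates its UFF-style `[lower, upper]` interval, both atoms are nudged along the pair axis toward the nearest bound (or, with `to_center`, toward the interval midpoint). The function name and update rule below are illustrative, not the repo's actual implementation.

```python
import numpy as np

def bound_guidance_step(coords, lower, upper, guidance=0.01, to_center=False):
    """Toy sketch of one distance-bound guidance update (illustrative only):
    nudge each atom pair whose distance violates [lower, upper]."""
    n = coords.shape[0]
    out = coords.copy()
    for i in range(n):
        for j in range(i + 1, n):
            diff = coords[j] - coords[i]
            d = np.linalg.norm(diff)
            if d < 1e-9:
                continue
            # Target distance: interval midpoint, or the nearest bound.
            target = (0.5 * (lower[i, j] + upper[i, j]) if to_center
                      else np.clip(d, lower[i, j], upper[i, j]))
            # Move both atoms symmetrically along the pair axis.
            delta = guidance * (target - d) * diff / d
            out[j] += 0.5 * delta
            out[i] -= 0.5 * delta
    return out

# Two atoms 2.0 apart with an upper bound of 1.5: one full-strength step
# (guidance=1.0) pulls them exactly onto the bound.
coords = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
lower = np.array([[0.0, 1.0], [1.0, 0.0]])
upper = np.array([[0.0, 1.5], [1.5, 0.0]])
new = bound_guidance_step(coords, lower, upper, guidance=1.0)
print(np.linalg.norm(new[1] - new[0]))  # 1.5
```

In the actual script, a small `guidance` value is applied over many steps rather than a single full-strength correction.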
## Repository Summary
### Model Architecture
The model uses a deliberately simplified non-equivariant Transformer that treats molecular generation as a sequence modeling problem (see the [positional encodings](src/pocketsynth/models/components/positional_encoder.py)). Coordinates and atom types are jointly embedded with time and positional encodings, then processed through standard [Transformer blocks](src/pocketsynth/models/components/transformer.py). No explicit bond information is included; the model relies on generating physically sensible coordinates so that standard cheminformatics tools can infer bonds reliably. Optional cross-attention layers allow separate processing of coordinate and atom-type domains before final MLP heads predict the outputs. The full [model implementation](src/pocketsynth/models/components/transformer_module.py) is easily extensible compared to specialized equivariant architectures.
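Since atoms are treated as a sequence, the standard sinusoidal positional encoding from vanilla Transformers applies directly. The sketch below is a generic illustration of that encoding, not the repo's exact code:

```python
import numpy as np

def sinusoidal_positional_encoding(num_atoms: int, dim: int) -> np.ndarray:
    """Standard sinusoidal encoding over atom index, as in vanilla Transformers."""
    positions = np.arange(num_atoms)[:, None]                       # (num_atoms, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim / 2,)
    angles = positions * freqs[None, :]                             # (num_atoms, dim / 2)
    enc = np.zeros((num_atoms, dim))
    enc[:, 0::2] = np.sin(angles)  # even channels
    enc[:, 1::2] = np.cos(angles)  # odd channels
    return enc

# Each atom token (joint coordinate + atom-type embedding) would be summed
# with its positional and time encodings before the Transformer blocks.
pe = sinusoidal_positional_encoding(num_atoms=32, dim=64)
print(pe.shape)  # (32, 64)
```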
### Interpolant Class
We combine the required interpolant functionality in one base `Interpolant` class to make the code more readable and extensible. In practice, we found that this significantly increases iteration speed and improves verifiability. The `SDEMetricInterpolant` manages coordinate flows with configurable noise scaling and centering, while `DiscreteInterpolant` handles categorical atom types in the discrete diffusion framework. Each interpolant defines four key operations: noise sampling, path creation between data points, loss computation, and explicit-Euler stepping during generation. This modular design allows mixing different interpolation strategies for different molecular properties while maintaining a unified training loop.
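The four operations can be sketched for the simplest case, a straight-line coordinate interpolant with zero-centered noise. Class and method names here are hypothetical stand-ins for the repo's `Interpolant` API, not its actual signatures:

```python
import numpy as np

class LinearCoordinateInterpolant:
    """Minimal sketch of the four interpolant operations (hypothetical API):
    noise sampling, path creation, loss computation, and Euler stepping."""

    def sample_noise(self, shape, rng):
        # Zero-centered Gaussian noise for coordinates.
        noise = rng.standard_normal(shape)
        return noise - noise.mean(axis=0, keepdims=True)

    def interpolate(self, x0, x1, t):
        # Straight-line path between noise x0 and data x1 at time t in [0, 1].
        return (1.0 - t) * x0 + t * x1

    def loss(self, v_pred, x0, x1):
        # Regress the predicted velocity onto the constant target x1 - x0.
        return np.mean((v_pred - (x1 - x0)) ** 2)

    def euler_step(self, x_t, v_pred, dt):
        # Explicit-Euler update used during generation.
        return x_t + dt * v_pred

rng = np.random.default_rng(0)
interp = LinearCoordinateInterpolant()
x0 = interp.sample_noise((8, 3), rng)   # 8 atoms, 3D coordinates
x1 = rng.standard_normal((8, 3))
x_mid = interp.interpolate(x0, x1, t=0.5)
x_next = interp.euler_step(x_mid, x1 - x0, dt=0.5)  # exact velocity recovers x1
print(np.allclose(x_next, x1))  # True
```

The repo's `SDEMetricInterpolant` and `DiscreteInterpolant` add noise scaling, centering, and categorical handling on top of this basic pattern.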
## Citation
```bibtex
@article{vonessen2025tabasco,
  title={TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality},
  author={Carlos Vonessen and Charles Harris and Miruna Cretu and Pietro Li{\`o}},
  year={2025},
  url={https://arxiv.org/abs/2507.00899},
}
```