# High Scale Checkpointing Replicator

This repository contains the source code for the High Scale Checkpointing Replicator for ML training jobs.

## Deployment

A helper shell script builds and publishes the Docker image (a hedged invocation sketch appears at the end of this README):

`deploy/docker-build.sh []`

## Directory Organization

* `deploy/`: Docker image build scripts
* `src/replicator`: Replicator source code

## Documentation

Checkpoint Replicator is designed to be hosted in multiple runtime environments. To use it on Google Kubernetes Engine (GKE) you do not have to build or deploy it yourself, as a fully managed GKE add-on is available. See the following docs (MTC stands for Multi-Tier Checkpointing):

* [MTC blog post](https://cloud.google.com/blog/products/ai-machine-learning/using-multi-tier-checkpointing-for-large-ai-training-jobs)
* [MTC User Guide](https://cloud.google.com/kubernetes-engine/docs/how-to/machine-learning/training/multi-tier-checkpointing)
* [MaxText MTC documentation](https://maxtext.readthedocs.io/en/latest/guides/checkpointing_solutions/multi_tier_checkpointing.html) (for training on TPUs)
* [NeMo MTC recipe](https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/training/a3ultra/llama3-1-405b/nemo-pretraining-gke-resiliency/goodput-guide.md#2-multi-tier-checkpointing-strategy-leveraging-gcs-with-fuse) (for training on GPUs)

The fully managed [GKE hosting controller](https://github.com/GoogleCloudPlatform/high-scale-checkpointing-controller) is open source as well.
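As a minimal sketch of the deployment step above: the helper's optional trailing argument (shown as `[]` in its usage line) is assumed here to be an image tag. That argument is hypothetical, so check `deploy/docker-build.sh` itself for the actual parameter.

```sh
# Run from the repository root.
# Builds the replicator Docker image and publishes it to the configured
# registry. "v0.1.0" is a hypothetical tag standing in for the script's
# undocumented optional argument.
./deploy/docker-build.sh v0.1.0
```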