# node-readiness-controller

> [WIP] This readme is being updated.

This repository contains a reference implementation of [NodeReadinessGates](https://github.com/kubernetes/enhancements/pull/5416) as a Kubernetes controller that manages node taints based on multiple readiness gate conditions, providing fine-grained control over when nodes are ready to accept workloads.

## Goals

- Enhance scheduling accuracy by leveraging standardized node readiness conditions.
- Increase the precision of autoscaling operations.
- Provide improved observability into node status and critical component health.

## Overview

The Node Readiness Controller extends Kubernetes' node readiness model by allowing you to define custom readiness rules that evaluate multiple node conditions simultaneously. It automatically manages node taints to prevent scheduling until all specified conditions are satisfied.

### Demo

**Node Readiness concept in a Kind cluster**

TODO:(ajaysundark) figure out how to embed video demo!

https://drive.google.com/file/d/1Q2vCU7FYUrEkHDeQV5NneCH_MNoFshvH/view?usp=sharing

### Key Features

- **Multi-condition Rules**: Define rules that require ALL specified conditions to be satisfied
- **Flexible Enforcement**: Support for bootstrap-only and continuous enforcement modes
- **Conflict Prevention**: Validation webhook prevents conflicting taint configurations
- **Dry Run Mode**: Preview rule impact before applying changes
- **Comprehensive Status**: Detailed observability into rule evaluation and node status
- **Node Targeting**: Use label selectors to target specific node types
- **Bootstrap Completion Tracking**: Prevents re-evaluation once bootstrap conditions are met

## Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ NodeReadiness   │    │  ReadinessGate   │    │   Validation    │
│ GateRule CRD    │───▶│    Controller    │    │     Webhook     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                       │
                                ▼                       │
                       ┌─────────────────┐              │
                       │   Node Taints   │              │
                       │   Management    │              │
                       └─────────────────┘              │
                                │                       │
                                ▼                       ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Kubernetes    │    │  Rule Conflict  │
                       │      Nodes      │    │    Detection    │
                       └─────────────────┘    └─────────────────┘
```

Detailed flow:

```mermaid
graph TB
    %% CRD and Rules
    CRD[NodeReadinessGateRule CRD] --> RuleRec[RuleReconciler]

    %% Controller Components
    RuleRec --> Cache[Rule Cache]
    NodeRec[NodeReconciler] --> Cache
    Cache --> Controller[ReadinessGateController]

    %% Node Processing
    Nodes[Kubernetes Nodes] --> NodeRec
    NodeRec --> TaintMgmt[Taint Management]
    Controller --> TaintMgmt
    TaintMgmt --> Nodes

    %% Validation
    CRD --> Webhook[Validation Webhook]
    Webhook --> Validation[Conflict Detection<br/>& Rule Validation]

    %% External Systems
    NPD[Node Problem Detector<br/>& Health Checkers] --> Conditions[Node Conditions]
    Conditions --> Nodes

    %% Status and Observability
    Controller --> Status[Status Updates]
    Status --> CRD

    %% Enforcement Modes
    Controller --> Bootstrap{Bootstrap Mode?}
    Bootstrap -->|Yes| BootstrapLogic[One-time Taint Removal<br/>+ Annotation Marking]
    Bootstrap -->|No| ContinuousLogic[Continuous Monitoring<br/>+ Taint Management]

    %% Styling
    classDef crd fill:#e1f5fe
    classDef controller fill:#f3e5f5
    classDef external fill:#e8f5e8
    classDef validation fill:#fff3e0

    class CRD crd
    class RuleRec,NodeRec,Controller,Cache,TaintMgmt controller
    class NPD,Nodes,Conditions external
    class Webhook,Validation validation
```

### Core Components

#### 1. NodeReadinessGateRule CRD

- Defines rules mapping multiple node conditions to a single taint
- Supports bootstrap-only and continuous enforcement modes
- Allows node selector targeting and grace periods

#### 2. ReadinessGateController

- **RuleReconciler**: Processes rule changes and updates the internal cache
- **NodeReconciler**: Handles node condition changes and evaluates applicable rules
- Manages taint addition/removal based on condition satisfaction

#### 3. [WIP] Validation Webhook

- Prevents conflicting rules (same taint key with overlapping node selectors)
- Validates rule specifications and required fields
- Ensures system consistency and prevents misconfigurations

#### 4. Integration with Node Problem Detector (NPD)

Works seamlessly with NPD or any system that sets node conditions:

- NPD plugins update node conditions (e.g., `network.kubernetes.io/CNIReady`)
- The controller watches condition changes and evaluates rules
- Supports custom conditions from any component
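The conditions a rule evaluates are plain entries in a node's `status.conditions`, so any component with permission to patch node status can act as a readiness signal, and you can simulate one by hand. A minimal sketch using the `network.kubernetes.io/CNIReady` condition mentioned above (the reason and message strings are illustrative; `--subresource` requires kubectl v1.24+):

```sh
# Strategic merge patch against the status subresource; node conditions
# merge by "type", so only this condition is added or updated.
kubectl patch node <node-name> --subresource=status --type=strategic -p '{
  "status": {
    "conditions": [{
      "type": "network.kubernetes.io/CNIReady",
      "status": "True",
      "reason": "CNIInitialized",
      "message": "CNI plugin reported ready"
    }]
  }
}'
```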
## Getting Started

### Quick Start Examples

#### Example 1: Storage Readiness Rule (Bootstrap-only)

This rule ensures nodes have working storage before removing the storage readiness taint:

```yaml
apiVersion: nodereadiness.io/v1alpha1
kind: NodeReadinessGateRule
metadata:
  name: storage-readiness-rule
spec:
  conditions:
  - type: "storage.kubernetes.io/CSIReady"
    requiredStatus: "True"
  - type: "storage.kubernetes.io/VolumePluginReady"
    requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/StorageReady"
    effect: "NoSchedule"
    value: "pending"
  enforcementMode: "bootstrap-only"
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```

#### Example 2: Network Readiness Rule (Continuous)

This rule continuously monitors network connectivity:

```yaml
apiVersion: nodereadiness.io/v1alpha1
kind: NodeReadinessGateRule
metadata:
  name: network-readiness-rule
spec:
  conditions:
  - type: "network.kubernetes.io/NetworkReady"
    requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/NetworkReady"
    effect: "NoSchedule"
  enforcementMode: "continuous"
  gracePeriod: "60s"
  dryRun: true  # Preview mode
```

### Rule Specification

| Field | Description | Required |
|-------|-------------|----------|
| `conditions` | List of node conditions that must ALL be satisfied | Yes |
| `conditions[].type` | Node condition type to evaluate | Yes |
| `conditions[].requiredStatus` | Required condition status (`True`, `False`, `Unknown`) | Yes |
| `taint.key` | Taint key to manage | Yes |
| `taint.effect` | Taint effect (`NoSchedule`, `PreferNoSchedule`, `NoExecute`) | Yes |
| `taint.value` | Optional taint value | No |
| `enforcementMode` | `bootstrap-only` or `continuous` | Yes |
| `nodeSelector` | Label selector to target specific nodes | No |
| `gracePeriod` | Grace period before applying taint changes | No |
| `dryRun` | Preview changes without applying them | No |
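Note that the components responsible for making these conditions true (the CSI driver or CNI daemonsets themselves, for instance) must be able to schedule onto still-tainted nodes, so their pod templates need a matching toleration. A sketch for the network rule above:

```yaml
# Toleration for the taint managed by network-readiness-rule; add it to the
# pod template of any daemonset that must run before the node is marked ready.
tolerations:
- key: "readiness.k8s.io/NetworkReady"
  operator: "Exists"
  effect: "NoSchedule"
```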
### Enforcement Modes

#### Bootstrap-only Mode

- Removes the bootstrap taint when conditions are first satisfied
- Marks completion with a node annotation
- Stops monitoring after successful removal (fail-safe)
- Ideal for one-time setup conditions (storage, installing node daemons such as a security agent or a kernel-module update)

#### Continuous Mode

- Continuously monitors conditions
- Adds the taint when any condition becomes unsatisfied
- Removes the taint when all conditions become satisfied
- Ideal for ongoing health monitoring (network connectivity, resource availability)
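Bootstrap-only mode assumes nodes join the cluster already carrying the taint. One standard way to arrange that (a provisioning-setup assumption, not something this controller does for you) is the kubelet's `registerWithTaints` configuration, keyed to match the rule's `taint.key`:

```yaml
# KubeletConfiguration snippet: new nodes register with the bootstrap taint,
# which the storage-readiness-rule later removes.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: "readiness.k8s.io/StorageReady"
  value: "pending"
  effect: "NoSchedule"
```

The equivalent kubelet command-line flag is `--register-with-taints`.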
## Deployment

### Option 1: Using Make Commands

**Build and push your image to the location specified by `IMG`:**

```sh
make docker-build docker-push IMG=<registry>/nrgcontroller:tag
```

**Install the CRDs into the cluster:**

```sh
make install
```

**Deploy the Manager to the cluster with the image specified by `IMG`:**

```sh
make deploy IMG=<registry>/nrgcontroller:tag
```

**Create sample rules:**

```sh
kubectl apply -k config/samples/
```

### Option 2: Using Kustomize Directly

```sh
# Install CRDs
kubectl apply -k config/crd

# Deploy controller with RBAC
kubectl apply -k config/default

# Apply sample rules
kubectl apply -f examples/network-readiness-rule.yaml
```

### Verification

Check that the controller is running:

```sh
kubectl get pods -n nrgcontroller-system
kubectl logs -n nrgcontroller-system deployment/nrgcontroller-controller-manager
```

Verify the CRDs are installed:

```sh
kubectl get crd nodereadinessgaterules.nodereadiness.io
```
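To confirm the controller is actually managing taints once a rule is applied, you can inspect a node directly; the key below is the one from the network rule example:

```sh
# List the taints currently set on a node
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
```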
## Operations

### Monitoring Rule Status

View rule status and evaluation results:

```sh
# List all rules
kubectl get nodereadinessgaterules

# Detailed status of a specific rule
kubectl describe nodereadinessgaterule network-readiness-rule

# Check rule evaluation per node
kubectl get nodereadinessgaterule network-readiness-rule -o yaml
```

The status includes:

- `appliedNodes`: Nodes this rule targets
- `failedNodes`: Nodes with evaluation errors
- `nodeEvaluations`: Per-node condition evaluation results
- `dryRunResults`: Impact analysis for dry-run rules
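To see at a glance which nodes a rule is still holding back, a jsonpath filter over node taints works; the taint key is again the one from the network rule example:

```sh
# Print each node name followed by the effect of the managed taint
# (the second column is empty once the taint has been removed)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[?(@.key=="readiness.k8s.io/NetworkReady")].effect}{"\n"}{end}'
```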
### Dry Run Mode

Test rules safely before applying them:

```yaml
spec:
  dryRun: true  # Enable dry run mode
  conditions:
  - type: "storage.kubernetes.io/CSIReady"
    requiredStatus: "True"
  # ... rest of spec
```

Check dry run results:

```sh
kubectl get nodereadinessgaterule <rule-name> -o jsonpath='{.status.dryRunResults}'
```

### Bootstrap Completion Tracking

For bootstrap-only rules, completion is tracked via node annotations:

```sh
# Check whether bootstrap completed for a node
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'

# Look for: readiness.k8s.io/bootstrap-completed-<rule-name>=true
```

### Troubleshooting

#### Common Issues

1. **Rule conflicts**: Multiple rules targeting the same taint key

   ```sh
   # Check validation webhook logs
   kubectl logs -n nrgcontroller-system deployment/nrgcontroller-controller-manager | grep webhook
   ```

2. **Missing node conditions**: Rules waiting for conditions that don't exist

   ```sh
   # Check node conditions
   kubectl describe node <node-name> | grep Conditions -A 20

   # Check rule evaluation status
   kubectl get nodereadinessgaterule <rule-name> -o yaml | grep nodeEvaluations -A 50
   ```

3. **RBAC issues**: The controller can't update nodes or rules

   ```sh
   # Check controller logs for permission errors
   kubectl logs -n nrgcontroller-system deployment/nrgcontroller-controller-manager

   # Verify RBAC
   kubectl describe clusterrole nrgcontroller-manager-role
   ```

#### Debug Mode

Enable verbose logging:

```sh
# Edit the controller deployment to add debug flags
kubectl patch deployment -n nrgcontroller-system nrgcontroller-controller-manager \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","args":["--zap-log-level=debug"]}]}}}}'
```

## Uninstallation

**Delete all rule instances:**

```sh
kubectl delete nodereadinessgaterules --all
```

**Delete the controller:**

```sh
make undeploy
```

**Delete the CRDs:**

```sh
make uninstall
```

## Advanced Configuration

### Security Considerations

The controller requires the following RBAC permissions:

- **Nodes**: `get`, `list`, `watch`, `patch`, `update` (for taint management)
- **NodeReadinessGateRules**: Full CRUD access
- **Events**: `create` (for status reporting)

### Performance and Scalability

- **Memory Usage**: ~64MB base + ~1KB per node + ~2KB per rule
- **CPU Usage**: Minimal during steady state; scales with node/rule change frequency
- **Node Scale**: Tested up to 100 nodes using kwok (1k nodes in progress)
- **Rule Scale**: Recommended maximum of 50 rules per cluster

### Integration Patterns

#### With Node Problem Detector

```yaml
# NPD checks and sets conditions, the controller manages taints
conditions:
- type: "readiness.k8s.io/NetworkReady"  # Set by NPD
  requiredStatus: "False"
```

#### With Custom Health Checkers

```yaml
# Your daemonset sets custom conditions
conditions:
- type: "readiness.k8s.io/mycompany.example.com/DatabaseReady"
  requiredStatus: "True"
- type: "readiness.k8s.io/mycompany.example.com/CacheWarmed"
  requiredStatus: "True"
```
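A health-checker daemonset can publish such conditions with any Kubernetes client. Below is a minimal client-go sketch, assuming in-cluster credentials with `patch` permission on `nodes/status` and a downward-API `NODE_NAME` environment variable; the condition type and probe result are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The daemonset exposes its node name via the downward API.
	nodeName := os.Getenv("NODE_NAME")

	// Run the actual health probe here; hardcoded for illustration.
	status := "True"

	// Strategic merge patch on the status subresource; node conditions
	// merge by "type", so only this condition is added or updated.
	patch := fmt.Sprintf(`{"status":{"conditions":[{
		"type": "readiness.k8s.io/mycompany.example.com/DatabaseReady",
		"status": %q,
		"reason": "ProbeSucceeded",
		"message": "database reachable from node"
	}]}}`, status)

	if _, err := clientset.CoreV1().Nodes().PatchStatus(context.TODO(), nodeName, []byte(patch)); err != nil {
		panic(err)
	}
}
```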
#### With Cluster Autoscaler

The NRG controller works well with cluster autoscaling:

- New nodes start with restrictive taints
- The controller removes taints once conditions are satisfied
- The autoscaler can safely scale, knowing nodes are truly ready

## Development

### Building from Source

```sh
# Clone the repository
git clone https://github.com/ajaysundark/node-readiness-gate-controller.git
cd node-readiness-gate-controller

# Run tests
make test

# Build the binary
make build

# Generate manifests
make manifests
```

### Running Locally

```sh
# Install CRDs
make install

# Run against a cluster (requires KUBECONFIG)
make run
```
### Contributing

1. Fork the repository
2. Create a feature branch
3. Make changes and add tests
4. Run `make test` to verify
5. Submit a pull request

Please ensure:

- All tests pass (`make test`)
- Code follows Go conventions (`make fmt`, `make vet`)
- New features include unit tests
- Documentation is updated

## API Reference

For detailed API documentation, see the [generated API docs](./docs/api.md) or explore the CRD definition:

```sh
kubectl explain nodereadinessgaterule.spec
kubectl explain nodereadinessgaterule.status
```

## Roadmap

- [ ] Add documentation capturing design details
- [ ] Metrics and alerting integration
- [ ] Validation webhook for rules
- [ ] Improve logging and add debugging pointers
- [ ] Performance optimizations for large clusters
- [ ] Scale testing with 1000+ nodes

## Project Distribution

[WIP]

### YAML Bundle

Generate a single YAML file with all resources:

```sh
make build-installer IMG=<registry>/nrgcontroller:tag
kubectl apply -f dist/install.yaml
```

### Helm Chart

[WIP]

```sh
# Generate Helm chart
kubebuilder edit --plugins=helm/v1-alpha

# Install via Helm
helm install nrgcontroller ./dist/chart
```

## Support

- **Issues**: [GitHub Issues](https://github.com/ajaysundark/node-readiness-gate-controller/issues)
- **Discussions**: [GitHub Discussions](https://github.com/ajaysundark/node-readiness-gate-controller/discussions)
- **Documentation**: [Project Wiki](https://github.com/ajaysundark/node-readiness-gate-controller/wiki)

## Contributing

// TODO(ajaysundark): Add detailed information on how you would like others to contribute to this project

## Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the [community page](http://kubernetes.io/community/).

You can reach the maintainers of this project at:

- [Slack channel](https://kubernetes.slack.com/messages/sig-node)
- [Mailing List](https://groups.google.com/a/kubernetes.io/g/sig-node)

### Code of conduct

Participation in the Kubernetes community is governed by the [Kubernetes Code of Conduct](code-of-conduct.md).