Best Practices

Repository Structure and Organization

Avoid Monolithic Repositories and Binaries

The Best Practice: Maintain lean Git repositories. Do not commit large binaries, images, or video files to your repositories. Large repositories increase cloning times and outbound traffic, which can overwhelm the gitjob controller. Limit the total number of watched repositories to 100 or fewer.

The Implementation: When managing large repositories in CI pipelines or locally, use the FLEET_BUNDLE_CREATION_MAX_CONCURRENCY environment variable with the fleet apply command. This limits the parallelism of bundle creation and prevents CPU starvation.

Isolate Branches to Prevent Redundant Loops

The Best Practice: Avoid having multiple GitRepo resources watch the same repository and branch. A single commit to a shared branch triggers an update for all associated GitRepo resources. This effectively acts like calling a ForceUpdate across all of them, forcing a reconciliation loop even if there were absolutely no changes to the specific paths used in those individual GitRepos.

The Implementation: Configure the GitRepo custom resource to target specific paths within the repository. Do not use overlapping watches on the root directory. This segregates the areas Fleet monitors and reduces unnecessary deployment activity.
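As a sketch, a GitRepo scoped to a single path might look like this (the name and URL are hypothetical):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: app-team-a                 # hypothetical name
  namespace: fleet-default
spec:
  repo: https://git.example.com/org/deploy-repo   # hypothetical URL
  branch: main
  paths:
    - apps/team-a                  # watch only this subtree, not the repository root
```

A second GitRepo watching `apps/team-b` on the same branch would still share commit events, so prefer separate branches or repositories when teams must be fully isolated.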

Avoid Deploying Nested Fleet Resources

The Best Practice: Avoid deploying nested Fleet resources whenever possible. Specifically, try to avoid using a GitRepo to deploy additional GitRepo resources, or having a GitRepo directly deploy other Fleet Bundles.

The Implementation: Although nested GitRepo Custom Resources are technically supported by Fleet, this approach creates chained, multi-layered reconciliation loops. Nesting obscures observability, makes debugging extremely difficult if a deployment stalls, and adds unnecessary processing overhead to the control plane. Flatten your repository management structure by declaring all necessary GitRepo resources at the top level of your management cluster. If specific deployment ordering is required, rely on the dependsOn array within your fleet.yaml to enforce strict ordering natively across bundles.
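For ordering, a minimal fleet.yaml sketch using dependsOn (bundle names are hypothetical):

```yaml
# fleet.yaml of the dependent bundle
name: backend-api          # hypothetical bundle name
dependsOn:
  - name: shared-infra     # deploy only after this bundle is Ready
```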

Bundle Design

Understand the Primary Scaling Limit (Clusters × Bundles)

The Best Practice: The primary bottleneck for Fleet deployments is the total number of BundleDeployment objects stored in etcd. This limit is determined by the number of clusters multiplied by the number of bundles.

The Implementation: Monitor the generated resources in the management cluster. For every cluster that a bundle targets, Fleet creates a BundleDeployment object and an associated Secret in the cluster-fleet-default namespace. Fleet also creates a Secret for each Bundle. However, this Bundle Secret doesn’t significantly affect scaling because each path in a GitRepo resource generates only one bundle. The multiplicative growth comes from the targets: every additional cluster multiplies the total number of generated resources.

BundleDeployment and Bundle Secrets are only created when Helm values are used, which effectively doubles the etcd footprint to (Clusters × Bundles × 2) for Helm-style deployments.

Account for Resource Multipliers

The Best Practice: The resource footprint in etcd varies by bundle format. Manifests or Kustomize bundles result in a 1:1 ratio (Clusters × Bundles). However, Helm-style bundles increase this footprint to (Clusters × Bundles × 2) + Bundles. This occurs because Fleet generates an additional Secret for Helm values for every BundleDeployment.

The Implementation: Track the Secret (bundle-deployment) and Secret (content-access) resources generated by the Fleet controller. Ensure your etcd quotas can accommodate these automatically created objects.
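The multipliers above can be sanity-checked with quick arithmetic. A sketch in Python, encoding only the formulas stated in this guide:

```python
def fleet_etcd_objects(clusters: int, bundles: int, fmt: str) -> int:
    """Approximate count of Fleet-generated etcd objects, per this guide's formulas."""
    if fmt in ("manifests", "kustomize"):
        # One BundleDeployment per (cluster, bundle) pair.
        return clusters * bundles
    if fmt == "helm":
        # Plus a Helm-values Secret per BundleDeployment, plus one Secret per Bundle.
        return clusters * bundles * 2 + bundles
    if fmt == "oci":
        # Plus a content-access Secret per downstream cluster (see the OCI section).
        return clusters * bundles * 3
    raise ValueError(f"unknown bundle format: {fmt}")

print(fleet_etcd_objects(2000, 150, "manifests"))  # 300000
print(fleet_etcd_objects(2000, 150, "helm"))       # 600150
```

Even the plain-manifest case shows how quickly 150 bundles on 2,000 clusters blows past comfortable etcd object counts.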

Consolidate with Helm Subcharts

The Best Practice: Combine multiple application components into a single bundle using Helm subcharts. This consolidation reduces the total number of bundles and prevents etcd database exhaustion.

The Implementation: Structure your GitRepo path as a unified Helm chart. Fleet renders the content, whether manifests, Kustomize, or Helm, into a single Helm chart and installs it on the downstream cluster as a single release.
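One way to consolidate is an umbrella chart whose Chart.yaml declares each component as a subchart (component names are hypothetical):

```yaml
# Chart.yaml of an umbrella chart referenced by a single GitRepo path
apiVersion: v2
name: platform
version: 0.1.0
dependencies:
  - name: frontend
    version: "1.2.0"
    repository: "file://charts/frontend"   # vendored subchart
  - name: backend
    version: "2.0.1"
    repository: "file://charts/backend"
```

The whole platform then counts as one bundle per cluster instead of one per component.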

Environment Management

Use targetCustomizations Carefully

The Best Practice: Define environment-specific rules within Git. However, do not use targetCustomizations to override helm.chart, helm.repo, or helm.version to fetch different charts dynamically. Doing so forces Fleet to include multiple chart versions in the same bundle, massively bloating the bundle size and potentially exceeding etcd limits.

The Implementation: Use the targetCustomizations array in your fleet.yaml file to customize options, such as Helm values, for different environments. Don’t use targetCustomizations to select the target clusters themselves; use the targets: array in the GitRepo resource for cluster selection. Fleet evaluates your target customizations in order and applies the first match it finds for a cluster’s labels. If you require fundamentally different charts, define separate GitRepo resources instead.
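A hedged fleet.yaml sketch (labels and values are hypothetical) that overrides only Helm values per environment:

```yaml
# fleet.yaml
helm:
  values:
    replicas: 1            # default for clusters matching no customization
targetCustomizations:
  - name: production       # Fleet applies the first matching entry
    clusterSelector:
      matchLabels:
        env: prod
    helm:
      values:
        replicas: 3        # override values only; never swap helm.chart/repo/version here
```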

GitRepo Configuration

Prioritize Webhooks over Polling

The Best Practice: Do not rely on constant polling for Git repositories, as it leads to rate-limiting and high CPU utilization at scale. Configure webhooks to ensure Fleet only consumes resources when a change is detected.

The Implementation: Enable webhook events for the gitjob-controller. When a webhook triggers, the controller creates a job to clone the repository and generate the Fleet bundle immediately.

When webhooks are configured, Fleet automatically scales back polling intervals to 1 hour.

Targeting Clusters

Keep Targeting Rules in Git

The Best Practice: Maintain all targeting rules within the Git repository. Avoid defining application-to-cluster mappings through a UI or DB. Keeping rules in a Git repository ensures the state is declarative and version-controlled.

The Implementation: Use a ClusterGroup CRD to define reusable selectors (e.g., env=prod). Map workloads to these clusters using label matching in the targets: section of your GitRepo resource to dynamically evaluate and deploy to matching clusters as they come online.
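A sketch of the two halves (labels and names are hypothetical): a reusable ClusterGroup, and a GitRepo targets entry that references it:

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: prod-clusters
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      env: prod
---
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: payments                               # hypothetical workload
  namespace: fleet-default
spec:
  repo: https://git.example.com/org/payments   # hypothetical URL
  targets:
    - clusterGroup: prod-clusters              # deploys wherever the group matches
```

Because both objects live in Git, adding a new cluster is just a matter of labeling it env=prod.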

Multi-Cluster Strategy and Scaling

The scaling limit of Fleet is the total number of BundleDeployments (calculated as downstream clusters × deployed bundles). When scaling to 5,000 - 10,000 clusters, this multiplier creates hundreds of thousands of resources that will overwhelm a single monolithic controller.

Invert the Traffic Flow (Pull vs. Push)

The Best Practice: While Fleet always uses a pull model for application deployments, it offers both pull (agent-initiated) and push (manager-initiated) modes for initial cluster registration. To onboard thousands of clusters, use agent-initiated registration. This pull-based model handles unreliable networks gracefully and avoids exposing the Kubernetes APIs of every remote cluster to the internet.

The Implementation: Generate a ClusterRegistrationToken on the central cluster. The fleet-agent on the downstream cluster uses this token to authenticate and pull its assigned BundleDeployment resources.

  • Avoid manager-initiated registration at massive scale; while it can push new agents to downstream clusters, it requires the central manager to have direct inbound network access to all downstream APIs (via kubeconfig) and introduces unnecessary reconciliation overhead.

  • For deployments exceeding a few thousand clusters, distribute the control-plane workload by configuring Fleet sharding and grouping.
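A minimal token sketch for agent-initiated registration, assuming the fleet-default workspace and that spec.ttl bounds the token lifetime (verify the field against your Fleet version):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterRegistrationToken
metadata:
  name: edge-onboarding          # hypothetical name
  namespace: fleet-default
spec:
  ttl: 240h                      # let the token expire once onboarding completes
```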

Implement Sharding

The Best Practice: Sharding primarily improves performance during the bundle creation (apply) phase, where repositories are monitored and cloned. It doesn’t significantly improve the targeting phase, which is already fast, and it provides no benefit during the deployment phase, where only the downstream agents and the control plane Kubernetes API server are involved.

The Implementation: Assign shard labels to GitRepo resources using fleet.cattle.io/shard-ref label to distribute workloads. Configure dedicated Fleet controllers to only process repositories that match their specific shard index.

To maximize performance benefits, use node selectors to assign different controller shards to different dedicated Kubernetes worker nodes. This approach is highly recommended if a single node doesn’t have enough CPU to handle the processing load.
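A sketch of a sharded GitRepo; the label value must match the shard ID configured on one of the dedicated Fleet controllers (names are hypothetical):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: heavy-monorepo                  # hypothetical name
  namespace: fleet-default
  labels:
    fleet.cattle.io/shard-ref: shard-1  # processed only by the matching controller shard
spec:
  repo: https://git.example.com/org/monorepo   # hypothetical URL
```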

Performance and Hardware Tuning (Control Plane)

Tune etcd for High Scale

The Best Practice: Back etcd with fast NVMe (Non-Volatile Memory Express) or SSD storage using ext4 or xfs filesystems (avoid btrfs). Increase the etcd keyspace quota from the default 2 GB to 8 GB to accommodate large resource volumes.

The Implementation: Bundles and Content resources reside in etcd. Monitor their footprint by querying bundles.fleet.cattle.io and bundledeployments.fleet.cattle.io via the upstream API.

Dedicate Nodes for Fleet

The Best Practice: The Fleet controller is CPU-intensive and can overwhelm the Kubernetes API server. Schedule the controller on dedicated worker nodes to ensure consistent performance.

The Implementation: Apply the nodeSelector.role=agent configuration to the Fleet controller deployment to pin it to dedicated compute nodes.

Optimize Control Plane Caching

The Best Practice: Caching in the control plane is essential for reducing API load, but at massive scale, the Kubernetes API server can become overwhelmed and return stale cached data. To mitigate this, you must optimize what gets cached and actively monitor the control plane for stale data.

The Implementation: Set the CATTLE_SYNC_ONLY_CHANGED_OBJECTS environment variable to mgmt,user on the Rancher pods to instruct the controllers to cache and sync only resources that have actually changed, which reduces memory footprint and API server load.

You can use the fleet monitor CLI tool to detect API Time Travel by fetching resources multiple times to confirm if the API server is serving stale cached data. You can pipe this output into fleet analyze to diagnose exactly why a bundle deployment is stuck.

Secrets Management

Use valuesFrom: Do Not Store Plain-Text Secrets in Git

The Best Practice: Never embed plain-text credentials in fleet.yaml or Git repositories. Keep credentials local to the cluster or manage them via secure injection.

The Implementation: Use the helm.valuesFrom field in fleet.yaml to read Helm values from a Secret or ConfigMap that already exists on the downstream cluster. To copy credentials from the upstream cluster, use the downstreamResources field in the HelmOps CRD, which dynamically copies existing upstream secrets to downstream clusters before deployment.
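As named in this section's heading, fleet.yaml supports valuesFrom so that values stay out of Git entirely; a sketch with hypothetical names:

```yaml
# fleet.yaml
helm:
  valuesFrom:
    - secretKeyRef:
        name: app-credentials   # hypothetical Secret on the downstream cluster
        namespace: default
        key: values.yaml        # key holding a YAML values document
```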

Drift Management and Edge Deployments

Delegate Drift Correction to the Edge

The Best Practice: Perform drift detection and correction locally on the downstream cluster. This allows the system to maintain the desired state even when disconnected from the management cluster.

The Implementation: The fleet-agent handles this natively. However, to prevent false Modified states caused by expected runtime drift (e.g., CA certs injected by webhooks), run the fleet bundlediff CLI command. It outputs a ready-to-use JSON pointer patch (e.g., $setElementOrder/ELEMENTNAME) that you can paste into the diff.comparePatches array in fleet.yaml.

Fleet automatically ignores standard Horizontal Pod Autoscaler (HPA) replica changes as long as they stay within the HPA’s configured min/max range.

Configure Ignore Rules for Known Drift

The Best Practice: Strictly maintain the Git repository as the single source of truth for your deployments. You must avoid allowing external tools or manual interventions to modify resources currently managed by Fleet.

However, certain resources are naturally amended at runtime by native Kubernetes controllers or webhooks (such as a cluster injecting a CA certificate into a ValidatingWebhookConfiguration). To prevent these expected runtime mutations from causing false Modified or OutOfSync statuses, you must instruct Fleet to explicitly ignore those specific fields.

The Implementation: Use the ignore list in fleet.yaml to specify fields or resources that the Fleet agent should disregard during status monitoring.
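One way to express such a rule is the diff.comparePatches array mentioned earlier in this document; a sketch for a webhook-injected CA bundle (the resource name is hypothetical):

```yaml
# fleet.yaml — ignore a CA bundle that the cluster injects at runtime
diff:
  comparePatches:
    - apiVersion: admissionregistration.k8s.io/v1
      kind: ValidatingWebhookConfiguration
      name: example-webhook          # hypothetical resource name
      operations:
        - op: remove
          path: /webhooks/0/clientConfig/caBundle
```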

Bandwidth-Constrained Environments (Edge)

Increase Refresh Intervals for External Repositories

The Best Practice: In bandwidth-constrained edge environments, minimize Git fetch traffic to external repositories.

The Implementation: Increase the refreshInterval on ClusterRepo resources to a high value, such as several months or a year. You can still force manual updates through the Rancher UI when necessary.

Avoid Push Alerts for Air-Gapped Setups

The Best Practice: Use the agent-based pull model to prevent false deployment failures when air-gapped clusters lose connectivity. Clusters will reconcile automatically once the connection is restored.

The Implementation: To stop a deployment explicitly, set paused: true in fleet.yaml. The bundle will enter an OutOfSync status gracefully without triggering errors.

Testing and CI/CD Integration

Dry Run and Testing CLI

The Best Practice: Validate templates and preview cluster targeting locally before pushing changes to the management cluster.

The Implementation: Use the Fleet CLI for validation. The fleet apply command renders bundle resources from a local directory, and the fleet test command previews how a bundle’s targeting and customizations match a cluster, so templating and targeting errors surface before a commit reaches the management cluster.

Managing Rollouts and Image Pull Storms

Prevent Image Pull Storms

The Best Practice: Avoid simultaneous container image pulls across thousands of edge nodes, which can crash registry networks.

The Implementation: Control rollouts using variables in fleet.yaml, such as maxNew, autoPartitionSize, and maxUnavailable, to stage deployments in controlled batches. Additionally, use the Schedule CRD to define specific windows when deployments are permitted.

The default values are:

  • maxNew: 50

  • autoPartitionSize: 25%

  • maxUnavailable: 100%
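A sketch of staged rollout settings; this assumes the variables above live under rolloutStrategy in fleet.yaml, so verify field placement against your Fleet version:

```yaml
# fleet.yaml
rolloutStrategy:
  maxNew: 50              # clusters receiving a brand-new bundle at once (default 50)
  autoPartitionSize: 25%  # partition size when none is defined explicitly (default 25%)
  maxUnavailable: 10%     # tighten from the 100% default to slow each wave
```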

Observability and Debugging

Use the Fleet Benchmark Suite

The Best Practice: Do not estimate hardware capacity. Run benchmarks to identify bottlenecks in bundle creation and target matching for your specific environment.

The Implementation: Run ./fleet-benchmark run to generate simulated GitRepo resources and measure processing times through to the BundleDeployment stage.

Monitor API Time Travel and Stuck Resources

The Best Practice: Regularly audit upstream logs for storage errors or stuck states where reconciliation fails.

The Implementation: Use the Fleet CLI for diagnostics:

  • fleet monitor: Outputs a snapshot of the current state in JSON. Available from version 2.14.

  • fleet analyze: Diagnoses specific resource issues. Available from version 2.14.

  • fleet dump: Exports raw Kubernetes data for troubleshooting. Available from version 2.14.

  • Reference: Fleet Troubleshooting, Fleet Monitor Reference.

OCI Storage Best Practices

Offload Massive Bundles to OCI

The Best Practice: If your Kubernetes bundles compress to more than 1MB, etcd cannot store them. Use OCI registries to store Helm charts or large bundles; Fleet pulls them directly from the registry, drastically reducing API traffic and etcd database bloat.

The Implementation: Create a Kubernetes secret named ocistorage in the GitRepo namespace or reference a custom secret via the ociRegistrySecret field in your GitRepo custom resource. While OCI registries are typically used to store container images, Fleet uses them in this scenario to store your bundles or charts as OCI artifacts. Fleet automatically checks the integrity of these deployed OCI artifacts and pulls the ones tagged as latest.

By default, Fleet stores Kubernetes bundle resources directly in the etcd database.
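A GitRepo sketch pointing Fleet at a custom OCI-storage credential (names are hypothetical); without ociRegistrySecret, Fleet looks for the secret named ocistorage in the GitRepo namespace:

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: big-bundles                   # hypothetical name
  namespace: fleet-default
spec:
  repo: https://git.example.com/org/large-repo   # hypothetical URL
  ociRegistrySecret: my-oci-creds     # credentials for the registry holding the bundles
```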

Beware of the Secret Penalty

The Best Practice: While OCI reduces manifest size in etcd, it increases the total object count.

The Implementation: Using OCI creates a resource multiplier of (Clusters × Bundles × 3). This is due to the creation of an additional Secret (content-access) per downstream cluster for registry authentication.

Common Scale Pitfalls to Avoid

Exceeding API Server Limits

The Best Practice: Do not exceed 100,000 BundleDeployments. Beyond this point, standard API queries (like kubectl get) may take several minutes, destabilizing the cluster. Note how quickly the multiplier reaches that scale: deploying 150 bundles to 2,000 clusters already produces 300,000 BundleDeployments.

The Implementation: Run fleet cleanup (or subcommands like fleet cleanup clusterregistration) periodically to remove outdated registrations and orphaned jobs.

Pre-configure CRDs for Encryption

The Best Practice: If you use Git to manage credentials, ensure the control plane is prepared. Failure to do so breaks deployments.

The Implementation: If Kubernetes encryption at rest is enabled on the management cluster, add Fleet CRDs (such as gitrepos.fleet.cattle.io and bundles.fleet.cattle.io) to the encryption resource list.

Advanced Configuration and Security

Override Target Rules Without Modifying Git

The Best Practice: Use overrides to quickly change deployment targets without modifying the primary Git repository layout or affecting other environments.

The Implementation: Apply the overrideTargets array in fleet.yaml. This intercepts and replaces the targets defined in the GitRepo resource, allowing for temporary routing or localized testing.
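A hedged sketch, assuming overrideTargets entries take the same shape as regular targets (the label is hypothetical):

```yaml
# fleet.yaml — temporarily route this bundle to staging only
overrideTargets:
  - clusterSelector:
      matchLabels:
        env: staging
```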

Restricting GitRepo Deployments (Multi-Tenancy)

The Best Practice: In multi-tenant environments, prevent users from deploying bundles into unauthorized namespaces or using restricted service accounts.

The Implementation: Deploy a GitRepoRestriction CRD. This allows administrators to define whitelists for allowedTargetNamespaces, allowedServiceAccounts, and allowedRepoPatterns.
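A sketch, assuming the restriction fields sit at the top level of the object as in the Fleet CRD (all names are hypothetical):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepoRestriction
metadata:
  name: tenant-a-limits
  namespace: tenant-a                # constrains GitRepos created in this namespace
allowedTargetNamespaces:
  - tenant-a-apps
allowedServiceAccounts:
  - tenant-a-deployer
allowedRepoPatterns:
  - https://git.example.com/tenant-a/*
```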