Quick Definition (30–60 words)
Kubernetes manifests are declarative configuration files that describe desired Kubernetes objects and their properties. Analogy: a manifest is like a recipe card that tells a chef what to prepare and how to serve it. Formally: a manifest maps YAML or JSON declarations to the Kubernetes API for desired-state reconciliation.
What are Kubernetes manifests?
What it is:
- A Kubernetes manifest is a declarative specification for one or more Kubernetes API objects, written in YAML or JSON, that the API server accepts to create, update, or delete resources.
- It describes desired state: the object kind, metadata, and spec; the status field is reported back by controllers as they reconcile (a minimal example follows below).
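A minimal sketch of that structure: one hypothetical Deployment whose name, labels, and image tag are placeholders. The point is the kind/metadata/spec split, not the specific values.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                 # placeholder name
  labels:
    app: web
spec:
  replicas: 2               # desired state; the controller reconciles toward it
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25 # example tag; pin your own images
          ports:
            - containerPort: 80
```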
What it is NOT:
- Not imperative commands; applying manifests triggers the control loop to reach desired state.
- Not a full CI/CD pipeline; manifests are one artifact in a delivery system.
- Not a runtime binary or image; manifests reference containers and resources but do not contain executable code.
Key properties and constraints:
- Declarative: describes desired state, not steps.
- Idempotent: applying the same manifest multiple times should converge to the same state.
- Strong typing: follows Kubernetes API schema for each resource kind.
- Namespaced vs cluster-scoped: some manifests apply in a namespace, others cluster-wide.
- Validation: admission controllers and API server enforce schema and policies.
- Immutability constraints: some fields cannot be changed after creation; updates may require resource recreation.
- Security and RBAC: apply operations require proper permissions.
- Size and complexity: large manifests can be templated, generated, or packaged.
Where it fits in modern cloud/SRE workflows:
- Source of truth for infrastructure as code; versioned in Git.
- Input to GitOps pipelines and CI/CD systems.
- Trigger for automated controllers and operators.
- Basis for policy enforcement, security scans, and compliance audits.
- Integration point for observability and deployment strategies.
Diagram description (text-only):
- Developer writes manifest in Git.
- CI validates manifest (lints, schema check, tests).
- GitOps or CI/CD applies manifest to cluster.
- API server receives manifest and stores desired state in etcd.
- Controllers watch desired state, reconcile actual state by creating pods, services, etc.
- Kubelet and container runtime run workloads; status flows back to controllers, then to API server and GitOps monitors.
Kubernetes manifests in one sentence
A Kubernetes manifest is a declarative file that tells the Kubernetes API what objects and configuration you want, so controllers can reconcile the cluster to that desired state.
Kubernetes manifests vs related terms
| ID | Term | How it differs from Kubernetes manifests | Common confusion |
|---|---|---|---|
| T1 | Helm Chart | Template package that generates manifests | Charts are not manifests themselves |
| T2 | Kustomize | Overlay tool that modifies manifests | Output still manifests |
| T3 | Operator | Controller that manages resources automatically | Operators may use manifests internally |
| T4 | CRD | Extends the API with new kinds | A CRD is itself applied as a manifest; it defines the schema for custom resources |
| T5 | Deployment | Specific resource kind described by a manifest | Deployment is one possible manifest kind |
| T6 | Pod | Runtime unit described by manifest | Pod is an object not a file format |
| T7 | GitOps | Workflow using Git as source of truth | GitOps uses manifests for reconciliation |
| T8 | Container image | Binary artifact run by pods | Images are referenced by manifests, not embedded in them |
| T9 | Terraform | Provisioner for infra, different language | Terraform may generate manifests |
| T10 | Kubeconfig | Client auth config for cluster ops | Not a manifest; used to apply manifests |
Why do Kubernetes manifests matter?
Business impact:
- Revenue continuity: correct manifests ensure apps run as designed; misconfigurations cause outages that can impact revenue.
- Trust and brand: repeatable, auditable deployment artifacts reduce unexpected behavior.
- Compliance and auditability: manifests in Git provide historical record for audits and governance.
Engineering impact:
- Reduced toil: declarative desired state reduces manual imperative ops.
- Faster recovery: consistent manifests enable automated rollbacks and reproducible recoveries.
- Velocity: teams can iterate safely when manifests are tested and validated.
SRE framing:
- SLIs: uptime of resources, rollout success rate, manifest application success.
- SLOs: target failure rates for deployments, acceptable rollback frequency.
- Error budgets: track failed deployments and incidents caused by manifest changes.
- Toil: manual edits and ad-hoc fixes are reduced when manifests are automated.
- On-call: clear authoring and review reduces noisy alerts from misconfigured resources.
What breaks in production (realistic examples):
- Mis-specified resource requests/limits cause node OOMs or CPU starvation.
- Incorrect service selectors lead to traffic blackholes.
- Missing liveness/readiness probes cause slow failure detection during rollout (see the probe and resource sketch after this list).
- Insecure container runtime settings expose privileges, causing security incidents.
- Version skew or API deprecation in manifests causes controller errors after an upgrade.
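Several of the failure modes above trace back to missing probes or unset resources in the pod template. A hedged sketch of the relevant container fields; the paths, ports, and values are illustrative assumptions, not recommended defaults:

```yaml
# Container fields inside a Deployment/StatefulSet pod template.
containers:
  - name: api
    image: registry.example.com/api:1.4.2   # hypothetical image
    resources:
      requests:                              # what the scheduler reserves
        cpu: 250m
        memory: 256Mi
      limits:                                # hard ceilings enforced at runtime
        cpu: 500m
        memory: 512Mi
    readinessProbe:                          # gates traffic during rollouts
      httpGet:
        path: /healthz                       # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:                           # restarts a wedged container
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```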
Where are Kubernetes manifests used?
| ID | Layer/Area | Where Kubernetes manifests appear | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge | Manifests package edge workloads and ingress rules | Request latency, error rates at edge | Ingress controllers, Helm, Kustomize |
| L2 | Network | Service and NetworkPolicy objects in manifests | Network policy denials, packet drops | CNI plugins, NetworkPolicy |
| L3 | Service | Deployment, StatefulSet, and Service manifests | Pod restarts, rollout status, latency | kubectl, Helm, ArgoCD |
| L4 | Application | ConfigMaps, Secrets, and volume mounts defined by manifests | Application logs, config reloads | Kustomize, SealedSecrets |
| L5 | Data | PersistentVolumeClaims, StatefulSet volumes | Disk IOPS, mount failures | CSI drivers, storage provisioners |
| L6 | IaaS | Node pools and cloud-provider resources provisioned for the cluster | Node churn, provisioning errors | Terraform, cloud providers |
| L7 | PaaS | Platform charts and manifests for managed services | Service availability, versions | Operators, Helm charts |
| L8 | CI/CD | Pipeline outputs generate manifests | Pipeline pass/fail, lint results | GitHub Actions, GitLab CI, ArgoCD |
| L9 | Observability | Manifests for agents and exporters | Metrics coverage, scrape errors | Prometheus, Fluentd, Grafana |
| L10 | Security | Pod Security admission and RBAC manifests | Audit logs, denied actions | OPA Gatekeeper, RBAC |
When should you use Kubernetes manifests?
When it's necessary:
- Deploying workloads to Kubernetes clusters.
- Defining infrastructure that Kubernetes manages (Services, PVs, Ingress).
- When you need versioned, auditable desired state for a cluster.
When it's optional:
- Small ad-hoc clusters for rapid experimentation (imperative kubectl run is acceptable short-term).
- When using a managed PaaS that abstracts Kubernetes details; manifests may be unnecessary.
When NOT to use / overuse it:
- Avoid embedding secrets directly in plain manifests; use secret management.
- Don't treat manifests as a dumping ground for environment-specific settings; use overlays or external config.
- Avoid massive monolithic manifests without modularization; they are hard to review and test.
Decision checklist:
- If you need reproducible, versioned cluster state and automated reconciliation -> use manifests in GitOps.
- If you need per-environment customizations -> use Kustomize or templating and keep base manifests (a sketch follows this checklist).
- If you require advanced lifecycle logic -> consider Operators instead of large manual manifests.
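For the per-environment customization point above, a thin overlay usually suffices. A minimal sketch of an overlay kustomization.yaml, assuming a conventional base/ and overlays/ directory layout; field names vary slightly across Kustomize versions:

```yaml
# overlays/production/kustomization.yaml (directory layout is an assumption)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base                 # shared, environment-agnostic manifests
patches:
  - path: replica-patch.yaml   # hypothetical patch bumping replicas for production
commonLabels:
  environment: production
```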
Maturity ladder:
- Beginner: single-cluster, manifests stored in Git, manual kubectl apply.
- Intermediate: CI pipeline validates manifests, Kustomize or Helm for overlays, basic GitOps.
- Advanced: full GitOps with ArgoCD, automated policies, multi-cluster promotion, operators, canary rollouts, automated remediation.
How do Kubernetes manifests work?
Components and workflow:
- Authoring: Developers write manifests describing resources.
- Validation: Linting, schema checks, security scans run in CI.
- Distribution: Manifests are stored in Git or artifact registry.
- Delivery: GitOps or CI/CD applies manifests to target clusters.
- API server: Receives manifests, validates, and persists desired state to etcd.
- Controllers: Observe desired state and create/modify underlying resources.
- Kubelet/container runtime: Start containers and integrate with node.
- Status feedback: Controllers update status fields; observability systems collect telemetry.
- Reconciliation loop: Controllers continuously reconcile actual state to desired state.
Data flow and lifecycle:
- Git -> CI validation -> API server -> Controllers -> Runtimes -> Metrics/Logs -> GitOps monitor -> Alerts -> Human action (if needed)
Edge cases and failure modes:
- Partial apply due to admission controller rejection.
- Race conditions when multiple controllers update same fields.
- Immutable fields causing failed updates.
- API version deprecation causing manifest incompatibility (see the Ingress sketch below).
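The API-deprecation edge case deserves a concrete illustration. Ingress is a common offender: its beta API versions were removed in Kubernetes 1.22 and the backend fields changed shape, so older manifests fail to apply after an upgrade. A sketch of the current form, with placeholder host and names:

```yaml
apiVersion: networking.k8s.io/v1        # was extensions/v1beta1 or networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: web
spec:
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix            # required in v1
            backend:
              service:                  # v1 replaces serviceName/servicePort
                name: web
                port:
                  number: 80
```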
Typical architecture patterns for Kubernetes manifests
- Base and overlays: Keep core manifests as a base and environment-specific overlays with Kustomize.
- Template pipelines: Generate manifests from templates (Helm, Jsonnet) with CI validation for each environment.
- GitOps operator: Git holds manifests; an operator applies them and reports drift (see the Application sketch after this list).
- Operator-managed resources: Use custom resources and operators to encapsulate lifecycle instead of hand-editing complex manifests.
- Multi-cluster promotion: Centralized repo with per-cluster overlays and promotion workflows for canary->staging->prod.
- Immutable artifact approach: Render manifests in CI, store rendered artifacts in an artifact repository, and deploy those immutable manifests.
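The GitOps-operator pattern is itself configured with manifests. A sketch of an Argo CD Application resource, assuming Argo CD runs in the argocd namespace; the repository URL, paths, and names are made up:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git   # hypothetical repo
    targetRevision: main
    path: apps/payments/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # remove resources that disappear from Git
      selfHeal: true   # revert manual drift back to the Git state
```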
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Apply rejected | kubectl apply error | Schema or admission rejection | Fix manifest or policy | API server error logs |
| F2 | Pod CrashLoop | Frequent restart events | Bad image or startup error | Check logs, fix entrypoint | Pod restart count |
| F3 | Resource starvation | High CPU throttling | Misconfigured requests/limits | Tune resources, HPA | CPU throttling metric |
| F4 | Service routing broken | 404 or no endpoints | Selector mismatch | Fix service selectors | Endpoints count zero |
| F5 | Secret leak | Plaintext secret in repo | Secrets in manifests | Move to secret manager | Git audit alerts |
| F6 | Immutable field error | Update failed | Changing immutable field | Recreate resource properly | API update errors |
| F7 | Drift | Cluster state differs from Git | Manual mutations | Enforce GitOps reconciliation | Drift count metric |
Key Concepts, Keywords & Terminology for Kubernetes manifests
The glossary below lists common terms, concise definitions, why they matter, and typical pitfalls.
- API Server – Central control plane that accepts manifests and stores desired state – Core Kubernetes entry point – Pitfall: API throttling.
- etcd – Cluster key-value store for desired state – Persistence for manifests – Pitfall: storage contention causes inconsistency.
- Controller – Reconciler that ensures actual state matches desired state – Automates resource lifecycle – Pitfall: controller crash loops.
- Reconciliation loop – Continuous process controllers run – Ensures eventual consistency – Pitfall: tight loops cause high CPU.
- kubectl – CLI tool to apply manifests – Primary developer tool – Pitfall: manual kubectl edits cause drift.
- Namespace – Logical grouping for resources – Scoping and isolation – Pitfall: resource leaks across namespaces.
- Kind – Resource type in a manifest, such as Deployment – Defines the API schema – Pitfall: wrong kind causes errors.
- Metadata – Name, labels, annotations – For discovery and ownership – Pitfall: label mismatches break selectors.
- Spec – Desired configuration of a resource – Core configuration area – Pitfall: missing fields mean defaults differ.
- Status – Runtime state reported by controllers – Observability for reconciliation – Pitfall: status may lag.
- Deployment – Declarative controller for stateless apps – Handles rolling updates – Pitfall: missing strategy causes downtime.
- StatefulSet – Controller for stateful workloads – Stable identities and volumes – Pitfall: improper PVC sizing.
- DaemonSet – Runs a pod on each node matching selectors – For node-level agents – Pitfall: resource overhead on small nodes.
- Job – Runs short-lived tasks to completion – Batch workloads – Pitfall: non-idempotent tasks may rerun.
- CronJob – Scheduled jobs via manifests – Periodic tasks – Pitfall: concurrency policy misconfigurations.
- Service – Stable network endpoint for pods – Service discovery – Pitfall: headless services and unexpected DNS behavior.
- Ingress – L7 routing rules – External traffic routing – Pitfall: controller-specific annotations.
- ConfigMap – Non-secret configuration data – Separates config from images – Pitfall: large ConfigMaps hamper rollout speed.
- Secret – Sensitive data store – Avoids plaintext secrets – Pitfall: improper encoding or exposure.
- PersistentVolume – Storage resource abstraction – Durable storage for pods – Pitfall: capacity and access mode mismatch.
- PersistentVolumeClaim – Request for a PV – Decouples storage from consumers – Pitfall: binding delays.
- StorageClass – Dynamic provisioning policy – Controls provisioners – Pitfall: misconfigured reclaimPolicy.
- RBAC – Role-based access control – Security for manifest application – Pitfall: overly permissive roles.
- PodSecurityPolicy – Deprecated and removed in Kubernetes 1.25; policies that controlled pod capabilities – Security baseline – Pitfall: cluster upgrades drop PSP support.
- PodDisruptionBudget – Limits voluntary disruptions – Controls availability during maintenance – Pitfall: too strict blocks upgrades.
- Admission controller – Intercepts requests for validation/mutation – Enforces policies – Pitfall: misconfiguration causes rejects.
- CRD – Custom Resource Definition to extend the API – Custom resources via manifests – Pitfall: operator compatibility.
- Operator – Automation pattern for app lifecycle – Complex lifecycle encapsulation – Pitfall: operator bugs can affect many resources.
- Helm – Templating and packaging for manifests – Reusable charts – Pitfall: template complexity hides runtime values.
- Kustomize – Declarative overlay tool for manifests – Layered patches – Pitfall: limited templating features.
- Jsonnet – Programmable manifest generation – Complex templating – Pitfall: steeper learning curve.
- GitOps – Git as single source of truth for manifests – Automated reconciliation – Pitfall: slow feedback loops.
- Canary rollout – Gradual deployment pattern – Reduces blast radius – Pitfall: traffic-split config errors.
- Blue-green deploy – Swap environments to reduce downtime – Quick rollback – Pitfall: double resource costs.
- HPA – Horizontal Pod Autoscaler based on metrics – Scales pods automatically – Pitfall: wrong metric targets lead to oscillation.
- VPA – Vertical Pod Autoscaler adjusts resource requests – Tuning for resources – Pitfall: may trigger restarts.
- PodTemplate – Template inside controllers for the pod spec – Reused across controllers – Pitfall: accidental mutation breaks deployments.
- Immutable fields – Fields unchangeable after creation – Require recreation – Pitfall: unexpected errors on apply.
- Finalizer – Ensures cleanup before deletion – Resource lifecycle hook – Pitfall: stuck finalizers prevent deletion.
- Label selector – Query for grouping resources – Target for services and controllers – Pitfall: selector mismatch causes no routing.
- Taint and toleration – Node scheduling constraints – Control pod placement – Pitfall: a forgotten toleration stops scheduling.
- Affinity/anti-affinity – Placement preferences – Improve performance or isolation – Pitfall: tight rules reduce scheduling options.
- ImagePullPolicy – Controls image retrieval behavior – Affects caching and updates – Pitfall: wrong setting uses stale images.
- Sidecar – Additional container in a pod for auxiliary tasks – Observability or proxy patterns – Pitfall: tightly coupled lifecycles.
- MutatingWebhook – Dynamic request mutation – Enforces defaults or policies – Pitfall: webhook unavailability blocks creates.
- ValidatingWebhook – Validates requests against policy – Enforces guardrails – Pitfall: false positives block deployments.
- ResourceQuota – Limits resources per namespace – Controls consumption – Pitfall: too-low quotas cause OOMs.
- NetworkPolicy – Defines traffic rules between pods – Microsegmentation – Pitfall: default-deny misconfiguration blocks traffic.
- ServiceAccount – Identity for a pod to call the API – Principle of least privilege – Pitfall: broad cluster-admin binding.
- ImagePolicyWebhook – Controls image admission – Enforces image signing – Pitfall: can block legitimate images if misconfigured.
How to Measure Kubernetes manifests (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Manifest apply success rate | Percentage of successful applies | CI/GitOps apply result events | 99.9% | Backoff retries mask failures |
| M2 | Time to reconcile | Time from apply to desired state | API timestamps and status conditions | < 60s for small changes | Large resources take longer |
| M3 | Deployment success rate | Percent of rollout without rollback | Controller rollout status | 99% | Flaky probes skew numbers |
| M4 | Drift detection rate | How often cluster differs from Git | Periodic diff from GitOps tool | 0% ideally | Temporary drift acceptable during rollout |
| M5 | Failed admission count | Number of rejects by admission | API audit logs | Low single digits per month | Noisy when policies change |
| M6 | Config error rate | Runtime errors from config changes | App logs correlated to deploys | Near zero | Not all errors attributed |
| M7 | Secret usage audit | Accesses to sensitive secrets | Audit logs, secret manager metrics | Monitored anomalies | High cardinality |
| M8 | Rollback frequency | How often rollbacks occur | Deployment history metrics | < 1/month per service | Automated rollbacks vs manual differ |
| M9 | Manifest change lead time | Time from commit to applied | Git commit time to apply time | < 15m for CD | Manual approvals lengthen time |
| M10 | Resource drift repair time | Time to auto-correct drift | GitOps reconcile latency | < 2m | Controllers may be throttled |
Best tools to measure Kubernetes manifests
Tool – Prometheus
- What it measures for Kubernetes manifests: Controller metrics, API server metrics, custom exporter metrics.
- Best-fit environment: Cloud-native clusters with metric scraping.
- Setup outline:
- Deploy Prometheus server with service discovery.
- Configure scrape jobs for kube-state-metrics and controller-manager.
- Instrument GitOps and CI to expose apply metrics.
- Create recording rules for SLI computation (a hedged example follows this tool's notes).
- Strengths:
- Flexible query language.
- Wide ecosystem and integrations.
- Limitations:
- Requires storage management.
- Complex for long retention.
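As noted in the setup outline above, SLIs are usually pre-computed as recording rules. A minimal sketch using the Prometheus Operator's PrometheusRule resource and two kube-state-metrics series; the rule name and namespace are assumptions, and the ratio is only a rough availability SLI:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: manifest-slis
  namespace: monitoring                 # assumed monitoring namespace
spec:
  groups:
    - name: manifest-slis.rules
      rules:
        - record: sli:deployment_available:ratio
          # fraction of desired replicas that are actually available, cluster-wide
          expr: |
            sum(kube_deployment_status_replicas_available)
              /
            sum(kube_deployment_spec_replicas)
```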
Tool – kube-state-metrics
- What it measures for Kubernetes manifests: Exposes cluster object states as metrics.
- Best-fit environment: Observability for reconciled resources.
- Setup outline:
- Deploy kube-state-metrics as a service.
- Scrape with Prometheus.
- Map object metrics to SLIs.
- Strengths:
- Granular object metrics.
- Limitations:
- Not real-time for very short windows.
Tool – ArgoCD
- What it measures for Kubernetes manifests: Git vs cluster sync status, apply history.
- Best-fit environment: GitOps-driven delivery.
- Setup outline:
- Install ArgoCD and connect to Git repos.
- Configure app projects and sync policies.
- Enable health checks and auto-sync.
- Strengths:
- Visualize drift and sync histories.
- Limitations:
- Not a metrics store; needs integration for SLI pipelines.
Tool – Flux
- What it measures for Kubernetes manifests: Sync status and reconciliation metrics.
- Best-fit environment: GitOps with Kustomize/Helm integration.
- Setup outline:
- Install Flux and source Git repos.
- Use controllers to deploy and monitor.
- Export metrics to Prometheus.
- Strengths:
- Git-native and modular.
- Limitations:
- Requires more glue for advanced policies.
Tool – Grafana
- What it measures for Kubernetes manifests: Dashboards for SLI visualization and alerts.
- Best-fit environment: Visual dashboards across teams.
- Setup outline:
- Connect Grafana to Prometheus and logs.
- Create dashboards for reconcile times and apply rates.
- Configure alerting channels.
- Strengths:
- Strong visualization capabilities.
- Limitations:
- Depends on proper data ingestion.
Tool – Audit Logs (cloud provider or Kubernetes)
- What it measures for Kubernetes manifests: Who applied what and when.
- Best-fit environment: Security and compliance.
- Setup outline:
- Enable API audit logging.
- Route logs to storage or SIEM.
- Create alerts for sensitive operations.
- Strengths:
- Forensic data.
- Limitations:
- Large volume and requires retention decisions.
Recommended dashboards & alerts for Kubernetes manifests
Executive dashboard:
- Panels:
- Global apply success rate: percentage across clusters.
- Number of deployments in progress per environment.
- Drift count across clusters.
- High-level incidents related to manifests.
- Why: Provides leadership with risk and deployment health.
On-call dashboard:
- Panels:
- Recent failed applies with user/commit.
- Rollback frequency for services.
- Current reconciliations in progress and their durations.
- Critical pod CrashLoopBackOff instances after recent deploys.
- Why: Quick triage and mitigation for deployment issues.
Debug dashboard:
- Panels:
- API server error rate and latency.
- Controller reconciliation time per object type.
- Pod restart counts and container logs for failing pods.
- Git commit to apply latency histogram.
- Why: Deep diagnostics for troubleshooting.
Alerting guidance:
- Page vs ticket:
- Page for production rollout failures causing service outage or mass failures.
- Ticket for non-urgent failed applies without impact or gated by approvals.
- Burn-rate guidance:
- If the deployment-failure burn rate consumes more than 20% of the error budget over a short window, escalate to SRE (a hedged alert-rule sketch follows this guidance).
- Noise reduction tactics:
- Deduplicate alerts by grouping failures by commit or app.
- Suppress alerts during known maintenance windows.
- Use aggregation windows to avoid transient flaps.
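To make the burn-rate guidance concrete, here is a hedged alerting-rule sketch. The gitops_apply_failures_total and gitops_apply_attempts_total counters are hypothetical metrics your CI or GitOps tooling would need to export, and the threshold arithmetic should be tuned to your actual SLO:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: manifest-burn-rate
  namespace: monitoring                      # assumed namespace
spec:
  groups:
    - name: manifest-alerts.rules
      rules:
        - alert: DeploymentFailureBurnRateHigh
          # Failure ratio over the last hour compared against an illustrative
          # threshold: 20% of a 99% SLO's error budget, i.e. 0.2 * 0.01 = 0.002.
          expr: |
            (
              sum(rate(gitops_apply_failures_total[1h]))
                /
              sum(rate(gitops_apply_attempts_total[1h]))
            ) > 0.2 * (1 - 0.99)
          for: 15m
          labels:
            severity: page
          annotations:
            summary: Deployment failures are burning the error budget too fast
```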
Implementation Guide (Step-by-step)
1) Prerequisites
- Git repository for manifests with branch protection.
- CI pipeline for linting and unit tests (a hedged workflow sketch follows these steps).
- Cluster access with proper RBAC and service accounts.
- Observability to collect metrics and logs.
- Secret management solution.
2) Instrumentation plan
- Expose apply and reconcile events as metrics.
- Add probes and resource metrics for workloads.
- Capture audit logs and controller metrics.
3) Data collection
- Configure Prometheus to scrape kube-state-metrics and the API server.
- Send logs to a centralized log store and index deploy-related logs.
- Capture Git events and CI pipeline outcomes.
4) SLO design
- Define SLIs for deployment success and reconcile time.
- Quantify acceptable failure and set SLOs with an error budget.
- Define alerting thresholds and an escalation flow.
5) Dashboards
- Create executive, on-call, and debug dashboards.
- Include change-history panels linked to commits.
6) Alerts & routing
- Route critical deployment pages to on-call SRE.
- Route non-critical failures to engineering teams via ticketing.
- Configure dedupe and suppression logic.
7) Runbooks & automation
- Write runbooks for apply failures, rollbacks, and security incidents.
- Automate common remediations (recreate resources blocked by immutable-field changes, retry transient failures).
8) Validation (load/chaos/game days)
- Run load tests that exercise new manifests under production-similar load.
- Conduct chaos experiments such as controller restarts and network partitions.
- Run game days to validate runbooks.
9) Continuous improvement
- Hold postmortems on failed deploys.
- Track common fixes and reduce friction in CI validation.
- Incrementally improve templates and automation.
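A minimal sketch of the CI validation called for in the prerequisites, written as a GitHub Actions workflow. The manifests/ path is an assumption about repository layout, kubeconform is one of several offline validators (its installation is omitted and assumed handled by the runner image), and the server-side dry run requires cluster credentials on the runner:

```yaml
# .github/workflows/validate-manifests.yaml
name: validate-manifests
on:
  pull_request:
    paths:
      - "manifests/**"            # assumed location of rendered manifests
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Offline schema validation
        run: |
          # Assumes kubeconform is available on the runner or installed in a prior step.
          kubeconform -strict -summary manifests/
      - name: Server-side dry run against staging
        run: |
          # Requires a kubeconfig for a staging cluster, e.g. injected via secrets.
          kubectl apply --dry-run=server --recursive -f manifests/
```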
Pre-production checklist:
- Lint and schema validation passed.
- Security scan for images and manifests.
- Resource requests and limits set.
- Probes configured.
- Test manifests deployed to staging.
Production readiness checklist:
- Approval from owners and SRE.
- Canary or phased rollout configured.
- Monitoring and alerting in place.
- Rollback strategy validated.
- Backup for stateful data if needed.
Incident checklist specific to Kubernetes manifests:
- Identify last manifest commit and author.
- Check GitOps sync status and drift.
- Inspect controller and API server logs.
- Validate resource usage and events.
- Execute rollback or hotfix manifest as appropriate.
Use Cases of Kubernetes manifests
Continuous delivery for microservices
- Context: Frequent deployments across many services.
- Problem: Manual deployments cause inconsistency.
- Why manifests help: Declarative, versioned changes and automatic reconciliation.
- What to measure: Deployment success rate, reconcile time.
- Typical tools: ArgoCD, Prometheus, Helm.
Multi-tenant platform management
- Context: Platform team managing many namespaces.
- Problem: Enforcing quotas and security per tenant.
- Why manifests help: Apply consistent namespace manifests and quotas.
- What to measure: Namespace drift, quota breaches.
- Typical tools: Kustomize, OPA Gatekeeper.
Stateful applications with PVs
- Context: Databases requiring persistent storage.
- Problem: Data consistency and lifecycle complexity.
- Why manifests help: Define PVCs and StorageClasses declaratively.
- What to measure: PV bind time, IOPS, backup success.
- Typical tools: CSI drivers, Velero.
Observability agent rollout
- Context: Need consistent telemetry across clusters.
- Problem: Agents are inconsistent or missing.
- Why manifests help: DaemonSet or Deployment manifests ensure agents exist.
- What to measure: Metrics coverage, scrape health.
- Typical tools: Prometheus, Fluentd, kube-state-metrics.
Security policy enforcement
- Context: Regulatory requirements for pod capabilities.
- Problem: Inconsistent security posture.
- Why manifests help: RBAC and Pod Security admission policies can be enforced declaratively.
- What to measure: Admission rejects, policy violations.
- Typical tools: OPA Gatekeeper, audit logs.
Blue-green deployment of a critical service
- Context: Service with high availability needs.
- Problem: Risky upgrades.
- Why manifests help: Define separate environments and switch traffic via Service/Ingress.
- What to measure: Request success rates during the swap.
- Typical tools: Service meshes, Ingress controllers.
Autoscaled batch workloads
- Context: Variable batch jobs.
- Problem: Resource waste and slow execution.
- Why manifests help: Job and HPA objects tune scaling behavior.
- What to measure: Job completion time and cost.
- Typical tools: HPA, CronJob, cluster autoscaler.
Edge and IoT deployments
- Context: Many small clusters at edge locations.
- Problem: Hard to manage consistent config.
- Why manifests help: GitOps applies manifests to many clusters reliably.
- What to measure: Sync status per cluster, rollout time.
- Typical tools: ArgoCD, Flux.
Canary feature rollout
- Context: Gradual feature exposure.
- Problem: Full-rollout risk.
- Why manifests help: Define progressive routing and ephemeral resources.
- What to measure: Error rates for canary vs baseline.
- Typical tools: Service mesh, traffic-split controllers.
Onboarding third-party operators
- Context: Managed services require CRDs.
- Problem: Complex lifecycle and compatibility.
- Why manifests help: CRD and operator manifests declare contracts for automation.
- What to measure: Operator reconciliation errors.
- Typical tools: Operator SDK, Helm.
Scenario Examples (Realistic, End-to-End)
Scenario #1 – Kubernetes manifest deployment causing regressions
Context: A microservice team deploys a new container image via a Deployment manifest.
Goal: Deploy v2 without downtime.
Why Kubernetes manifests matter here: The Deployment manifest defines the rollout strategy, probes, and resources.
Architecture / workflow: Git commit -> CI lint -> Render manifest -> ArgoCD sync -> API server -> Deployment controller -> ReplicaSet -> Pods.
Step-by-step implementation:
- Update image tag in Deployment manifest in feature branch.
- CI runs schema and lint checks and unit tests.
- Merge to main triggers GitOps sync.
- ArgoCD applies and starts a rolling update with maxUnavailable=1.
- Observe pod readiness and application metrics.
- If errors exceed the threshold, ArgoCD or an automated rollback executes.
What to measure: Deployment success rate, pod restart count, user-facing error rate.
Tools to use and why: ArgoCD for safe sync, Prometheus for metrics, Grafana for dashboards.
Common pitfalls: A missing readiness probe delays rollout detection (the rollout-strategy sketch below shows the relevant fields).
Validation: Canary traffic test, then ramp to full.
Outcome: Smooth rollout or automatic rollback with minimal user impact.
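The rollout behavior in this scenario is controlled by a few fields in the Deployment manifest; a sketch with the values assumed in the walkthrough (image and probe path are placeholders):

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # at most one pod down at a time
      maxSurge: 1                # one extra pod may be created during the rollout
  template:
    spec:
      containers:
        - name: app
          image: registry.example.com/app:v2   # the new tag being rolled out
          readinessProbe:                      # gates traffic shifting to new pods
            httpGet:
              path: /ready                     # assumed endpoint
              port: 8080
```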
Scenario #2 – Serverless managed PaaS with manifest-driven configuration
Context: The organization uses a managed Kubernetes service that supports serverless containers via Knative or a platform add-on.
Goal: Deploy an event-driven function with autoscaling to zero.
Why Kubernetes manifests matter here: The manifest defines the service and autoscaling parameters that the managed platform reads.
Architecture / workflow: Git commit -> CI validation -> Service manifest for the serverless object -> Platform controller provisions scaling and networking.
Step-by-step implementation:
- Write serverless service manifest with concurrency and scaling annotations.
- Validate and commit to Git.
- GitOps deploys manifest to cluster.
- Platform reconciler provisions routes and autoscaling to zero.
What to measure: Cold-start latency, scale-to-zero time, invocation success rate.
Tools to use and why: Platform logging and metrics; Prometheus for custom metrics.
Common pitfalls: Missing annotations prevent scale-to-zero.
Validation: Load test cold starts and verify billing impact.
Outcome: Efficient cost model with predictable behavior.
Scenario #3 – Incident response and postmortem for a manifest-induced outage
Context: A manifest change accidentally removed a PodDisruptionBudget, allowing simultaneous evictions and an outage.
Goal: Restore service and analyze the root cause.
Why Kubernetes manifests matter here: The PDB manifest protected availability; its removal triggered an SLO breach.
Architecture / workflow: Git commit -> CI -> Apply -> Nodes drained -> Pods evicted -> Service degraded.
Step-by-step implementation:
- Pager alerts on high error rates.
- SRE examines recent manifests and finds PDB removal commit.
- Revert commit and re-apply PDB manifest via hotfix pipeline.
- Restore availability and monitor recovery.
- Postmortem documents why the commit passed review and how to prevent recurrence.
What to measure: Time to recovery, number of affected requests, time to root-cause discovery.
Tools to use and why: Audit logs to find the commit, ArgoCD to revert, Prometheus for SLO analysis.
Common pitfalls: Long reconcile delays due to CI approvals.
Validation: Run chaos tests for PDB-related failures (a reconstructed PDB sketch follows below).
Outcome: Restored service and a policy requiring manual approval for PDB changes.
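A reconstructed sketch of the PodDisruptionBudget at the center of this incident; the labels and threshold are assumptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # keep at least two pods up during voluntary disruptions
  selector:
    matchLabels:
      app: web             # must match the protected workload's pod labels
```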
Scenario #4 – Cost vs performance trade-off for manifests with resource tuning
Context: Batch jobs use high default resource limits, causing unnecessary cluster cost.
Goal: Reduce cost while keeping the job-completion SLA.
Why Kubernetes manifests matter here: Resource requests and limits in manifests drive scheduling and resource bills.
Architecture / workflow: Git commit -> CI -> Test job manifest in staging -> Perf tests -> Apply tuned manifest to production.
Step-by-step implementation:
- Collect historical job CPU and memory usage metrics.
- Update Job manifest with optimized resource requests and limits and HPA where applicable.
- Run load tests in staging to validate completion times.
- Roll out tuned manifests gradually (a hedged example of the tuned Job follows below).
What to measure: Job cost per run, completion time, retry count.
Tools to use and why: Prometheus for usage, cost dashboards for money impact.
Common pitfalls: Too-low resources cause increased failures.
Validation: Compare cost metrics before and after over multiple runs.
Outcome: Reduced cost within acceptable performance bounds.
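A hedged sketch of the tuned Job manifest for this scenario; the request and limit values stand in for numbers derived from historical usage and are not general recommendations:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report               # hypothetical batch job
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: registry.example.com/report:1.9   # placeholder image
          resources:
            requests:
              cpu: 500m              # roughly observed p95 usage plus headroom (assumed)
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 2Gi
```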
Common Mistakes, Anti-patterns, and Troubleshooting
- Symptom: Frequent pod restarts -> Root cause: Missing or incorrect readiness/liveness probe -> Fix: Add proper probes and test.
- Symptom: Service has zero endpoints -> Root cause: Service selector labels mismatch pods -> Fix: Align labels or selectors.
- Symptom: Apply rejected by admission -> Root cause: Policy violation -> Fix: Update manifest to match policy or update policy.
- Symptom: Secrets leaked in Git -> Root cause: Secrets stored in plaintext manifests -> Fix: Move to secret manager and rekey.
- Symptom: High CPU throttling -> Root cause: Low CPU requests -> Fix: Increase requests or tune HPA.
- Symptom: Deployment never completes -> Root cause: Changing an immutable field requires recreation -> Fix: Recreate the resource with the correct spec.
- Symptom: Drift detected frequently -> Root cause: Manual kubectl edits in cluster -> Fix: Enforce GitOps and lock down write permissions.
- Symptom: Inconsistent behavior across environments -> Root cause: Environment-specific values baked into base manifests -> Fix: Use overlays or templating.
- Symptom: Long reconcile times -> Root cause: Large manifests or controllers under-resourced -> Fix: Split manifests and scale controllers.
- Symptom: Admission webhook blocks creates -> Root cause: Webhook downtime -> Fix: Add fail-open or ensure webhook HA.
- Symptom: Misrouted traffic after rollout -> Root cause: Ingress annotation mismatch for controller -> Fix: Update annotations and ingress class.
- Symptom: StatefulSet PVC not bound -> Root cause: StorageClass mismatch or capacity shortage -> Fix: Correct storage class and ensure provisioner.
- Symptom: Branch-override manifests not applied -> Root cause: GitOps path misconfiguration -> Fix: Adjust repository path and project settings.
- Symptom: High alert noise on deploys -> Root cause: Alert thresholds too low for normal rollout behavior -> Fix: Add suppression during deploys and tune thresholds.
- Symptom: Container runs as root unexpectedly -> Root cause: SecurityContext not set -> Fix: Enforce non-root via PodSecurity or manifested securityContext.
- Symptom: Job duplicates running -> Root cause: CronJob concurrency policy not set -> Fix: Set Forbid or Replace as needed.
- Symptom: Node resource exhaustion after DaemonSet -> Root cause: Agent resource footprint too high -> Fix: Tune resources and scheduling constraints.
- Symptom: Secrets not mounted -> Root cause: Secret missing or name mismatch -> Fix: Ensure secret exists in same namespace and name is correct.
- Symptom: Large merge conflicts in manifests -> Root cause: Monolithic manifest files -> Fix: Modularize with Kustomize or Helm charts.
- Symptom: Operators fail after upgrade -> Root cause: CRD schema changed -> Fix: Validate operator compatibility and plan migrations.
- Symptom: Observability gaps after deploy -> Root cause: Missing sidecar or agent not deployed -> Fix: Add necessary manifests or DaemonSets.
- Symptom: RBAC denies apply -> Root cause: Insufficient permissions for service account -> Fix: Grant minimal required roles and reapply.
- Symptom: Auto-scaling oscillation -> Root cause: Wrong metric selection or aggressive thresholds -> Fix: Smooth scaling with cooldowns.
- Symptom: Pod scheduled to wrong node -> Root cause: Taints and tolerations misconfigured -> Fix: Adjust tolerations or remove taint.
- Symptom: Hidden config drift -> Root cause: ConfigMaps updated manually -> Fix: Source config in Git and enforce pipeline.
Observability pitfalls to watch for:
- Missing reconciliation metrics prevents SLA assessment.
- Relying solely on Pod readiness without application-level checks.
- Ignoring drift metrics leading to hard-to-debug issues.
- Alerts not tied to manifest changes create noisy paging.
- Lack of audit logs prevents tracing who applied a harmful manifest.
Best Practices & Operating Model
Ownership and on-call:
- Assign clear ownership per manifest group and declare on-call responsibilities.
- SREs handle platform-level manifests; teams own app manifests.
Runbooks vs playbooks:
- Runbooks: step-by-step operational instructions for known incidents.
- Playbooks: decision trees for complex or novel situations.
Safe deployments:
- Use canary or phased rollouts and automated rollbacks.
- Use PodDisruptionBudgets to ensure availability during maintenance.
- Validate manifests in staging identical to production.
Toil reduction and automation:
- Automate rendering, validation, and policy checks in CI.
- Use GitOps to eliminate manual kubectl changes.
- Use operators for complex lifecycle automation.
Security basics:
- Use RBAC and service accounts with least privilege.
- Do not commit secrets to repos.
- Enforce admission policies for image signing and capabilities.
Weekly/monthly routines:
- Weekly: Review failing reconciliations and drift.
- Monthly: Audit RBAC and admission webhook health.
- Quarterly: Review major manifests for deprecated API versions.
What to review in postmortems related to manifests:
- Who changed what manifest and why.
- Why CI checks didn't catch the issue.
- Whether automated rollback worked as expected.
- Changes to policies or tooling required.
Tooling & Integration Map for Kubernetes manifests
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | GitOps | Continuously apply manifests from Git | Helm, Kustomize, ArgoCD, Flux | Central repo as source of truth |
| I2 | Templating | Generate manifests from templates | CI tools and Helm | Use for reusable charts |
| I3 | Policy | Enforce manifest policies at admission time | OPA Gatekeeper, Kyverno | Prevent bad manifests from applying |
| I4 | Secrets | Manage secrets referenced by manifests | KMS, secret stores, SealedSecrets | Avoid plaintext secrets |
| I5 | Storage | Provision PVs and volumes from manifest claims | CSI drivers, StorageClass | Dynamic provisioning |
| I6 | CI | Lint, test, and render manifests | GitHub Actions, GitLab CI | Automated validation pipeline |
| I7 | Observability | Collect metrics and logs for resources | Prometheus, Grafana, Loki | Monitor deployment and runtime |
| I8 | Audit | Capture apply events and changes | Cluster audit logs, SIEM | For compliance and forensics |
| I9 | Autoscaling | Scale workloads based on metrics | HPA, VPA, Cluster Autoscaler | Tie to resource requests and SLIs |
| I10 | Operator framework | Run operators managing custom resources | Operator SDK, Helm | Encapsulate lifecycle logic |
Frequently Asked Questions (FAQs)
What file formats are Kubernetes manifests written in?
Most commonly YAML or JSON; YAML is the prevalent format due to readability.
Can I use templates for manifests?
Yes, tools like Helm, Kustomize, and Jsonnet generate manifests; templates must still be validated.
Should I store manifests in Git?
Yes; Git provides versioning, review, and audit history and serves as source of truth in GitOps.
How do I manage secrets referenced by manifests?
Use secret management solutions or sealed secrets; do not store plaintext secrets in Git.
What is the difference between apply and create?
Create fails if resource exists; apply reconciles and updates declaratively.
How do I prevent accidental cluster-wide changes?
Use RBAC, admission controls, and protected branches for manifests affecting cluster-scoped resources.
How to handle Kubernetes API deprecations in manifests?
Track API versions and upgrade manifests during cluster upgrades; run validation tests in CI.
Are manifests enough for complex lifecycle operations?
Sometimes not; consider Operators for complex stateful lifecycle automation.
How to test manifests before production?
Render manifests in CI, deploy to staging, use contract tests and canary rollouts.
How to roll back a bad manifest change?
Revert the commit in Git and let GitOps reapply or use controller rollback features like Deployment rollbacks.
How to handle environment-specific configuration?
Use overlays with Kustomize or values with Helm; avoid hardcoding environment differences in base manifests.
How to detect drift between Git and cluster?
Use GitOps tools that report sync status and diff capabilities; schedule periodic audits.
What probes should I set in manifests?
Both readiness and liveness probes; readiness for traffic control, liveness for lifecycle management.
How to avoid noisy alerts during deployments?
Suppress alerts for expected transient conditions and use grouping and deduplication.
What are immutable fields and why do they matter?
Fields that cannot be changed after creation; changing them requires recreation which should be planned.
How to handle secret rotation?
Update secret store and trigger rollout of dependent workloads via manifest updates or annotations.
How to ensure manifests meet security policies?
Use admission controllers and policy tools to validate and block non-compliant manifests.
How to scale manifest management for many teams?
Use modular repos, standardized templates, and a platform team to provide curated base manifests.
Conclusion
Kubernetes manifests are the backbone of declarative infrastructure and application deployment in cloud-native environments. They enable reproducibility, automation, and policy enforcement when combined with GitOps, CI validation, and observability. Treat manifests as first-class artifacts: version, test, monitor, and protect them to reduce incidents and improve deployment velocity.
Next 7 days plan:
- Day 1: Inventory current manifests and store them in a protected Git repo.
- Day 2: Add schema linting and basic security scans to CI.
- Day 3: Deploy kube-state-metrics and Prometheus to collect object metrics.
- Day 4: Implement GitOps sync for one non-critical service.
- Day 5: Create on-call and debug dashboards for deployment metrics.
- Day 6: Run a rehearsal rollback and document a runbook.
- Day 7: Conduct a postmortem on any issues and iterate on checks.
Appendix – Kubernetes manifests Keyword Cluster (SEO)
- Primary keywords
- Kubernetes manifests
- Kubernetes manifest guide
- Kubernetes YAML manifests
- Kubernetes manifest examples
- Kubernetes declarative config
- Secondary keywords
- GitOps manifests
- Helm chart vs manifests
- Kustomize manifests
- Kubernetes manifest best practices
- Kubernetes manifest security
- Long-tail questions
- How to write Kubernetes manifests for production
- What are common Kubernetes manifest mistakes
- How to manage secrets in Kubernetes manifests
- How GitOps applies Kubernetes manifests
- How to test Kubernetes manifests in CI
- Related terminology
- Deployment manifest
- StatefulSet manifest
- Service manifest
- Ingress manifest
- ConfigMap manifest
- Secret manifest
- PersistentVolumeClaim manifest
- PodDisruptionBudget manifest
- PodSecurityPolicy manifest
- Role and RoleBinding manifest
- CustomResourceDefinition manifest
- Helm chart manifest
- Kustomize overlay
- GitOps sync
- Reconciliation loop
- Controller manager
- kube-state-metrics
- Audit logs
- Admission controller
- Sidecar manifest
- DaemonSet manifest
- CronJob manifest
- Job manifest
- Pod manifest
- ServiceAccount manifest
- ResourceQuota manifest
- NetworkPolicy manifest
- StorageClass manifest
- CSI manifest
- ImagePullPolicy setting
- MutatingWebhook manifest
- ValidatingWebhook manifest
- Operator manifest
- HorizontalPodAutoscaler manifest
- VerticalPodAutoscaler manifest
- Canary deployment manifest
- Blue-green deployment manifest
- Rollback manifest
- Immutable fields in manifest
- Finalizer manifest
- Label selector manifest
- Taint and toleration manifest
- Affinity manifest
- PodTemplate manifest
- Admission policies for manifests
- Policy as code manifests
- Manifest linting
- Manifest validation in CI
- Manifest drift detection
- Manifest reconciliation time
- Manifest apply success rate
- Manifest SLOs and SLIs
- Manifest observability
- Manifest runbooks
- Manifest CI/CD integration
- Manifest automation
- Manifest lifecycle management
- Manifest security audit
- Manifest change lead time
- Manifest canary metrics
- Manifest cost optimization
