GitOps: thoughts, challenges, etc

GitOps is a great idea: describe the desired state of something through some files in git, and have a program running somewhere that monitors the actual vs this desired state and constantly strives to make the actual state match the desired state.

It has already been applied in many large production-grade systems (based on talks at conferences) but as with any marketing and publicity, the details matter. What do you *actually* need to make gitops work?

I'm not going to provide a *definitive* answer to this question because the landscape is constantly changing. So I'm just going to provide some thoughts that represent my own journey through gitops. I may update this post as my journey evolves, or I might write a new post, I'll see!

Aspects to consider:

  • If you want to be able to replicate an environment (for troubleshooting, testing etc), you need *everything* to be captured in git *in some fashion* (TBD below).

    • This means not just the kubernetes manifests for your micro-service; it also includes the tools used to activate the target state (such as kubectl, helm, spinnaker).

    • Here *in some fashion* should be interpreted liberally: it could mean eg that you have a Dockerfile that builds a docker image with the specific versions of the tools, AND your gitops uses that container to reach the desired state, AND it also records what docker tag of that container was used in the metadata of the cluster.

  • Gitops is often discussed in the context of kubernetes because the controller model of kubernetes is a natural fit for gitops. HOWEVER, in reality there are many deployment-related resources that live outside of a kubernetes cluster:

    • databases and other forms of storage (blobs like S3, caching services like AWS redis)

    • message queues like SQS

    • lambdas / serverless functions

    • security groups / firewall rules

    • networking (eg a micro-service being extended might need a new VPC endpoint)

    • etc.

  • The non-cluster infra described in the previous point must be versioned too, as must anything else that is not micro-service code:

    • docker files

    • test files

    • helm charts

    • values files

    • kustomize files

    • terraform files

    • k8s manifests

    • jenkins files and pipeline definitions in general

    • pulumi / Python scripts

    • cloudformation templates

    • OPA / policy files

    • etc

  • A lot of git commits are typically needed on a feature branch before a PR is opened. Yet developers need to run the code associated with these commits in a representative "QA" or "sandbox" environment, without waiting to get to the PR stage. So there has to be a mechanism for developers to deploy new code into a live sandbox without a PR. Yet it is not safe to give them carte blanche either, because kubernetes provides access to huge compute resources which could get activated even inadvertently (by both junior and experienced developers, for a variety of reasons -- to err is human!).

  • When you have multiple developers working on features, they need a way to know what failed in their deployment.

    • This means that the gitops system must capture the logs of its deployments so that they are still accessible after a failed service deployment gets rolled back, so the next developer can deploy into a working environment.

    • What if the developer needs to troubleshoot? Clearly they can't deploy into the integration environment of the CI/CD. They need to be able to create an environment on demand and deploy into it.

    • How are they going to know the context of that environment, ie the versions of all the other pods that were running at the moment that the test failed in the gitops/ci/cd deployment? The gitops must capture, upon a failure of deployment / test, the versions of all other services running. Moreover, the state of the databases must be captured, even possibly the state of caches. This is getting hard!

  • Running something from the command line should be allowed, but there should be a sentinel that alerts if drift is not removed after a certain amount of time (eg an hour).

  • There does not seem to be any single piece of software that does all of the above.

  • The tools available to tackle some of the above are rather big and involve a significant amount of ramp-up time to understand how to map your requirements to their capabilities and to their declarative language. Not to mention that, if something does not work, these tools add another layer to troubleshoot.

  • Intuit apparently uses 3 repos per micro-service! Although one of the 3 seems to be spring-boot specific.

  • The infrastructure related to gitops must be re-instated if the cluster gets lost / damaged!

    • This means that the gitops setup must also be captured in git. And if you have 100 clusters, do you really want each one to have its own gitops controller and dashboard, and config in a git repo?

    • It seems that it would be more effective to have one cluster dedicated to gitops, which can therefore have different life cycle requirements and constraints than the deployment environments, and would present one dashboard. In that case, the gitops agent would get a user/group in k8s RBAC.

  • Keeping the infra code separate from the micro-service git repo has pros but also cons: the build artifact may require changes to the infra code that will be made asynchronously. Hence, for a while, the gitops will attempt to deploy a docker image that cannot work and will have to roll back, which is wasteful and adds audit noise.

  • The use of git as an event "bus" seems to complicate the gitops architecture. Eg I have built a simple gitops system that uses one git repo per micro-service, which contains the helm chart, the values and the sops-encrypted secrets for the different environments (qa, staging, prod).

    • In that case, there is no need for gitops tools to write to git repos, which in turn enables one repo rather than 2. The helm chart / values can be updated at the same time as the micro-service, and the PR is a great place to see what the developer is proposing: everything is in one place, close to the code. The test code and data are in the git repo; so should the ci/cd code and data be!

    • Some might argue that this is dangerous, and that you want to control what happens at the infra level. I argue that the spirit of devops is to narrow the gap between dev and ops; now gitops via tools like flux and argocd seems to be widening that gap by requiring separate repos that can have PRs for infra etc. Seems very onerous.

  • Ironically, pull requests are not native to git. They are a controlled merge, based on a service provided by a third-party such as github, bitbucket, gitlab etc.
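
The drift sentinel mentioned above (command-line changes are allowed, but an alert fires if the drift is still there after, eg, an hour) can be sketched in a few lines of Python. This is only a sketch under assumptions: the `DriftSentinel` class, the service names and the idea of representing desired and actual state as simple name-to-version dicts are all made up for illustration; a real system would get the actual state from the cluster and the desired state from git.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class DriftSentinel:
    """Tracks how long each resource has differed from its desired state."""

    # How long manual changes are tolerated before alerting (assumed: one hour).
    grace_period: timedelta = timedelta(hours=1)
    # Resource name -> time at which drift was first observed.
    first_seen: dict = field(default_factory=dict)

    def check(self, desired: dict, actual: dict, now: datetime) -> list:
        """Return the names of resources whose drift has outlived the grace period."""
        overdue = []
        for name in sorted(set(desired) | set(actual)):
            if desired.get(name) == actual.get(name):
                # No drift (or drift was removed): forget any earlier sighting.
                self.first_seen.pop(name, None)
                continue
            # Remember the first time this drift was seen, then measure its age.
            started = self.first_seen.setdefault(name, now)
            if now - started >= self.grace_period:
                overdue.append(name)
        return overdue


if __name__ == "__main__":
    sentinel = DriftSentinel()
    desired = {"orders": "1.4.0", "billing": "2.1.0"}       # hypothetical, from git
    actual = {"orders": "1.4.0-hotfix", "billing": "2.1.0"}  # hypothetical, from cluster
    t0 = datetime(2024, 1, 1, 12, 0)
    print(sentinel.check(desired, actual, t0))                          # still within grace period
    print(sentinel.check(desired, actual, t0 + timedelta(minutes=90)))  # past grace period
```

The point of the design is that drift is not treated as an error by itself; only drift that nobody cleaned up becomes an alert, which leaves room for troubleshooting from the command line.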

About Weaveworks paper "Automating Kubernetes with GitOps":

  • Stronger security guarantees:

    • most orgs that I know do not use signed commits.

    • git is not completely immutable; like anything it can be hacked, controls must be set up (typically manually for each repo), and history can be rewritten (eg rebase, reset, force push, etc). Since git becomes the entry point to deployment into some clusters, it is just a matter of time before these aspects get exploited.

    • "separation of responsibility between packaging software and releasing it to a production environment embodies the security principle of least privilege": this is true, but can also increase the divide between dev and ops. Only the weakest link needs to be exploited, since gitops creates a chain that can end in prod just by checking in code, with no need to know anything about the deployment environments (passwords, commands to run, registries, etc); in fact, the gitops minimizes (literally) the amount of knowledge an attacker needs in order to change the system.

  • Reduced mean time to recovery: assumes many of the things I described earlier are in git too!

  • What you need for gitops:

    • declarative description of the entire system; as described above, this also includes third-party dependencies, including the gitops setup itself!

    • ability to auto apply approved changes: yes but re "you don’t need specific cluster credentials to make a change to your system. With GitOps, there is a segregated environment that the state definition lives outside of. This allows your team to separate what they actually do from how they are going to do it." I don't totally buy that approach: it requires extra git repos, it interferes with dev agility, it widens the gap between dev and ops, and it can produce undeployable artifacts.

  • Seems to focus on the PR as the way to validate changes, but in reality a PR usually comes near the end of a feature branch, yet changes may need to be validated before the branch is PR'd, as described earlier.

  • CICD pipeline:

    • integration used to refer to the notion of bringing together many components of a system for testing;

    • integ tests in a micro-services env require a live env due to the distributed architecture (at the very least, a subset of a container's dependencies must be running in other containers in the same network)

    • therefore the integ tests should actually come after the orchestrator: from a git commit you build the docker image and run unit tests in it (the image is the unit of test), push it to the image repo, deploy it via the orchestrator, then run integration tests. If the deployment or the tests fail, the orchestrator must roll back; if all succeeds, there may be further actions such as pushing the image to a "release" repo and/or deploying to staging and/or prod (if tests in staging pass).

    • database migrations may be required, but in an automated pipeline they must also be automatically triggered. Although they should always be backwards compatible, errors happen, and someone is bound to release a service that runs a db migration that breaks backwards compat. Then what happens, and how do you deal with it? Especially when more than one branch is being modified in the git repo.

    • security problems: CI need not deploy; you can separate ie trigger on new docker image, helm chart, etc. Also, the API creds used by CI tooling will be needed by the CD running inside the cluster, so whereas there is one way to gain access to CI, there are many to gain access to CD (by cracking any of the micro services running in the same cluster as the CD).

    • cluster goes down: rebuild all images? that's silly. Eg if you use jenkins to build and spinnaker to deploy, all you need to know is the commit hash that was deployed for each service, and re-run the deployment action manually. I build CD so that all deployment actions can be done in one go via a version manifest file. I believe this statement may be a shortcoming that stems from gitops doing both the building and the deployment. Eg it should be possible to tell a good gitops system to deploy the latest master commit of git repo A and B and C since the code is all there, there should be no need to rebuild anything.

    • gitops deployment pipeline "automates complex error prone tasks like having to manually update YAML manifests": helm charts and kustomize files are all based on yaml, so anytime the configuration of a service changes (eg, a new feature requires a new env var to exist), there is a chance the YAML of the chart or kustomize will need modification and possibly refactoring (eg once you reach a point that there are many env vars that could be defined more simply via a loop, the helm chart template will have to be edited significantly, or the kustomize will have to be split heavily, and typically these changes will have to be done to many different charts -- that should be the case if you have naming and infra code conventions)

    • security-wise, the weakest link is that gitops as described in that doc needs write access to the git repos. I really dislike that approach, which seems to stem from the desire to separate dev from ops. With the technique I describe above, there is no need for automated writes to git repos, which improves security further. Moreover, the prod cluster is public facing, and that is the cluster that has gitops with write access to git!!! I bet my money that this will cause much pain in the next few years as this gets exploited.

    • the separation between CI and CD-via-gitops as separation between dev and deployments is actually incorrect: integration comes from joining together pieces into a system, and this requires deployment; the whole notion that CI and CD are two separate things is in fact on shaky ground in the world of containerized services.

    • the gitops workflow described does not match my experience; as described earlier, a typical developer will need to deploy their service in a live environment where they can "integrate" it with real (or at least mock or fixture) DB and/or upstream/downstream services; and they will need to do that without a PR on their branch.

    • pages 12-14 are a marketing pitch for Weaveworks Enterprise K8S Platform (nothing wrong with that, just nothing in those pages that is relevant to this journey).
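
To make the "cluster goes down" point above more concrete, here is a rough Python sketch of the version manifest idea: the manifest records the commit hash deployed for each service, and a lost cluster is restored by redeploying those exact images, with no rebuild. The function name `plan_deployment`, the service names, the hashes and the `service:commit` image tag convention are all assumptions for illustration, not the output of any particular tool.

```python
def plan_deployment(manifest: dict, running: dict) -> list:
    """Return (service, image_tag) pairs needed to make `running` match `manifest`.

    `manifest` maps service name -> commit hash that should be deployed.
    `running` maps service name -> commit hash currently deployed
    (empty for a brand new or rebuilt cluster).
    Assumption: images were already built and tagged "<service>:<commit>",
    so nothing needs rebuilding.
    """
    plan = []
    for service, commit in sorted(manifest.items()):
        if running.get(service) != commit:
            plan.append((service, f"{service}:{commit}"))
    return plan


if __name__ == "__main__":
    # Hypothetical manifest, as it might be stored in git.
    manifest = {"orders": "abc123", "billing": "def456", "users": "0a1b2c"}

    # Rebuilt cluster: nothing is running, so everything gets redeployed.
    for service, image in plan_deployment(manifest, running={}):
        print(f"deploy {service} using image {image}")

    # Existing cluster: only services that differ from the manifest are touched.
    print(plan_deployment(manifest, running={"orders": "abc123", "billing": "old999"}))
```

The same manifest also doubles as the "context" record discussed earlier: saving a copy of it when a deployment or test fails tells the developer which versions of the other services were running at that moment.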