Using Kubernetes for development?

epchris · 2 years ago

PriorProject@lemmy.world · 2 years ago

Yes, people do this kind of thing. As far as I know they all do it in fairly different ways… but what you’re describing sounds reasonable. Yes, it does tend to be expensive. The whole point, as you note, is that the env has grown to be large, so you’re hosting a bunch of large personal environments, which gets pricey when (not if) people aren’t tidy with them.

Some strategies I’ve seen people employ to limit the cost implications:

Narrow the interface. Don’t give devs direct access to the infra, but rather give them build/tooling that saves some very rich observability data from each run. Think not just metrics/logs, but configurable tracing/debugging as well. This does limit certain debugging techniques by not granting full/unfettered access to the environment for your devs, but it now makes clear when an env is “in use”. Once the CI/build job is complete, the env can be reused or torn down and only the observability data/artifacts need to be retained, which is much cheaper.
Use pools of envs rather than personal envs. You still have to solve the problem of knowing when an env is “in use”, and now also have scheduling/reservation challenges that need to be addressed.
Or automatically tear down “idle” envs. The definition of “idle” is going to get complex, and you’re definitely going to tear down an env that someone still wants at some point. But if you establish the precedent that envs get destroyed by default after some max-lifetime unless renewed, you can encourage people to treat them as ephemeral resources rather than a home away from home.
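
To make the “destroyed by default after some max-lifetime unless renewed” idea concrete, here is a minimal sketch of the bookkeeping it implies. Everything in it is an assumption for illustration: the `EnvLease` record, the `envs_to_reap` helper, and the 8-hour `MAX_LIFETIME` are hypothetical names and values, and the actual teardown (helm uninstall, namespace delete, etc.) is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical default: an env is destroyed 8 hours after its last renewal.
MAX_LIFETIME = timedelta(hours=8)


@dataclass
class EnvLease:
    name: str
    created_at: datetime
    renewed_at: Optional[datetime] = None  # set when the owner explicitly renews

    def expires_at(self, max_lifetime: timedelta = MAX_LIFETIME) -> datetime:
        # The clock restarts from the latest renewal, or from creation if never renewed.
        return (self.renewed_at or self.created_at) + max_lifetime


def envs_to_reap(leases, now, max_lifetime=MAX_LIFETIME):
    """Return the names of envs whose lease has run out at time `now`."""
    return [lease.name for lease in leases if lease.expires_at(max_lifetime) <= now]


# Example: two envs created 10 hours ago, one of them renewed an hour ago.
now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
leases = [
    EnvLease("alice-dev", created_at=now - timedelta(hours=10)),
    EnvLease("bob-dev", created_at=now - timedelta(hours=10),
             renewed_at=now - timedelta(hours=1)),
]
print(envs_to_reap(leases, now))  # ['alice-dev']
```

The point of keying expiry off `renewed_at` is that nobody has to define “idle”: keeping an env alive is an explicit, cheap action, and doing nothing means it goes away.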

None of these approaches are trivial to implement, and all have serious tradeoffs even when done well. But fundamentally, you can’t carry the cavalier attitude of how you treat your laptop as a dev env into the “cloud” (even if it’s a private cloud). Rather, the dev envs need to be immutable and ephemeral by default, those properties need to be enforced by frequent refreshes so people acclimate to the constraints they imply, and you need some kind of way to reserve, schedule, and do idle detection on the dev envs so they can be efficiently shared and reaped. Getting a version of these things that works can be a significant culture shock for eng teams used to extended intermittent debugging sessions and installing random tools on their laptop and having them available forever.

epchris · 2 years ago

Thanks for all of the suggestions!

Right now our guidance is that each developer is given a namespace and a helm chart to install, and the wording is such that developers wouldn’t think of it as an ephemeral resource (i.e. people have their helm installation up for months, and periodically upgrade it).

It would be nice to have users do a fresh install each time they “start” working, and have some way to automatically remove helm installations after a time period, but we do have times where it’s nice to have a longer-lived env because you’re working within some accumulated state.

Maybe there’s something to automatically scaling down workloads on a cadence or after a certain time period, but it would be challenging to figure out the triggers for that.

thelastknowngod@lemm.ee · 2 years ago

You can build a workflow for ephemeral environments with ArgoCD using an ApplicationSet resource with the pull request generator and the CreateNamespace=true sync option.

If a developer opens a pull request, create a generated namespace based on the branch name and PR number, then deploy their changes to the cluster, in the new namespace, automatically.
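
One detail that tends to bite here is that branch names are not always valid namespace names: Kubernetes namespaces must be DNS-1123 labels (lowercase alphanumerics and dashes, at most 63 characters, starting and ending with an alphanumeric). Below is a rough sketch of the naming step, assuming you generate the name yourself; `pr_namespace` and the `pr` prefix are hypothetical, and this is not how ArgoCD derives names internally.

```python
import re

MAX_LABEL_LEN = 63  # Kubernetes namespace names must be valid DNS-1123 labels


def pr_namespace(branch: str, pr_number: int, prefix: str = "pr") -> str:
    """Build a namespace name such as 'pr-123-feature-login-page'."""
    # Lowercase, then collapse every run of characters outside [a-z0-9] into one dash.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # Put the PR number first so truncation never removes the unique part.
    head = f"{prefix}-{pr_number}"
    if not slug:
        return head
    # Truncate the branch part so the whole name fits, then drop any trailing dash.
    room = max(MAX_LABEL_LEN - len(head) - 1, 0)
    return f"{head}-{slug[:room]}".rstrip("-")


print(pr_namespace("feature/Login_Page", 123))  # pr-123-feature-login-page
```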

With GitHub, if there is no activity on a PR after X time frame, you can have the PR closed automatically. When it’s closed, Argo will not see it as an open PR anymore so it will automatically destroy the environment it created. If the dev wants to keep it active or reopen, just do normal git updates to the PR…

Nyefan · 2 years ago

I have found with individual dev environments that they cause many issues with outdated service versions. If you are going this route, I would use ScheduledScaler to shut down dev resources after hours.
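
As a rough illustration of the after-hours shutdown, the decision itself is small. This is only a sketch of the logic, not the ScheduledScaler API: `desired_replicas` and the 08:00 to 19:00 weekday window are made-up names and values, and something else would still have to apply the result to the workloads.

```python
from datetime import datetime, time


def desired_replicas(now: datetime, normal: int,
                     start: time = time(8, 0), end: time = time(19, 0)) -> int:
    """Return `normal` during weekday working hours and 0 otherwise."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = start <= now.time() < end
    return normal if is_weekday and in_hours else 0
```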

I have found much more success with PR deployments - every PR gets deployed and wired up to a PR environment which has a full copy of dev and a copy of each PR which runs for 8 hours after the PR is built (I switch out the deploy for a job). Not every service maps well to this model, but it’s a good 95% solution in my experience.

epchris · 2 years ago

Yeah I agree, but for us this would mean like 30 containers. We’ve tried several times to have some kind of flexible setup where devs could choose which parts to run, but that got complicated with all the various permutations of the containers and devs needing different setups in different situations.

It just became a lot to try and manage/support

Sheldan · 2 years ago

You can look into tilt for local deployment and potentially into some kind of cron job that deletes obsolete namespaces.
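
For the cron job part, the selection logic could look roughly like the sketch below. It assumes the job is fed the output of `kubectl get namespaces -o json` and that dev namespaces carry a marker label; the `dev-env` label key and the `obsolete_namespaces` helper are hypothetical, and the actual deletion is deliberately left to the caller.

```python
import json
from datetime import datetime, timedelta, timezone


def obsolete_namespaces(namespace_list_json: str, now: datetime,
                        max_age: timedelta, label_key: str = "dev-env"):
    """Pick namespaces carrying `label_key` that are older than `max_age`."""
    stale = []
    for item in json.loads(namespace_list_json)["items"]:
        meta = item["metadata"]
        # Only consider namespaces explicitly marked as dev environments.
        if label_key not in (meta.get("labels") or {}):
            continue
        # creationTimestamp is serialized as RFC 3339 in UTC, e.g. 2024-01-01T09:00:00Z.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(meta["name"])
    return stale
```

Each returned name could then be passed to `kubectl delete namespace <name>`. Filtering on a label rather than on a name pattern keeps the job from ever touching system or shared namespaces.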

darkmugglet@lemm.ee · 2 years ago

If the complete deployment will run on a high end laptop, I would suggest it’s cheaper and easier to make the devs use local development on kind and Docker Desktop. The licenses for Docker will pay for themselves, and you will be able to control costs. Also, you will incentivize devs to optimize the stack to run in local dev. For the Mac users, it means a $4k laptop.

As a developer, I really like the local dev experience vs deploying everything. On my Mac M2 Ultra, it’s more than fast enough for my env.

Sheldan · 2 years ago

One could also try Rancher Desktop - no need for the licenses there then.

darkmugglet@lemm.ee · 2 years ago

Honestly didn’t know this existed, thanks for the tip.

Also, there is limactl, which works for 99% of the cases.

Truthfully I only use Docker Desktop because Corp pays for it and it’s the supported environment.

Ducks@ducks.dev · edit-2 2 years ago

We use monitoring to check the state of our applications to see if they’re idle. For example, one of our healthchecks checks whether data was placed into the database within a period of time: we know the database is idle if no data has been written within the last month, for example. We use Prometheus/Alertmanager to let us know in Slack about idle resources.

Also we make use of HPA to scale resources. It takes a lot of time to find optimal settings for HPA in my experience. Each product team has a namespace in the cluster. These clusters have like 200 nodes.

We have hundreds of installations of our application across ~8 data centers. We use both Ansible and Golang operators for managing our many resources.