Garden commands become slow, get stuck, or error out when run in parallel

Hi again!

I’m trying to see if we can use garden for CI, and I noticed that when there are roughly 6+ garden commands running at the same time, I start to see some/all of these behaviors:

  • Garden spends a lot longer retrieving the provider status, setting up dependencies, retrieving test results after they finish, and cleaning up (basically every step becomes slower). Sometimes commands get completely stuck (yesterday I left some jobs running to see if they would recover on their own, and one of them ran for 8+ hours until I cancelled it)
  • Garden errors out saying it can’t create a namespace or that it timed out waiting for a dependency to come up, even though the namespace/dependency actually gets created just fine
  • For one of my tests that requires a lot of dependencies (we test connections to many test DBs), garden would exit with code 0 during the setup phase.

My setup is:

  • Buildkite for CI, with build agents on an Azure Kubernetes Service (AKS) cluster. I use the same cluster for the garden tests. The build agents (which run the garden commands) run as Pods with 1 CPU / 500Mi of memory. I started with 500m CPU and increased it to 1 to see if more CPU would help with this issue, but it doesn’t look like it did. Also, from kube metrics it looks like the agents aren’t using much of that at all.
  • I have ~6 garden tests defined for the frontend and backend (some of which are just lint checks), plus e2e tests which require some dependencies to be set up / tasks to run before the tests start. For each garden test, I use a separate agent to run a garden command that triggers just that test. This is so that our engineers can easily see which parts of the pipeline are finished, what’s failing, collect artifacts, etc. So for the frontend, I’d have 6 different build agents running 6 garden commands for the different test suites/lint checks.

A few things I’ve tried:

  • Trigger the same garden commands locally against the same cluster. I still see some slowdowns, but it’s definitely way better than when they’re running on the CI agents. E.g. I can’t run 3 sets of my e2e tests (which have 3 service dependencies and 2 tasks to run before the tests start) in CI without some getting completely stuck, but locally I can run at least 6 sets concurrently and they only become a little bit slower. Having tested a bit more, it seems that locally 4 out of the 6 concurrent runs struggle to retrieve the finished status of the tests and get stuck, but at least they could run all the setup steps and start the tests just fine… in CI they get stuck way earlier
  • Increase the resource limits of the build agent pods. This didn’t seem to help, but maybe I should try increasing them even more
  • See if the slowdown only occurs when there are dependencies to be set up, or for average single-container tests too. It definitely affects even the simplest cases when there are 6 of them running at the same time.

We’ve only been trying garden for a week, but we’re really loving it so far. We’d love to use it in CI and also explore the possibility of using it for our dev envs or even deploys, but this problem is a hard blocker for us. We’d really appreciate some help on this :pray: Thanks in advance!

Hi Anna!

This sounds like you could be running into the Kubernetes API server throttling your API requests. That would explain why adding more resources to the build agent pods doesn’t seem to help, and it fits the profile of the slowdowns you described.

I suspect that executing the pipeline using fewer Garden commands would fix this, since that way we can make better use of the in-process graph cache.

We’d be happy to hop on a call to look at your setup together to dig into the details a bit and see what optimizations we can come up with.

Cheers,
Thor

Hmm, that makes sense. I spun up an EKS cluster to run the same tests to compare, and I could run 13 sets of the same tests fired off simultaneously without a problem lol. I should have a chat with AKS support. But I’d also love to chat with you guys about our setup! After all, it’s unlikely that we’ll use EKS just for this one thing. I actually submitted the contact form on the garden website on Friday, so just let me know when’s a good time for you guys :slight_smile:


Update for people using AKS:

After talking to support, we made 2 changes that seem to have reduced the slowness (though it’s still not as good as on EKS):

  1. Use the “Paid” SKU tier on the cluster (it defaults to “Free”): https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster#sku_tier (see the Terraform sketch after this list)

  2. There were SNAT connection errors when I was testing, so I added a NAT gateway for the subnet that the cluster is using (also included in the sketch below). The NAT gateway is, supposedly, “highly extensible, reliable, and doesn’t have the same concerns of SNAT port exhaustion”.
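
In case it’s useful to anyone else on AKS, here’s roughly what those two changes look like in Terraform with the azurerm provider. This is just a sketch, not our actual config: the resource names, resource group/subnet references, and node pool sizing are placeholders, and the only parts that really matter here are sku_tier = "Paid" and the NAT gateway association on the cluster’s subnet.

  # Sketch only: resource names, resource group, subnet references and sizing are placeholders.
  resource "azurerm_kubernetes_cluster" "ci" {
    name                = "garden-ci"
    location            = azurerm_resource_group.ci.location
    resource_group_name = azurerm_resource_group.ci.name
    dns_prefix          = "garden-ci"

    # Change 1: paid control-plane tier instead of the default "Free"
    sku_tier = "Paid"

    default_node_pool {
      name           = "default"
      node_count     = 3
      vm_size        = "Standard_D4s_v3"
      vnet_subnet_id = azurerm_subnet.aks.id   # the subnet the nodes run in
    }

    network_profile {
      network_plugin = "azure"
    }

    identity {
      type = "SystemAssigned"
    }
  }

  # Change 2: NAT gateway on the cluster's subnet, to avoid SNAT port exhaustion
  resource "azurerm_public_ip" "nat" {
    name                = "garden-ci-nat-ip"
    location            = azurerm_resource_group.ci.location
    resource_group_name = azurerm_resource_group.ci.name
    allocation_method   = "Static"
    sku                 = "Standard"
  }

  resource "azurerm_nat_gateway" "ci" {
    name                = "garden-ci-nat"
    location            = azurerm_resource_group.ci.location
    resource_group_name = azurerm_resource_group.ci.name
    sku_name            = "Standard"
  }

  resource "azurerm_nat_gateway_public_ip_association" "ci" {
    nat_gateway_id       = azurerm_nat_gateway.ci.id
    public_ip_address_id = azurerm_public_ip.nat.id
  }

  resource "azurerm_subnet_nat_gateway_association" "ci" {
    subnet_id      = azurerm_subnet.aks.id
    nat_gateway_id = azurerm_nat_gateway.ci.id
  }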

There actually wasn’t anything indicating that the k8s API server was throttling us (according to Azure support).