I’m trying to see if we can use garden for CI, and I’ve noticed that when roughly 6+ garden commands run at the same time, I start to see some or all of these behaviors:
- Garden spends a lot longer retrieving the provider status, setting up dependencies, retrieving test results after they finish, and cleaning up (basically every step becomes slower). Sometimes commands get completely stuck: yesterday I left some jobs running to see if they could unstick themselves, and one of them ran for 8+ hours until I cancelled it.
- Garden errors out saying it can’t create a namespace, or that it timed out waiting for a dependency to come up, even though the namespace/dependency is actually created just fine.
- For one of my tests that requires a lot of dependencies (we test connections to many test DBs), garden exits with code 0 partway through the setup phase.
My setup is:
- Buildkite for CI, with build agents on an Azure Kubernetes Service (AKS) cluster. I use the same cluster for garden tests. The build agents (which run the garden commands) run as Pods with 1 CPU / 500Mi memory. I started with 500m CPU and increased to 1 CPU to see if more CPU would help with this issue, but it doesn’t look like it did. From kube metrics it also looks like the agents aren’t using much at all.
- I have ~6 garden tests defined for the frontend and backend (some of which are just lint checks), plus e2e tests that require some dependencies to be set up / tasks to run before the tests start. For each garden test, I use a separate agent to run a garden command that triggers just that test, so our engineers can easily see which parts of the pipeline are finishing, what’s failing, collect artifacts, etc. So for the frontend alone, I have 6 different build agents running 6 garden commands for the different test suites/lint checks.
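To make the one-command-per-agent layout concrete, here’s a minimal sketch of what the Buildkite pipeline looks like. The test names (`frontend-unit`, `frontend-lint`, `e2e`) are hypothetical placeholders, and the exact `garden test` invocation depends on your garden version; the point is just that each step runs on its own agent pod and triggers a single test:

```yaml
# Sketch of our Buildkite pipeline (test names are made up for illustration).
# Each step is picked up by a separate agent pod on the AKS cluster,
# so 6 steps means 6 garden commands running concurrently.
steps:
  - label: "frontend unit tests"
    command: garden test frontend-unit
  - label: "frontend lint"
    command: garden test frontend-lint
  - label: "e2e tests"
    command: garden test e2e
```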
A few things I’ve tried:
- Trigger the same garden commands locally against the same cluster. I still see some slowdowns, but it’s definitely much better than when they’re running in CI agents. E.g. I can’t run 3 sets of my e2e tests (which have 3 service dependencies and 2 tasks that run before the tests start) in CI without some getting completely stuck, but locally I can run at least 6 sets concurrently and they only become a little slower. (Update: I tested a bit more, and it seems like locally 4 out of the 6 concurrent runs struggle to retrieve the finished test status and get stuck. But at least they could run all the setup steps and start the tests just fine; in CI they get stuck much earlier.)
- Increase the resource limits of the build agent pods. This didn’t seem to help, but maybe I should try increasing them even more.
- Check whether the slowdown only occurs when there are dependencies to set up, or also for average single-container tests. It definitely affects even the simplest cases when 6 of them are running at the same time.
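For the local-reproduction experiment above, this is roughly the kind of script I use to launch N copies of the same garden command in parallel and compare per-run wall-clock times between local and CI. The default command `garden test e2e` and the test name are assumptions; substitute whatever command your agents actually run:

```shell
#!/usr/bin/env sh
# Launch the same command N times concurrently and time each run.
# Usage: ./repro.sh ["garden test e2e"] [6]
# "garden test e2e" below is an assumed default command, not a fixed name.
CMD=${1:-"garden test e2e"}
N=${2:-6}

i=1
while [ "$i" -le "$N" ]; do
  (
    start=$(date +%s)
    # Each run's output goes to its own log so stuck runs can be inspected.
    sh -c "$CMD" > "run-$i.log" 2>&1
    end=$(date +%s)
    echo "run $i finished in $((end - start))s"
  ) &
  i=$((i + 1))
done
wait  # block until every concurrent run has exited
```

Comparing the reported times between 1, 3, and 6 concurrent runs is how I noticed the slowdown kicks in around 6.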
We’ve only tried garden for a week, but we’re really loving it so far. We’d love to use it in CI and also explore using it for our dev envs or even deploys, but this problem is a hard blocker for us. We’d really appreciate some help with this. Thanks in advance!