Unable to Build: Image Authentication Issues garden.io #🌱

clever-policeman-58407

12/04/2024, 10:38 PM

I've been unable to deploy for quite a while, and it seems to be centered around ECR authentication issues. Logs:

ℹ deploy.foobar-api      → Aborting because upstream dependency failed.
✖ build.foobar           → Failed resolving status for Build type=container name=foobar (took 23.36 sec). This is what happened:

────────────────────────────────────────────────────────────────────────────────────────────────────
Unable to query registry for image status: Command "skopeo --command-timeout=30s inspect --raw --authfile ~/.docker/config.json docker://<aws_id>.dkr.ecr.us-west-2.amazonaws.com/garden/foobar:v-710dceec9b" failed: Failed with exit code 1.

Here are the logs until the error occurred:

time="2024-12-04T21:32:21Z" level=fatal msg="Error parsing image name \"docker://<aws_id>.dkr.ecr.us-west-2.amazonaws.com/garden/foobar:v-710dceec9b\": reading manifest v-710dceec9b in <aws_id>.dkr.ecr.us-west-2.amazonaws.com/garden/foobar: authentication required"

I have tried:

- Variations of restarting the daemon and accompanying services: `garden util sync daemon stop`, etc.
- Upgrading / fresh reinstallation of the local machine's Garden CLI
- Forced deletion and recreation of ECR secrets / configuration
- Reinstallation of local Docker

I can say with 100% certainty that the Docker CLI and its authentication are working as expected: I can pull and push images to the same registry using the same credentials without fail if I run the commands directly. Garden, however, is not able to do the same and consistently returns the above error.

In my testing, the only thing that seems to fix this issue is completely deleting the cluster that has this error. However, once the cluster is recreated, the error returns after some unknown amount of time and is only fixed by deletion.

I can also confirm that the skopeo command it fails on returns the same error when run directly outside of Garden. Potentially an upstream bug? Is this a known problem?

brief-restaurant-63679

12/05/2024, 1:40 PM

Hi @clever-policeman-58407 👋 At a glance I don't see what the issue is, but we only use Skopeo if the build mode is `kaniko`. Would switching to `cluster-buildkit` work for your use case? It's configured in the project-level `kubernetes` provider config like so:

kind: Project
name: my-project
#...
providers:
  - name: kubernetes
    buildMode: cluster-buildkit # <--- Set this value

With `cluster-buildkit` there's a single BuildKit Deployment for each namespace, whereas with `kaniko` Garden creates a Pod for each build. We've started recommending the former because we've seen a few stability issues with `kaniko` in the past. Let us know if that helps or if you continue having auth issues. That would at least narrow it down.

clever-policeman-58407

12/08/2024, 11:26 PM

That is... interesting. We actually do use `cluster-buildkit`, so I wonder if the actual problem is that it's trying to run Kaniko commands in a buildkit environment.

brief-restaurant-63679

12/09/2024, 7:45 AM

Hmm ok, in that case I'm probably missing something. I'll take a closer look on my end

clever-policeman-58407

12/16/2024, 7:34 PM

Hi there, just for clarification: `skopeo` commands shouldn't be called at all if you're not using `kaniko`?

little-army-47606

12/18/2024, 10:26 PM

Is there a way to output the perceived `buildMode`? If so, you could verify which build mode Garden "believes" it is negotiating.
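
One way to check which build mode Garden resolves is to print the resolved project configuration and look for `buildMode`. This is a minimal sketch, assuming that `garden get config` includes the resolved `kubernetes` provider settings for the selected environment; the helper name `show_build_mode` is made up for illustration.

```shell
# Hypothetical helper: print the buildMode line(s) from Garden's resolved config.
# Assumption: `garden get config` output contains the resolved provider settings.
show_build_mode() {
  env_name="$1"
  garden get config --env "$env_name" | grep 'buildMode'
}
```

Calling `show_build_mode remote` should print the `buildMode` line for the `remote` environment used elsewhere in this thread, provided the field appears in the output.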

freezing-pharmacist-34446

01/09/2025, 1:21 PM

I took a closer look and we also execute skopeo in `cluster-buildkit`, in the `util` sidecar container, to check the build status of an image. The `imagePullSecret` is mounted at `/home/user/.docker` in the util sidecar container. Could you shell into that container and take a look at the contents of this file, @clever-policeman-58407?
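
For anyone following along, reading that file could look roughly like the sketch below. It assumes the BuildKit Deployment is named `garden-buildkit` and that the mounted file is `config.json`; the function name and the namespace argument are hypothetical and need to match your own environment.

```shell
# Hypothetical helper: print the Docker config mounted into the util sidecar.
# Assumptions: Deployment name "garden-buildkit", file name "config.json".
show_util_docker_config() {
  namespace="$1"   # the namespace Garden deploys into
  kubectl exec --namespace "$namespace" deploy/garden-buildkit \
    --container util -- cat /home/user/.docker/config.json
}
```

Usage would be `show_util_docker_config <your-namespace>`.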

clever-policeman-58407

01/10/2025, 6:49 PM

{"experimental":"enabled","auths":{},"credHelpers":{"<account_id>.dkr.ecr.us-west-2.amazonaws.com":"ecr-login"}}

clever-policeman-58407

01/10/2025, 6:49 PM

As I understand it, the experimental bit is related to file-syncing-- is that the issue?

freezing-pharmacist-34446

01/13/2025, 9:55 AM

Are you using [IRSA](https://docs.garden.io/kubernetes-plugins/guides/in-cluster-building#using-in-cluster-building-with-irsa-iam-roles-for-service-accounts) to authenticate and authorize to ECR or are you [attaching a policy to the nodegroup of the EKS cluster](https://docs.garden.io/kubernetes-plugins/guides/in-cluster-building#configuring-access) ? If the latter, is there any chance not all nodes have this IAM policy attached e.g. spot instances, different node groups etc?

freezing-pharmacist-34446

01/13/2025, 9:58 AM

experimental should not be a problem, this is always enabled

clever-policeman-58407

01/14/2025, 12:27 AM

We assign all our nodes a PowerUser ECR role at the InstanceProfile level, meaning all nodes have full access to ECR from the second they come up. We've been doing a lot of troubleshooting on our end and have mostly ended up running in circles... what we know first and foremost is that the usage of Skopeo is a blocker for our builds even if we can't really explain why or how. Previously, we would see Garden builds fail when running a skopeo command and then attempt to run the same command directly from our local terminals. That command would also fail when run locally. Currently, we're observing that Garden builds will fail and then sometimes succeed when run from our local terminals. We've changed nothing at any level of the stack (no AWS permissions changes, no Garden configuration changes, no Kubernetes YAML changes, no CLI version changes) which makes this even more inexplicable.

clever-policeman-58407

01/14/2025, 12:28 AM

For reference, the versions in question:

garden version: 0.13.47

skopeo version 1.16.1

clever-policeman-58407

01/14/2025, 12:36 AM

Just to get an idea of whether or not our dev configuration might be the issue: is it considered an atypical use case to use Garden along with the [ECR Credentials Helper](https://github.com/docker/docker-credential-helpers)?

clever-policeman-58407

01/14/2025, 12:37 AM

We were using this method for more than a year without issue, but it never hurts to verify.

freezing-pharmacist-34446

01/14/2025, 1:14 PM

Okay thanks, I just needed to make sure about the AWS stuff, but it definitely doesn't look like the culprit. Sorry to hear you are running in circles with this one, and I definitely want to help resolve this issue for you. I just tried to reproduce the issue with EKS, the ECR credentials helper and cluster-buildkit, but it worked on my end. Using ecr-credentials-helper is not atypical at all; we have been using this method for a long time ourselves. You mentioned it only starts happening after a while?

little-army-47606

01/23/2025, 8:42 PM

@freezing-pharmacist-34446 it appears to be consistent, but there was a day where, after weeks of not being able to build with `garden deploy --env remote`, I was suddenly able to. It was after a return from holiday, and it only successfully built for a short while that day. @clever-policeman-58407 and I are on the same team, if that's not already clear.

little-army-47606

01/23/2025, 8:42 PM

I am able to run the `skopeo` commands from my host, but the Garden builds fail to.

little-army-47606

01/23/2025, 8:50 PM

On my host (Mac), my `~/.docker/config.json` reads:

{
  "auths": {},
  "credsStore": "ecr-login",
  "credHelpers": {
    "<account_id>.dkr.ecr.us-west-2.amazonaws.com": "ecr-login"
  },
  "currentContext": "desktop-linux",
  "plugins": {
    "-x-cli-hints": {
      "enabled": "true"
    }
  }
}

little-army-47606

01/23/2025, 8:57 PM

From /garden-buildkit- (container: util), the config.json is:

{"experimental":"enabled","auths":{},"credHelpers":{"<account_id>.dkr.ecr.us-west-2.amazonaws.com":"ecr-login"}}
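
Both files leave `auths` empty and map the registry to `ecr-login`, so credentials are not read from the file but fetched at run time by a `docker-credential-ecr-login` binary. A minimal sketch for testing that helper on its own, for example from a shell inside the `util` container, follows. It assumes the helper is on the `PATH` there, and the function name `check_ecr_helper` is hypothetical. It relies on the standard Docker credential-helper protocol, in which `get` reads the registry host from stdin.

```shell
# Hypothetical helper: check whether the ECR credential helper can return
# credentials for a registry, independently of skopeo and Garden.
check_ecr_helper() {
  registry="$1"   # e.g. <account_id>.dkr.ecr.us-west-2.amazonaws.com
  if ! command -v docker-credential-ecr-login >/dev/null 2>&1; then
    echo "helper-missing"
    return 2
  fi
  # `get` prints JSON credentials on success; discard them, keep only the status.
  if printf '%s' "$registry" | docker-credential-ecr-login get >/dev/null 2>&1; then
    echo "helper-ok"
  else
    echo "helper-failed"
    return 1
  fi
}
```

A `helper-failed` result would point at the AWS credentials visible to the container rather than at skopeo itself.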

little-army-47606

01/23/2025, 9:22 PM

Node is installed locally at version `23.4.0`.

little-army-47606

01/23/2025, 9:32 PM

Ultimately, we see this on all our services when attempting deployments with `garden deploy <repo_name> --env remote --force-refresh`.

little-army-47606

01/23/2025, 9:32 PM

https://cdn.discordapp.com/attachments/1313997959121211463/1332100629484535828/message.txt?ex=679406e3&is=6792b563&hm=406d4afa18b2b3d25e7c2e488e70a10454db46d9337fab672c43540d968c818f&

big-spring-14945

01/28/2025, 1:23 PM

Hi everyone, this is definitely not a full solution to the problem, but I created a PR that assumes the image needs to be rebuilt if the skopeo command fails. At least this can help avoid the error until we understand exactly what's going on. https://github.com/garden-io/garden/pull/6810

big-spring-14945

01/28/2025, 1:24 PM

I'll notify you once this has been merged
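
The workaround described above amounts to treating a failed registry query as "not built yet" instead of aborting. A rough shell sketch of that idea, reusing the skopeo invocation from the error log, is shown here. It is an illustration only, not the code from the PR, and `image_needs_build` is a hypothetical name.

```shell
# Hypothetical helper illustrating the fallback idea.
# Returns 0 (true) when the image should be (re)built, 1 when it already exists.
image_needs_build() {
  image_ref="$1"   # e.g. <registry>/garden/foobar:<tag>
  authfile="$2"    # path to a Docker-style config.json
  if skopeo --command-timeout=30s inspect --raw \
      --authfile "$authfile" "docker://$image_ref" >/dev/null 2>&1; then
    return 1  # manifest found: the image is present in the registry
  fi
  return 0    # not found OR the query itself failed: assume a build is needed
}
```

The trade-off in this sketch is that an authentication failure is no longer distinguishable from a missing image at this step, so the underlying cause still needs separate investigation.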

little-army-47606

01/29/2025, 6:29 PM

Thanks, team!

freezing-pharmacist-34446

02/06/2025, 12:39 PM

Hi @little-army-47606, the PR mentioned above has been merged and is available in the edge release. Can you upgrade to edge-bonsai and see if that resolves your issue?
