Unable to Build: Image Authentication Issues
# 🌱|help-and-getting-started
c
I've been unable to deploy for quite a while, and it seems to be centered around ECR authentication issues. Logs:
ℹ deploy.foobar-api      → Aborting because upstream dependency failed.
✖ build.foobar           → Failed resolving status for Build type=container name=foobar (took 23.36 sec). This is what happened:

────────────────────────────────────────────────────────────────────────────────────────────────────
Unable to query registry for image status: Command "skopeo --command-timeout=30s inspect --raw --authfile ~/.docker/config.json docker://<aws_id>.dkr.ecr.us-west-2.amazonaws.com/garden/foobar:v-710dceec9b" failed: Failed with exit code 1.

Here are the logs until the error occurred:

time="2024-12-04T21:32:21Z" level=fatal msg="Error parsing image name \"docker://<aws_id>.dkr.ecr.us-west-2.amazonaws.com/garden/foobar:v-710dceec9b\": reading manifest v-710dceec9b in <aws_id>.dkr.ecr.us-west-2.amazonaws.com/garden/foobar: authentication required"
I have tried:
- Variations of restarting the daemon and accompanying services (`garden util sync daemon stop`, etc.)
- Upgrading and freshly reinstalling the Garden CLI on my local machine
- Forced deletion and recreation of ECR secrets / configuration
- Reinstallation of local Docker

I can say with 100% certainty that the Docker CLI and its authentication are working as expected: I can pull and push images to the same registry with the same credentials without fail if I run the commands directly. Garden, however, is not able to do the same and consistently returns the above error.

In my testing, the only thing that seems to fix this issue is completely deleting the cluster that has this error. However, once recreated, the error returns after some unknown amount of time and is only fixed by deletion. I can also confirm that the skopeo command it fails on returns the same error when run directly outside of Garden. Potentially an upstream bug? Is this a known problem?
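For anyone who wants to reproduce the failing check outside Garden, here is a minimal Python sketch that rebuilds the `skopeo inspect` call from the log above and classifies its stderr. The helper names (`build_inspect_cmd`, `is_auth_error`, `inspect_image`) are hypothetical, and matching on the phrase "authentication required" is an assumption based only on the log line in this thread.

```python
import os
import subprocess


def build_inspect_cmd(image_ref, authfile="~/.docker/config.json"):
    """Return the argv for the same skopeo call that appears in the Garden log."""
    return [
        "skopeo", "--command-timeout=30s", "inspect", "--raw",
        "--authfile", os.path.expanduser(authfile),
        f"docker://{image_ref}",
    ]


def is_auth_error(stderr):
    """Assumption: auth failures contain the phrase seen in the log above."""
    return "authentication required" in stderr


def inspect_image(image_ref):
    """Run skopeo and report 'ok', 'auth-error', or 'other-error'."""
    proc = subprocess.run(
        build_inspect_cmd(image_ref), capture_output=True, text=True
    )
    if proc.returncode == 0:
        return "ok"
    return "auth-error" if is_auth_error(proc.stderr) else "other-error"
```

Calling `inspect_image("<aws_id>.dkr.ecr.us-west-2.amazonaws.com/garden/foobar:v-710dceec9b")` on the host and again inside the cluster would show whether the two environments disagree.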
b
Hi @clever-policeman-58407 👋 At a glance I don't see what the issue is, but we only use Skopeo if the build mode is `kaniko`. Would switching to `cluster-buildkit` work for your use case? It's configured in the project-level `kubernetes` provider config like so:
kind: Project
name: my-project
#...
providers:
  - name: kubernetes
    buildMode: cluster-buildkit # <--- Set this value
With `cluster-buildkit` there's a single BuildKit Deployment for each namespace, whereas with `kaniko` Garden creates a Pod for each build. We've started recommending the former because we've seen a few stability issues with `kaniko` in the past. Let us know if that helps or if you continue having auth issues. That would at least narrow it down.
c
That is...interesting. We actually do use `cluster-buildkit`, so I wonder if the actual problem is that it's trying to run Kaniko commands in a BuildKit environment.
b
Hmm ok, in that case I'm probably missing something. I'll take a closer look on my end
c
Hi there, just for clarification: `skopeo` commands shouldn't be called at all if you're not using `kaniko`?
l
Is there a way to output the perceived `buildMode`? If so, you could verify which build mode Garden "believes" it is negotiating.
f
I took a closer look, and we also execute skopeo in `cluster-buildkit` mode, in the `util` sidecar container, to check the build status of an image. The `imagePullSecret` is mounted at `/home/user/.docker` in the util sidecar container. Could you shell into that container and take a look at the contents of this file, @clever-policeman-58407?
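One way to read that file without an interactive shell is sketched below in Python. The Deployment name `garden-buildkit` and the file name `config.json` are assumptions inferred from this thread, the namespace is a placeholder, and both function names are hypothetical.

```python
import json
import subprocess


def util_config_cmd(namespace, deployment="garden-buildkit", container="util"):
    """Build a kubectl argv that prints the Docker config from the util sidecar.

    Assumptions: the Deployment is called garden-buildkit and the mounted
    file is /home/user/.docker/config.json.
    """
    return [
        "kubectl", "exec", "-n", namespace, f"deploy/{deployment}",
        "-c", container, "--", "cat", "/home/user/.docker/config.json",
    ]


def read_util_docker_config(namespace):
    """Run the command and parse the output as JSON (requires cluster access)."""
    result = subprocess.run(
        util_config_cmd(namespace), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```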
c
{"experimental":"enabled","auths":{},"credHelpers":{"<account_id>.dkr.ecr.us-west-2.amazonaws.com":"ecr-login"}}
As I understand it, the experimental bit is related to file-syncing-- is that the issue?
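To reason about what a config like the one above asks the client to do, here is a small Python sketch. It assumes the lookup order Docker documents (a per-registry `credHelpers` entry first, then `credsStore`, then static `auths`) and that skopeo resolves credentials in a similar way; the function names are hypothetical.

```python
import shutil


def credential_source(config, registry):
    """Return where credentials for `registry` would come from, or None.

    Simplification: an `auths` entry counts as present even if it is empty.
    """
    helpers = config.get("credHelpers", {})
    if registry in helpers:
        return f"docker-credential-{helpers[registry]}"
    if config.get("credsStore"):
        return f"docker-credential-{config['credsStore']}"
    if registry in config.get("auths", {}):
        return "auths"
    return None


def helper_on_path(source):
    """True if the named helper binary can be found on PATH."""
    return source not in (None, "auths") and shutil.which(source) is not None
```

For the config pasted above, `credential_source` returns `docker-credential-ecr-login` for the ECR registry, so authentication depends on that helper binary being present in the container and on the AWS credentials available to it there.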
f
Are you using [IRSA](https://docs.garden.io/kubernetes-plugins/guides/in-cluster-building#using-in-cluster-building-with-irsa-iam-roles-for-service-accounts) to authenticate and authorize to ECR, or are you [attaching a policy to the nodegroup of the EKS cluster](https://docs.garden.io/kubernetes-plugins/guides/in-cluster-building#configuring-access)? If the latter, is there any chance that not all nodes have this IAM policy attached, e.g. spot instances or different node groups?
`experimental` should not be a problem; it is always enabled.
c
We assign all our nodes a PowerUser ECR role at the InstanceProfile level, meaning all nodes have full access to ECR from the second they come up. We've been doing a lot of troubleshooting on our end and have mostly ended up running in circles... what we know first and foremost is that the usage of Skopeo is a blocker for our builds even if we can't really explain why or how. Previously, we would see Garden builds fail when running a skopeo command and then attempt to run the same command directly from our local terminals. That command would also fail when run locally. Currently, we're observing that Garden builds will fail and then sometimes succeed when run from our local terminals. We've changed nothing at any level of the stack (no AWS permissions changes, no Garden configuration changes, no Kubernetes YAML changes, no CLI version changes) which makes this even more inexplicable.
For reference, the versions in question:
garden version: 0.13.47
skopeo version 1.16.1
Just to get an idea of whether or not our dev configuration might be the issue, is it considered an atypical usecase to use Garden along with the [ECR Credentials Helper](https://github.com/docker/docker-credential-helpers) ?
We were using this method for more than a year without issue, but it never hurts to verify.
f
Okay, thanks. I just needed to make sure about the AWS setup, but it definitely doesn't look like the culprit. Sorry to hear you are running in circles with this one, and I definitely want to help resolve this issue for you. I just tried to reproduce the issue with EKS, the ECR credentials helper and cluster-buildkit, but it worked on my end. Using the ECR credentials helper is not atypical at all; we have been using this method for a long time ourselves. You mentioned it only starts happening after a while?
l
@freezing-pharmacist-34446 it appears to be consistent, but there was a day where, after weeks of not being able to build with `garden deploy --env remote`, I was suddenly able to. It was after a return from holiday, and it only built successfully for a short while that day. @clever-policeman-58407 and I are on the same team, if that's not already clear.
I am able to run the `skopeo` commands from my host, but the Garden builds fail to.
On my host (Mac), my `~/.docker/config.json` reads:
{
  "auths": {},
  "credsStore": "ecr-login",
  "credHelpers": {
    "<account_id>.dkr.ecr.us-west-2.amazonaws.com": "ecr-login"
  },
  "currentContext": "desktop-linux",
  "plugins": {
    "-x-cli-hints": {
      "enabled": "true"
    }
  }
}
From the /garden-buildkit- container: util, the config.json is:
{"experimental":"enabled","auths":{},"credHelpers":{"<account_id>.dkr.ecr.us-west-2.amazonaws.com":"ecr-login"}}
Node is installed locally at version `23.4.0`.
Ultimately, we see this on all our services when attempting deployments with `garden deploy <repo_name> --env remote --force-refresh`.
b
Hi everyone, this is definitely not a full solution to the problem, but I created a PR that assumes the image needs to be rebuilt if the skopeo command fails. At least this can help avoid the error until we understand exactly what's going on. https://github.com/garden-io/garden/pull/6810
I'll notify you once this has been merged
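The fallback described above can be sketched in a few lines of Python. This only illustrates the idea ("treat a failed status check as 'needs build'") and is not the code from the PR; `image_needs_build` and its `run_inspect` argument are hypothetical names.

```python
import subprocess


def image_needs_build(run_inspect):
    """Return True when the registry status check fails for any reason.

    `run_inspect` is a zero-argument callable that performs the skopeo
    query and raises on a non-zero exit code or a timeout.
    """
    try:
        run_inspect()
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Status unknown: assume the image is missing and rebuild it.
        return True
    return False
```

The trade-off is that a transient auth failure now costs an extra build instead of aborting the deploy, and the underlying authentication problem may still surface when the build pushes to the registry.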
l
Thanks, team!
f
Hi @little-army-47606, the PR mentioned above has been merged and is available in the edge release. Can you upgrade to edge-bonsai and see if that resolves your issue?