# 💻|contributing
p
Is it possible 0.13 is marking builds as done before they are actually complete?
```
ℹ api                       → Building version v-4a71816b6b...
✔ api                       → Done (took 0 sec)
ℹ api                       → File sync to cluster complete
ℹ api                       → Building image api:v-4a71816b6b...
```
The API also receives an event marking it as complete while the image is still building
a
I see. Could well be a bug.
p
I think it might be related to `--force-build --force`. It does the build but also emits events that state that it's already done.
a
That seems plausible
s
I can have a look at this one if you haven’t already, @alert-helicopter-61082
a
That would be much appreciated
s
Had a quick look at this, and it seems that the problem was partly a matter of logging: https://github.com/garden-io/garden/pull/3995

That said, I noticed a bit of awkwardness in the action states when forcing actions: when force-building e.g. a Build that's already cached, we emit `buildStatus` events with the following action states: `getting-status -> cached -> processing -> ready`. The idea was for the `cached` status to be distinguishable from the `ready` status, but it does make for an awkward intermediate state in Cloud (after the `cached` event is received but before the `processing` event is received). That is, there's no way for Cloud to know whether the action is complete (and show e.g. a green colour), which is the case for cache hits when not forcing, or whether a `processing` event is expected (which is the case when forcing). We do need to perform the status check when forcing, since the "process action" handlers receive a status value as a param. Maybe a good compromise would be to simply not emit status events for the status check action (e.g. "get build status") when the action is being forced?
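For illustration, here is a minimal sketch of the awkward sequence described above. The payload shape and field names are assumptions made for this example only, not the actual Garden event schema:

```typescript
// Hypothetical shape of the buildStatus events emitted when force-building
// an already-cached Build (field names are illustrative, not the real schema).
type ActionState = "getting-status" | "cached" | "processing" | "ready" | "failed"

interface BuildStatusEvent {
  actionName: string
  state: ActionState
  completedAt?: string // ISO timestamp; only present on states that end a phase
}

// The awkward sequence for a forced build of a cached action:
const forcedBuildSequence: BuildStatusEvent[] = [
  { actionName: "api", state: "getting-status" },
  // After this event, Cloud cannot tell whether the action is done (cache hit,
  // no force) or whether a "processing" event is still coming (forced rebuild).
  { actionName: "api", state: "cached", completedAt: "2023-05-01T12:00:00Z" },
  { actionName: "api", state: "processing" },
  { actionName: "api", state: "ready", completedAt: "2023-05-01T12:00:20Z" },
]
```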
Or maybe the metadata about whether the action is being force-run could be read/inferred from elsewhere. Feels a bit weird to skip the "getting status" events.
Maybe we should include more metadata on the status events, e.g. a `force: boolean`, and whether a Deploy is being deployed with sync or local mode.
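Sketched out, that extra metadata might look something like this (hypothetical field names, not an agreed-upon schema):

```typescript
// Hypothetical extension of the status event payload with the metadata
// proposed above (names are illustrative only).
interface StatusEventMetadata {
  // Whether the action is being force-run, so Cloud knows a "processing"
  // event will follow even after a "cached" state.
  force: boolean
  // For Deploy actions: how the deploy is being run.
  mode?: "default" | "sync" | "local"
}
```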
p
We also saw a similar but different issue where, if I recall correctly, we got a `completedAt` on an event that was in the `outdated` state, making Cloud think that the task is completed, only to then get further events that bring it back into the pending state. We probably shouldn't mark anything as completed if the process still continues on afterwards.
s
Yeah. The routers now include a `completedAt` timestamp for status events emitted during the "get status" methods; the idea there was to enable Cloud to calculate how long the status checks are taking. The `completedAt` timestamp means "this status check was completed at" or "this execution was completed at", and whether it was a status check or an execution is clear from the state: status checks emit an `outdated` or `cached` final state, whereas executions emit a `ready` or `failed` final state. I.e. we want to track both the durations of the status checks and of the actual builds/deploys/runs/tests that happen afterwards. The action state in the event should be enough to disambiguate between the two.
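To make that concrete, a consumer could in principle disambiguate the two meanings of `completedAt` from the state alone. A minimal sketch, with illustrative type and field names rather than the actual event schema:

```typescript
// Sketch of how a consumer (e.g. Cloud) could tell the two meanings of
// completedAt apart using only the action state on the event.
type ActionState = "getting-status" | "cached" | "outdated" | "processing" | "ready" | "failed"

interface StatusEvent {
  state: ActionState
  completedAt?: string
}

// Status checks end in "cached" or "outdated"...
function isStatusCheckResult(event: StatusEvent): boolean {
  return event.state === "cached" || event.state === "outdated"
}

// ...whereas executions end in "ready" or "failed".
function isExecutionResult(event: StatusEvent): boolean {
  return event.state === "ready" || event.state === "failed"
}
```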
p
I think it's not entirely ideal that part of the logic of figuring out completion etc. needs to be repeated on the API side, and that the same property may mean either that a status check was completed or that an execution was. In the end we need to bring status checks and the actual execution into the same sort of task result to store and show the current status, so they are effectively parts of the same process. Now the process can have two `completedAt` states, depending on which part it's in right now. Also, as you mentioned above, we can emit `cached` and still go into `processing`, so it gets even more confusing. Maybe we can just not emit it as you suggested, or, if Garden knows that it's still doing more work afterwards, it just doesn't include a `completedAt`, or we add some extra property that tells us that the process is fully terminated with a success or error state.
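As a sketch, the "fully terminated" idea could be expressed with an extra field like the hypothetical `final` flag below (name and shape are made up for illustration):

```typescript
// Hypothetical: mark only truly terminal events as final, so Cloud never has
// to guess whether more events are coming for this action run.
interface StatusEvent {
  state: "getting-status" | "cached" | "outdated" | "processing" | "ready" | "failed"
  completedAt?: string
  // Only true when no further events will be emitted for this action run,
  // i.e. the process ended in a success or error state.
  final?: boolean
}
```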
a
So, the events that are emitted in the graph solver are not exactly intended for emitting directly to Cloud, and I regard them as an internal API that is not guaranteed to be stable. I think we should be much more explicit about the events sent to Cloud, and decouple them better from internal APIs.
I would suggest that we have something in core that listens for solver events (and anything else of relevance) and then emits events with a stable schema that is either outright dictated by Cloud or at least maintained separately in Core.
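A minimal sketch of what such a translation layer could look like, assuming an EventEmitter-style internal bus; all names and shapes here are hypothetical:

```typescript
import { EventEmitter } from "events"

// Sketch of a core-side translation layer: listen to internal solver/router
// events and re-emit them with a stable, Cloud-facing schema. The internal
// event shape is intentionally treated as unstable and unknown here.
interface CloudActionEvent {
  actionKind: "build" | "deploy" | "run" | "test"
  actionName: string
  state: string
  completedAt?: string
}

function wireCloudEvents(internalBus: EventEmitter, cloudBus: EventEmitter) {
  internalBus.on("buildStatus", (internalEvent: any) => {
    // Map whatever core emits today onto the stable schema, so internal
    // refactors don't leak into the Cloud API.
    const event: CloudActionEvent = {
      actionKind: "build",
      actionName: internalEvent.actionName,
      state: internalEvent.state,
      completedAt: internalEvent.completedAt,
    }
    cloudBus.emit("cloudActionEvent", event)
  })
}
```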
p
I think that makes sense so that we can ensure stability even when the internals change for whatever reason. Do you think that's something we should make part of the 0.13 release, or should we go ahead with the current schema and work around the tricky bits for now?
c
I have the feeling we are mixing up semantics here. An action shouldn't be completed just because a graph solver event (status check) is completed. I understand that we will have intermediate steps, but why is `completedAt` set while the action is still executing? Or rather: I understand why that is, but it feels weird to have `completedAt` as a field with multiple meanings.
b
Late to the party, but yes! We've been saying this for a long time. We should be a lot more intentional about what we send to Cloud and ideally have that be authoritative in terms of the API shape, as opposed to just consuming everything and trying to untangle it Cloud-side. Currently we need to maintain pretty complex update logic Cloud-side that IMO should be handled on the client. I think we can and should make those changes in 0.13; it can be part of a minor release for that matter, but it's mostly decoupled from other 0.13 changes.
It can be as simple as making an API call when we start an action and another when we complete it. That kind of mechanism can also be used more generally as hooks for other tools to plug into. But we can of course also batch or stream things to reduce the number of requests.
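As a rough sketch of the "call on start, call on complete" idea with batching, assuming a hypothetical HTTP endpoint and payload shape (nothing here is an existing Garden or Cloud API):

```typescript
// Illustrative only: buffer start/complete events and flush them in batches,
// so we don't pay one network request per event.
interface ActionLifecycleEvent {
  actionName: string
  phase: "started" | "completed"
  timestamp: string
}

class EventBatcher {
  private buffer: ActionLifecycleEvent[] = []

  constructor(private endpoint: string, flushIntervalMs = 1000) {
    setInterval(() => this.flush(), flushIntervalMs)
  }

  push(event: ActionLifecycleEvent) {
    this.buffer.push(event)
  }

  private async flush() {
    if (this.buffer.length === 0) return
    const batch = this.buffer
    this.buffer = []
    // Send the whole batch in one request (assumes a fetch-capable runtime,
    // e.g. Node 18+).
    await fetch(this.endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ events: batch }),
    })
  }
}
```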
s
> It can be as simple as making an API call when we start an action and another when we complete it.

This would definitely simplify things, but it would also add a network call before every status check and every execution. Maybe that's acceptable, maybe not. Other than that, I think we're already being quite intentional about what we're sending to Cloud (the status events, which were specifically designed to make things simple for the API, plus the log events), but if we can agree on a concrete proposal that would simplify things, I'm all ears.
Also, the status events are independent of the solver itself; they're emitted by the build/deploy/run/test routers.
b
> but would also add a network call before every status check and every execution

I also suggested we batch or stream them. In fact the current buffered event stream logic works just fine. My point is rather that we don't send internal stuff that's subject to change, to Jón's point. But let's just get this working now and revisit. We've wanted to do an "audit" on these events for the longest time. Now is not that time, but we can follow up once things are working.