# 💻|contributing
p
Is it possible 0.13 is marking builds as done before they are actually complete?
```
ℹ api                       → Building version v-4a71816b6b...
✔ api                       → Done (took 0 sec)
ℹ api                       → File sync to cluster complete
ℹ api                       → Building image api:v-4a71816b6b...
```
The API also receives an event marking it as complete while the image is still building
a
I see. Could well be a bug.
p
I think it might be related to `--force-build --force`. It does the build but also emits events that state that it's already done.
a
That seems plausible
s
I can have a look at this one if you haven’t already, @alert-helicopter-61082
a
That would be much appreciated
s
Had a quick look at this, and it seems that the problem was partly a matter of logging: https://github.com/garden-io/garden/pull/3995

That said, I noticed a bit of awkwardness in the action states when forcing actions: when force-building e.g. a Build that's already cached, we emit `buildStatus` events with the following action states: `getting-status -> cached -> processing -> ready`. The idea was for the `cached` status to be distinguishable from the `ready` status, but it does make for an awkward intermediate state in Cloud (after the `cached` event is received but before the `processing` event is received). That is, there's no way for Cloud to know whether the action is complete (and show e.g. a green colour), which is the case for cache hits when not forcing, or whether a `processing` event is expected (which is the case when forcing). We do need to perform the status check when forcing, since the "process action" handlers receive a status value as a param. Maybe a good compromise would be to simply not emit status events for the status check action (e.g. "get build status") when the action is being forced?
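For illustration, here is a minimal sketch of the awkward sequence described above. The payload shape and field names are assumptions made for this example only, not the actual Garden event schema:

```typescript
// Hypothetical shape of the buildStatus events emitted when force-building
// an already-cached Build (field names are illustrative, not the real schema).
type ActionState = "getting-status" | "cached" | "processing" | "ready" | "failed"

interface BuildStatusEvent {
  actionName: string
  state: ActionState
  completedAt?: string // ISO timestamp; only present on states that end a phase
}

// The awkward sequence for a forced build of a cached action:
const forcedBuildSequence: BuildStatusEvent[] = [
  { actionName: "api", state: "getting-status" },
  // After this event, Cloud cannot tell whether the action is done (cache hit,
  // no force) or whether a "processing" event is still coming (forced rebuild).
  { actionName: "api", state: "cached", completedAt: "2023-05-01T12:00:00Z" },
  { actionName: "api", state: "processing" },
  { actionName: "api", state: "ready", completedAt: "2023-05-01T12:00:20Z" },
]
```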
Or maybe the metadata about whether the action is being force-run could be read/inferred from elsewhere. Feels a bit weird to skip the "getting status" events.
Maybe we should include more metadata on the status events, e.g. a `force: boolean`, and whether a Deploy is being deployed with sync or local mode.
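Sketched out, that extra metadata might look something like this (hypothetical field names, not an agreed-upon schema):

```typescript
// Hypothetical extension of the status event payload with the metadata
// proposed above (names are illustrative only).
interface StatusEventMetadata {
  // Whether the action is being force-run, so Cloud knows a "processing"
  // event will follow even after a "cached" state.
  force: boolean
  // For Deploy actions: how the deploy is being run.
  mode?: "default" | "sync" | "local"
}
```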
p
We also saw a similar but different issue where, if I recall correctly, we got a `completedAt` on an event that was in the `outdated` state, making Cloud think that the task is completed, only to then get further events that bring it back into the pending state. We probably shouldn't mark anything as completed if the process still continues on afterwards.
s
Yeah. The routers now include a `completedAt` timestamp for status events emitted during the "get status" methods; the idea there was to enable Cloud to calculate how long the status checks are taking. The `completedAt` timestamp means "this status check was completed at" or "this execution was completed at", and whether it was a status check or an execution is clear from the state: status checks emit an `outdated` or `cached` final state, whereas executions emit a `ready` or `failed` final state. I.e. we want to track both the durations of the status checks and of the actual builds/deploys/runs/tests that happen afterwards. The action state in the event should be enough to disambiguate between the two.
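To make that concrete, a consumer could in principle disambiguate the two meanings of `completedAt` from the state alone. A minimal sketch, with illustrative type and field names rather than the actual event schema:

```typescript
// Sketch of how a consumer (e.g. Cloud) could tell the two meanings of
// completedAt apart using only the action state on the event.
type ActionState = "getting-status" | "cached" | "outdated" | "processing" | "ready" | "failed"

interface StatusEvent {
  state: ActionState
  completedAt?: string
}

// Status checks end in "cached" or "outdated"...
function isStatusCheckResult(event: StatusEvent): boolean {
  return event.state === "cached" || event.state === "outdated"
}

// ...whereas executions end in "ready" or "failed".
function isExecutionResult(event: StatusEvent): boolean {
  return event.state === "ready" || event.state === "failed"
}
```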
p
I think it's not entirely ideal that part of the logic of figuring out completion etc. needs to be repeated on the API side, and that the same property may mean either that a status check was completed or that an execution was. In the end we need to bring status checks and the actual execution into the same sort of task result to store and show the current status, so they are effectively parts of the same process. Now the process can have two `completedAt` states, depending on which part it's in right now. Also, as you mentioned above, we can emit `cached` and still go into `processing`, so it gets even more confusing. Maybe we can just not emit it as you suggested, or, if Garden knows that it's still doing more work afterwards, it just doesn't include a `completedAt`, or we add some extra property that tells us that the process is fully terminated with a success or error state.
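As a sketch, the "fully terminated" idea could be expressed with an extra field like the hypothetical `final` flag below (name and shape are made up for illustration):

```typescript
// Hypothetical: mark only truly terminal events as final, so Cloud never has
// to guess whether more events are coming for this action run.
interface StatusEvent {
  state: "getting-status" | "cached" | "outdated" | "processing" | "ready" | "failed"
  completedAt?: string
  // Only true when no further events will be emitted for this action run,
  // i.e. the process ended in a success or error state.
  final?: boolean
}
```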
a
So, the events that are emitted in the graph solver are not exactly intended for emitting directly to Cloud, and I regard them as an internal API that is not guaranteed to be stable. I think we should be much more explicit about the events sent to Cloud, and decouple them better from internal APIs.
I would suggest that we have something in core that listens for solver events (and anything else of relevance) and then emits events with a stable schema that is either outright dictated by Cloud or at least maintained separately in Core.
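A minimal sketch of what such a translation layer could look like, assuming an EventEmitter-style internal bus; all names and shapes here are hypothetical:

```typescript
import { EventEmitter } from "events"

// Sketch of a core-side translation layer: listen to internal solver/router
// events and re-emit them with a stable, Cloud-facing schema. The internal
// event shape is intentionally treated as unstable and unknown here.
interface CloudActionEvent {
  actionKind: "build" | "deploy" | "run" | "test"
  actionName: string
  state: string
  completedAt?: string
}

function wireCloudEvents(internalBus: EventEmitter, cloudBus: EventEmitter) {
  internalBus.on("buildStatus", (internalEvent: any) => {
    // Map whatever core emits today onto the stable schema, so internal
    // refactors don't leak into the Cloud API.
    const event: CloudActionEvent = {
      actionKind: "build",
      actionName: internalEvent.actionName,
      state: internalEvent.state,
      completedAt: internalEvent.completedAt,
    }
    cloudBus.emit("cloudActionEvent", event)
  })
}
```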
p
I think that makes sense so that we can ensure stability even when the internals change for whatever reason. Do you think that's something we should make part of the 0.13 release, or should we go ahead with the current schema and work around the tricky bits for now?
c
I have the feeling we are mixing up semantics here. An action shouldn't be completed just because a graph solver event (status check) is completed. I understand that we will have intermediate steps, but why is `completedAt` set while the action is still executing? Or rather: I understand why that is, but it feels weird to have `completedAt` as a field with multiple meanings.
b
Late to the party, but yes! We've been saying this for a long time. We should be a lot more intentional about what we send to Cloud and ideally have that be authoritative in terms of the API shape, as opposed to just consuming everything and trying to untangle it Cloud-side. Currently we need to maintain pretty complex update logic Cloud-side that IMO should be handled on the client. I think we can and should make those changes in 0.13; it can be part of a minor release for that matter, but it's mostly decoupled from other 0.13 changes.
It can be as simple as making an API call when we start an action and another when we complete it. That kind of mechanism can also be used more generally as hooks for other tools to plug into. But we can of course also batch or stream things to reduce the number of requests.
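As a rough sketch of the "call on start, call on complete" idea with batching, assuming a hypothetical HTTP endpoint and payload shape (nothing here is an existing Garden or Cloud API):

```typescript
// Illustrative only: buffer start/complete events and flush them in batches,
// so we don't pay one network request per event.
interface ActionLifecycleEvent {
  actionName: string
  phase: "started" | "completed"
  timestamp: string
}

class EventBatcher {
  private buffer: ActionLifecycleEvent[] = []

  constructor(private endpoint: string, flushIntervalMs = 1000) {
    setInterval(() => this.flush(), flushIntervalMs)
  }

  push(event: ActionLifecycleEvent) {
    this.buffer.push(event)
  }

  private async flush() {
    if (this.buffer.length === 0) return
    const batch = this.buffer
    this.buffer = []
    // Send the whole batch in one request (assumes a fetch-capable runtime,
    // e.g. Node 18+).
    await fetch(this.endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ events: batch }),
    })
  }
}
```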
s
> It can be as simple as making an API call when we start an action and another when we complete it.

This would definitely simplify things, but it would also add a network call before every status check and every execution. Maybe that's acceptable, maybe not. Other than that, I think we're already being quite intentional about what we're sending to Cloud (the status events, which were specifically designed to make things simple for the API, plus the log events), but if we can agree on a concrete proposal that would simplify things, I'm all ears.
Also, the status events are independent of the solver itself; they're emitted by the build/deploy/run/test routers.
b
> but would also add a network call before every status check and every execution

I also suggested we batch or stream them. In fact the current buffered event stream logic works just fine. My point is rather that we don't send internal stuff that's subject to change, to Jón's point. But let's just get this working now and revisit. We've wanted to do an "audit" on these events for the longest time. Now is not that time, but we can follow up once things are working.