https://garden.io logo
lockfile issue status
s

swift-garage-61180

05/09/2023, 1:37 PM
@ancient-diamond-80011 @cold-jordan-68753 Did you get any further with the lockfile thing?
a

ancient-diamond-80011

05/09/2023, 1:37 PM
yes and no 😅
s

swift-garage-61180

05/09/2023, 1:37 PM
Maybe we should rope Jón in on this?
a

ancient-diamond-80011

05/09/2023, 1:39 PM
we have a manual kind-of-repro, created a new helper command with
ts
  async action(params: CommandParams<{}, {}>): Promise<CommandResult> {
    const { garden } = params
    const store = garden.globalConfigStore
    // not awaiting here
    const configPromise = store.set("clientAuthTokens", "throw.example.com", {
      token: "",
      refreshToken: "",
      validity: new Date(),
    })

    throw new Error("oops")
  }
and then adding some
sleep()
to the
set()
method on the configstore, and then manually removing the lock directory with
rmdir
while the command is sleeping, because then in the
finally
block of
cli.ts
we call
delete
for the activeProcesses
but sadly the "repro" requires manually deleting the lock file - we haven't managed to reproduce the
ENOENT
and
ECOMPROMISED
combo with just the sleeps and throws and sets/deletes
need to figure out what can cause the lockfile to be removed before the finally block here runs https://github.com/garden-io/garden/blob/80f5b2f9ad8dbcc53be553c35220efac433b522e/cli/src/cli.ts#L48
also of note: we get the yoga layout error any time anything throws for any reason, so it is not related to the lockfile issue per se
/Users/walther/git/garden/node_modules/yoga-layout-prebuilt/yoga-layout/build/Release/nbind.js:53
        throw ex;
        ^
s

swift-garage-61180

05/09/2023, 2:00 PM
Gotcha. Could you write a more detailed description of the error and everything we know about it, and then tag Jón here?
... just to pre-empt any questions on what we're talking about here
a

ancient-diamond-80011

05/09/2023, 2:03 PM
sadly this is pretty much the extent of what we know :<
- we use the
globalConfigStore
somewhere - somewhere, somehow we get a race condition - something tries to update a lock that has already been deleted, or delete a lock that has already been deleted - very likely this is happening in the
finally
block of
cli.ts
s

swift-garage-61180

05/09/2023, 2:05 PM
Let's have a chat about this later, @alert-helicopter-61082
We need to get this fixed as soon as possible, so let's keep at it until we've gotten to the bottom of it
... sort of can't ship Bonsai with this bug in it
a

ancient-diamond-80011

05/09/2023, 2:07 PM
another possibility we explored was whether we could be hitting the
isOverThreshold
conditions in the
proper-lockfile
library in the
updateLock
method, possibly from e.g. heavy synchronous code overloading the eventloop and getting time skew on the
setTimeout
calls, leading to a difference in the expected time vs real time
but that is most likely not the case, considering in the real-world cases we've seen, we have definitely seen the
ENOENT
as part of the error messages
well, in the worst case, we could also potentially work around this and try to just provide a nicer error message experience. @cold-jordan-68753 opened a draft PR for that https://github.com/garden-io/garden/pull/4211
reproducing this issue has been very hard - unless we figure out a reliable repro, it is probably nearly impossible to get to the bottom of this
also of note: the last commit on this library is from Jan 2021 https://github.com/moxystudio/node-proper-lockfile - it is not actively maintained and hasn't been for a couple of years. sadly we didn't really find other promising library options either
we also took a brief look at the
onExit
handler in the lockfile library, but it's also probably not related - considering we arrive at this error without external signals such as Ctrl-C or similar
so while there are open issues like https://github.com/moxystudio/node-proper-lockfile/issues/111 and pr https://github.com/moxystudio/node-proper-lockfile/pull/112 it is most likely unrelated - i believe the
onExit
handler will not be called in case of a thrown exception, only on signals (but i could be wrong here)
(and besides, in the
onExit
handler any thrown errors are caught and not bubbled up, so we wouldn't see any errors from there)
a

alert-helicopter-61082

05/09/2023, 2:30 PM
I think one clear thing to do is to try/catch around some of the usages of the config stores, where we're sort of fine with a failure. A good example of just that is the cleaning up of the active process record. That plus manually calling the
onExit
handler (if we can) at the end of a CLI execution.
Those would be the first things I'd try at least
c

cold-jordan-68753

05/09/2023, 2:44 PM
yeah, we can add that to the code, there is also an
onCompromised
callback that we can potentially use instead of or in addition to the try/catch.
a

alert-helicopter-61082

05/09/2023, 2:49 PM
Regarding the yoga thing, we may just need to update Ink: https://github.com/vadimdemedes/ink/releases
Ugh crap, it's now pure ESM... I wonder if there's anything we can do more easily to support individual packages that are like that
The docs suggest we can use
await import(...)
for it, which should be fine for our use case I think
a

ancient-diamond-80011

05/09/2023, 2:53 PM
yeah the ESM stuff is becoming less and less avoidable at a fast pace :/
s

swift-garage-61180

05/09/2023, 3:16 PM
The mandatory ESM thing strikes me as a bizarre design decision. What's the value in forcing users to adapt to it? Does it really simplify anything?
Anyway, that's an irrelevant tangent—my bad
a

ancient-diamond-80011

05/09/2023, 3:17 PM
well, regardless of the reasons, it is getting increasingly common in the ecosystem and it will be impossible to avoid when updating a lot of our old packages. we have to make our build process support it eventually (and hopefully fairly soon)
s

swift-garage-61180

05/09/2023, 3:17 PM
Yeah
@ancient-diamond-80011 @cold-jordan-68753 Are we getting closer to a fix here?
a

ancient-diamond-80011

05/10/2023, 8:16 AM
some small steps towards a better direction earlier this morning, and continuing to pair right after this core daily call i'm on right now
current hypotheses - lockfile library manages to throw outside a promise (no reject) - yoga library overrides the uncaught exception handler of node - so in some cases we are unable to catch errors in our cli.ts - we just get properly thrown out of node
s

swift-garage-61180

05/10/2023, 8:36 AM
Ok, gotcha. I think it may be time to recruit more team members to pile on top of this.
Eysi was telling me about a Node package patching library that Haukur recommends
And that this might make things simpler from a compliance standpoint
He thinks it might be this one: https://www.npmjs.com/package/patch-package
... just for when we've come up with the patches we want to apply and need to decide how to apply / maintain them
a

alert-helicopter-61082

05/10/2023, 9:32 AM
We use that already for the pkg library/tool actually
But yeah, I think trying to update the ink library would be a sensible step, along with try/catching some of the config store calls
Would be a good couple of steps to try, and we can see if that moves us along
s

swift-garage-61180

05/10/2023, 10:42 AM
The problem seems to be that we're not able to catch an exception that's thrown when the lockfile is compromised (or at least we haven't found a way to catch it yet).
So right now, we're still figuring out what changes we can make to the lockfile library that can remedy that situation. If we figure that out, we can then decide how to wrap the whole thing in a civilised way.
@ancient-diamond-80011 @cold-jordan-68753 Let me know when you're back from lunch and ready to hop back on a call
a

ancient-diamond-80011

05/10/2023, 10:45 AM
good timing, literally just finished 🍽️
s

swift-garage-61180

05/10/2023, 10:46 AM
Come to think of it, we might need to do something like
process.on("uncaughtException", (...) => ...)
. I have a feeling this is fundamentally because of the async nature of the
setTimeout
that's happening deep in the library.
Be right there
a

ancient-diamond-80011

05/10/2023, 10:47 AM
disagree - we should find the cause for the uncaught exception slipping between our try-catch fingers
s

swift-garage-61180

05/10/2023, 10:50 AM
... yeah, the above is probably a bad idea.
@cold-jordan-68753 @alert-helicopter-61082 Shall we take a look at this together at 13:45?
Veeti and I are getting a bit boiled, going to take a breather
c

cold-jordan-68753

05/10/2023, 11:23 AM
Yeah, sounds good
a

ancient-diamond-80011

05/10/2023, 11:26 AM
current knowledge: the issue is located here https://github.com/moxystudio/node-proper-lockfile/blob/9f8c303c91998e8404a911dc11c54029812bca69/lib/lockfile.js#L114-L127 - we throw within the
setLockAsCompromised
function (and need to do so) - that throw happens within the broader context of a
setTimeout
call on line 109, so it is happening in an "outside" async context, not caught by surrounding try-catches - however! if we monkeypatch the
setTimeout
call to have internal try-catches as well, it still doesn't help - of note: we are also within a callback function definition on the
options.fs.stat(
call (line 114) - it is also possible the execution of this stat call and the callback after it is handled in a separate async context in
graceful-fs
and also ends up outside what we can catch
basically, even this kind of "wrap everything in try catches" and "inside the callback, shove the error in a variable to be thrown later" trickery will not work because our control flow will go past the
if (asyncError)
before the callback completes

https://cdn.discordapp.com/attachments/1105488675841785936/1105819174346969098/Screenshot_2023-05-10_at_14.29.20.png

https://cdn.discordapp.com/attachments/1105488675841785936/1105819174632161300/Screenshot_2023-05-10_at_14.29.37.png

s

swift-garage-61180

05/10/2023, 11:32 AM
I'll send out a calendar invite, do this in a civilised way
Haha
c

cold-jordan-68753

05/10/2023, 11:38 AM
Since the lock is compromised it means we can't guarantee the operations in the critical section, so if we want to be safe we need to fail-stop at that point.
a

ancient-diamond-80011

05/10/2023, 11:42 AM
yes - this was not about returning to execution without throwing, this was about trying to make sure we can catch the exception, and bubble it up into our main context instead of exploding in the "outside context" of an async function
c

cold-jordan-68753

05/10/2023, 11:45 AM
yeah, thanks for clarifying 🙂
a

ancient-diamond-80011

05/10/2023, 11:45 AM
something in the
fs.stat
callback handler (where it calls
apply
and so on) seems to be happening in a separate async context, and we need to bring that back in to our main context
a

alert-helicopter-61082

05/10/2023, 11:59 AM
Ahead of that, it looks to me like the issue is in the promisification of the underlying lock function in the library. Basically we'd ideally need to be able to reject the lock promise in the onCompromised callback, and I see no clear way to do that except to use the underlying callback-based function.
The default onCompromised callback just throws the error, which inevitably crashes the process
Maybe something like this works:
typescript
  private async lock() {
    const path = this.getConfigPath()
    await ensureFile(path)
    return new Promise<() => Promise<void>>((resolve, reject) => {
      lock(path, { retries: 5, stale: 5000, onCompromised: reject }).then(resolve).catch(reject)
    })
  }
s

swift-garage-61180

05/10/2023, 12:13 PM
@ancient-diamond-80011 Could you give that a try?
a

ancient-diamond-80011

05/10/2023, 12:14 PM
thanks, trying now 👍
c

cold-jordan-68753

05/10/2023, 12:14 PM
yes, we are actually trying exactly that 🙂
a

alert-helicopter-61082

05/10/2023, 12:14 PM
I didn't test this at all, but you get the general idea at least
a

ancient-diamond-80011

05/10/2023, 12:14 PM
> inevitably crashes the process yeah - we do need to throw, we just need to do so on our own terms / in our own context (instead of in the separate async microtask / setTimeout / etc)
a

alert-helicopter-61082

05/10/2023, 12:15 PM
Yeah, this default onCompromised callback seems intended to kill the process, since that's inevitably an uncaught exception
I really don't think we need to pry into the underlying library though
That's at least clear then

https://cdn.discordapp.com/attachments/1105488675841785936/1105831596390633472/image.png

That do the trick?
Or at least solve that particular crash?
a

ancient-diamond-80011

05/10/2023, 12:46 PM
yes and no :/ using that sort of a reject removes the crash, but it also makes the code continue past what the lock is meant to protect (it will write the file)
sadly, it might be inevitable that we need to do a stop-the-world crash there, and only write our own
onCompromised
handler that catches the error, does some better logging for our users, and then throws the stop-the-world
and on the other topic - as for where the ecompromise and enoent are originating from, that's an even more difficult question than debugging this library has been 😅 we don't have a reliable reproduction for the "actual" issue - we just have a manual way to force it by manually deleting the lockfile. we would need to find out what kind of a concurrency issue is able to delete the lock from underneath us
a

alert-helicopter-61082

05/10/2023, 12:48 PM
Surely, we can't just need a stop-the-world crash, that's a bit crazy
a

ancient-diamond-80011

05/10/2023, 12:49 PM
we must, otherwise the writeFile atomic will happily continue
because we don't have real actual locks with a mutex
the writefile stuff will happily continue in its execution
a

alert-helicopter-61082

05/10/2023, 12:49 PM
Define real actual lock?
Anyhow, that's not an okay solution
a

ancient-diamond-80011

05/10/2023, 12:50 PM
the throw we are seeing is in an
updateLock
method which is defined as a
setTimeout
loop running in the background
if we catch it and reject, the
writeFile
execution will happily continue in the "other thread" (not exact terminology, but)
a

alert-helicopter-61082

05/10/2023, 12:50 PM
We just need to have a stance on when we feel it's okay to overwrite potentially conflicting changes and lose either of the concurrent writes
Considering what's being written there generally, it's not like it's mission critical to fully guarantee successful writes
c

cold-jordan-68753

05/10/2023, 12:52 PM
yeah, I think we can trust the atomic file write
just not the "concurrent" read+write operations we have in the critical section now
a

alert-helicopter-61082

05/10/2023, 12:54 PM
This lock is intended to avoid a change getting lost when a race happens between two writes. If that's a once in a while occurrence I don't believe it's a big deal.
I'd suggest our reaction to a "compromised" lock should be a warning and resetting of the lock.
Then we can go and see how and why exactly that might be happening, if it comes up more than we like
c

cold-jordan-68753

05/10/2023, 1:01 PM
yeah, it should be fine since at least we don't corrupt the file, just overwrite with some content that might be incorrect
a

ancient-diamond-80011

05/10/2023, 1:03 PM
one issue here is that even resetting the lock will throw (in an un-catchable way) if it has been destroyed by something else
a

alert-helicopter-61082

05/10/2023, 1:04 PM
I'm sure you can work that out 🙂 I suspect just deleting the lock path does the job
This is surely solvable without an ugly crash for the user
Either that or we just use some more basic file locking library
If this is one is more trouble than it's worth
Maybe it's overengineered for our super basic use case
a

ancient-diamond-80011

05/10/2023, 1:08 PM
yeah - this library hasn't been updated in 2 years, but it doesn't really look like there are alternatives that would be more maintained and in active use :/ https://www.npmjs.com/search?q=lock%20file
a

alert-helicopter-61082

05/10/2023, 1:09 PM
It is of course one of those things that shouldn't need constant maintenance, so that's not necessarily a concern