# 💻|contributing
s
@ancient-diamond-80011 @cold-jordan-68753 Did you get any further with the lockfile thing?
a
yes and no 😅
s
Maybe we should rope Jón in on this?
a
we have a manual kind-of-repro, created a new helper command with
```ts
  async action(params: CommandParams<{}, {}>): Promise<CommandResult> {
    const { garden } = params
    const store = garden.globalConfigStore
    // not awaiting here
    const configPromise = store.set("clientAuthTokens", "throw.example.com", {
      token: "",
      refreshToken: "",
      validity: new Date(),
    })

    throw new Error("oops")
  }
```
and then adding some `sleep()` to the `set()` method on the configstore, and then manually removing the lock directory with `rmdir` while the command is sleeping, because then in the `finally` block of `cli.ts` we call `delete` for the activeProcesses
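to illustrate, the injected delay was roughly of this shape (a sketch with stand-in names, not the actual configstore diff - `sleep` and the `set()` signature here are just placeholders):
```ts
// sketch of the injected delay, not the actual configstore code
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function set(section: string, key: string, value: unknown): Promise<void> {
  // give ourselves a window to rmdir the lock directory from another shell
  await sleep(10_000)
  // ...the real read-modify-write under the lock would happen here...
}
```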
but sadly the "repro" requires manually deleting the lock file - we haven't managed to reproduce the `ENOENT` and `ECOMPROMISED` combo with just the sleeps and throws and sets/deletes
need to figure out what can cause the lockfile to be removed before the finally block here runs https://github.com/garden-io/garden/blob/80f5b2f9ad8dbcc53be553c35220efac433b522e/cli/src/cli.ts#L48
also of note: we get the yoga layout error any time anything throws for any reason, so it is not related to the lockfile issue per se
```
/Users/walther/git/garden/node_modules/yoga-layout-prebuilt/yoga-layout/build/Release/nbind.js:53
        throw ex;
        ^
```
s
Gotcha. Could you write a more detailed description of the error and everything we know about it, and then tag Jón here?
... just to pre-empt any questions on what we're talking about here
a
sadly this is pretty much the extent of what we know :<
- we use the `globalConfigStore` somewhere
- somewhere, somehow we get a race condition - something tries to update a lock that has already been deleted, or delete a lock that has already been deleted
- very likely this is happening in the `finally` block of `cli.ts`
s
Let's have a chat about this later, @alert-helicopter-61082
We need to get this fixed as soon as possible, so let's keep at it until we've gotten to the bottom of it
... sort of can't ship Bonsai with this bug in it
a
another possibility we explored was whether we could be hitting the `isOverThreshold` conditions in the `proper-lockfile` library in the `updateLock` method, possibly from e.g. heavy synchronous code overloading the event loop and getting time skew on the `setTimeout` calls, leading to a difference between the expected time and the real time
but that is most likely not the case, considering in the real-world cases we've seen, we have definitely seen the `ENOENT` as part of the error messages
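for reference, a minimal illustration of that time-skew idea (not garden code, just a demo of timers drifting when the event loop is blocked):
```ts
// a blocked event loop delays timers, so a periodic "refresh the lock"
// setTimeout can fire much later than scheduled
const scheduledAt = Date.now()
setTimeout(() => {
  const lateBy = Date.now() - scheduledAt - 1_000
  console.log(`timer fired ~${lateBy}ms late`) // with the busy loop below, roughly 4000ms late
}, 1_000)

// block the event loop with synchronous work for ~5 seconds
const blockUntil = Date.now() + 5_000
while (Date.now() < blockUntil) {
  // busy wait
}
```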
well, in the worst case, we could also potentially work around this and try to just provide a nicer error message experience. @cold-jordan-68753 opened a draft PR for that https://github.com/garden-io/garden/pull/4211
reproducing this issue has been very hard - unless we figure out a reliable repro, it is probably nearly impossible to get to the bottom of this
also of note: the last commit on this library is from Jan 2021 https://github.com/moxystudio/node-proper-lockfile - it is not actively maintained and hasn't been for a couple of years. sadly we didn't really find other promising library options either
we also took a brief look at the `onExit` handler in the lockfile library, but it's also probably not related - considering we arrive at this error without external signals such as Ctrl-C or similar
so while there are open issues like https://github.com/moxystudio/node-proper-lockfile/issues/111 and PR https://github.com/moxystudio/node-proper-lockfile/pull/112, it is most likely unrelated - i believe the `onExit` handler will not be called in case of a thrown exception, only on signals (but i could be wrong here)
(and besides, in the `onExit` handler any thrown errors are caught and not bubbled up, so we wouldn't see any errors from there)
a
I think one clear thing to do is to try/catch around some of the usages of the config stores, where we're sort of fine with a failure. A good example of just that is the cleaning up of the active process record. That plus manually calling the `onExit` handler (if we can) at the end of a CLI execution.
Those would be the first things I'd try at least
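a rough, untested sketch of that first idea - the `delete(section, key)` shape here is an assumption based on the `set()` snippet above:
```ts
// defensive wrapper for the active-process cleanup in the cli.ts finally block
async function cleanupActiveProcessRecord(store: {
  delete: (section: string, key: string) => Promise<void>
}): Promise<void> {
  try {
    await store.delete("activeProcesses", String(process.pid))
  } catch (err) {
    // losing this cleanup is acceptable; it should never take the whole CLI down
    console.warn(`Could not clean up active process record: ${err}`)
  }
}
```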
c
yeah, we can add that to the code, there is also an `onCompromised` callback that we can potentially use instead of or in addition to the try/catch.
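something along these lines, sketched against proper-lockfile's documented options (untested):
```ts
import { lock } from "proper-lockfile"

// pass our own onCompromised instead of the library default,
// which throws from a background timer and takes the whole process down
async function lockConfig(path: string) {
  return lock(path, {
    retries: 5,
    stale: 5000,
    onCompromised: (err) => {
      // decide here: log and rethrow, reject a surrounding promise, etc.
      console.warn(`Config store lock compromised: ${err.message}`)
    },
  })
}
```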
a
Regarding the yoga thing, we may just need to update Ink: https://github.com/vadimdemedes/ink/releases
Ugh crap, it's now pure ESM... I wonder if there's anything we can do more easily to support individual packages that are like that
The docs suggest we can use `await import(...)` for it, which should be fine for our use case I think
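for example, something like this (a sketch; the caveat about the compiler rewriting `import()` depends on our module settings):
```ts
// loading the pure-ESM ink package from our CommonJS build
async function loadInk() {
  // note: when targeting CommonJS, the TypeScript compiler rewrites import()
  // into require() unless the call is preserved via the module settings
  const { render } = await import("ink")
  // ...use render(...) as before...
  return render
}
```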
a
yeah the ESM stuff is becoming less and less avoidable at a fast pace :/
s
The mandatory ESM thing strikes me as a bizarre design decision. What's the value in forcing users to adapt to it? Does it really simplify anything?
Anyway, that's an irrelevant tangent—my bad
a
well, regardless of the reasons, it is getting increasingly common in the ecosystem and it will be impossible to avoid when updating a lot of our old packages. we have to make our build process support it eventually (and hopefully fairly soon)
s
Yeah
@ancient-diamond-80011 @cold-jordan-68753 Are we getting closer to a fix here?
a
some small steps towards a better direction earlier this morning, and continuing to pair right after this core daily call i'm on right now
current hypotheses:
- lockfile library manages to throw outside a promise (no reject)
- yoga library overrides the uncaught exception handler of node
- so in some cases we are unable to catch errors in our cli.ts - we just get properly thrown out of node
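a quick diagnostic sketch that could confirm or rule out the handler-override hypothesis (assuming importing ink is enough to trigger it):
```ts
// compare the registered uncaughtException listeners before and after loading ink
async function checkUncaughtHandlers() {
  const before = process.listeners("uncaughtException").map((fn) => fn.name)
  await import("ink") // the suspect package - adjust as needed
  const after = process.listeners("uncaughtException").map((fn) => fn.name)
  console.log({ before, after })
}
```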
s
Ok, gotcha. I think it may be time to recruit more team members to pile on top of this.
Eysi was telling me about a Node package patching library that Haukur recommends
And that this might make things simpler from a compliance standpoint
He thinks it might be this one: https://www.npmjs.com/package/patch-package
... just for when we've come up with the patches we want to apply and need to decide how to apply / maintain them
a
We use that already for the pkg library/tool actually
But yeah, I think trying to update the ink library would be a sensible step, along with try/catching some of the config store calls
Would be a good couple of steps to try, and we can see if that moves us along
s
The problem seems to be that we're not able to catch an exception that's thrown when the lockfile is compromised (or at least we haven't found a way to catch it yet).
So right now, we're still figuring out what changes we can make to the lockfile library that can remedy that situation. If we figure that out, we can then decide how to wrap the whole thing in a civilised way.
@ancient-diamond-80011 @cold-jordan-68753 Let me know when you're back from lunch and ready to hop back on a call
a
good timing, literally just finished 🍽️
s
Come to think of it, we might need to do something like `process.on("uncaughtException", (...) => ...)`. I have a feeling this is fundamentally because of the async nature of the `setTimeout` that's happening deep in the library.
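for reference, the shape of that idea would be something like this (a sketch; as noted just below, probably a band-aid rather than a fix):
```ts
// process-wide handler for anything that escapes our try/catch in cli.ts
process.on("uncaughtException", (err) => {
  console.error(`Uncaught exception: ${err.message}`)
  process.exitCode = 1
})
```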
Be right there
a
disagree - we should find the cause for the uncaught exception slipping between our try-catch fingers
s
... yeah, the above is probably a bad idea.
@cold-jordan-68753 @alert-helicopter-61082 Shall we take a look at this together at 13:45?
Veeti and I are getting a bit boiled, going to take a breather
c
Yeah, sounds good
a
current knowledge: the issue is located here https://github.com/moxystudio/node-proper-lockfile/blob/9f8c303c91998e8404a911dc11c54029812bca69/lib/lockfile.js#L114-L127
- we throw within the `setLockAsCompromised` function (and need to do so)
- that throw happens within the broader context of a `setTimeout` call on line 109, so it is happening in an "outside" async context, not caught by surrounding try-catches
- however! if we monkeypatch the `setTimeout` call to have internal try-catches as well, it still doesn't help
- of note: we are also within a callback function definition on the `options.fs.stat()` call (line 114)
- it is also possible the execution of this stat call and the callback after it is handled in a separate async context in `graceful-fs` and also ends up outside what we can catch
basically, even this kind of "wrap everything in try catches" and "inside the callback, shove the error in a variable to be thrown later" trickery will not work, because our control flow will go past the `if (asyncError)` check before the callback completes
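a minimal standalone illustration of why that trick can't work:
```ts
import * as fs from "fs"

// the synchronous check runs before the async callback has had any chance to fire
function statAndMaybeThrow(path: string): void {
  let asyncError: Error | undefined

  fs.stat(path, (err) => {
    if (err) {
      asyncError = err // assigned on a later tick...
    }
  })

  if (asyncError) {
    // ...so this check always sees undefined on the current tick
    throw asyncError
  }
}
```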
s
I'll send out a calendar invite, do this in a civilised way
Haha
c
Since the lock is compromised it means we can't guarantee the operations in the critical section, so if we want to be safe we need to fail-stop at that point.
a
yes - this was not about returning to execution without throwing, this was about trying to make sure we can catch the exception, and bubble it up into our main context instead of exploding in the "outside context" of an async function
c
yeah, thanks for clarifying 🙂
a
something in the `fs.stat` callback handler (where it calls `apply` and so on) seems to be happening in a separate async context, and we need to bring that back into our main context
a
Ahead of that, it looks to me like the issue is in the promisification of the underlying lock function in the library. Basically we'd ideally need to be able to reject the lock promise in the onCompromised callback, and I see no clear way to do that except to use the underlying callback-based function.
The default onCompromised callback just throws the error, which inevitably crashes the process
Maybe something like this works:
```typescript
  private async lock() {
    const path = this.getConfigPath()
    await ensureFile(path)
    return new Promise<() => Promise<void>>((resolve, reject) => {
      lock(path, { retries: 5, stale: 5000, onCompromised: reject }).then(resolve).catch(reject)
    })
  }
```
s
@ancient-diamond-80011 Could you give that a try?
a
thanks, trying now 👍
c
yes, we are actually trying exactly that 🙂
a
I didn't test this at all, but you get the general idea at least
a
> inevitably crashes the process

yeah - we do need to throw, we just need to do so on our own terms / in our own context (instead of in the separate async microtask / setTimeout / etc)
a
Yeah, this default onCompromised callback seems intended to kill the process, since that's inevitably an uncaught exception
I really don't think we need to pry into the underlying library though
That's at least clear then
That do the trick?
Or at least solve that particular crash?
a
yes and no :/ using that sort of a reject removes the crash, but it also makes the code continue past what the lock is meant to protect (it will write the file)
sadly, it might be inevitable that we need to do a stop-the-world crash there, and only write our own `onCompromised` handler that catches the error, does some better logging for our users, and then throws the stop-the-world
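the rough shape of that handler could be something like this (sketch; the wording and logging are illustrative):
```ts
// stop-the-world, but on our own terms and with a friendlier message
function onCompromised(err: Error): never {
  console.error(
    "The global config store lock was compromised - another process may have removed it. " +
      "Aborting because we can no longer guarantee the write is safe."
  )
  // we still have to halt: without the lock we cannot guarantee the critical section
  throw err
}
```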
and on the other topic - as for where the ECOMPROMISED and ENOENT are originating from, that's an even more difficult question than debugging this library has been 😅 we don't have a reliable reproduction for the "actual" issue - we just have a manual way to force it by manually deleting the lockfile. we would need to find out what kind of a concurrency issue is able to delete the lock from underneath us
a
Surely, we can't just need a stop-the-world crash, that's a bit crazy
a
we must, otherwise the writeFile atomic will happily continue
because we don't have real actual locks with a mutex
the writefile stuff will happily continue in its execution
a
Define real actual lock?
Anyhow, that's not an okay solution
a
the throw we are seeing is in an `updateLock` method, which is defined as a `setTimeout` loop running in the background
if we catch it and reject, the `writeFile` execution will happily continue in the "other thread" (not exact terminology, but)
a
We just need to have a stance on when we feel it's okay to overwrite potentially conflicting changes and lose either of the concurrent writes
Considering what's being written there generally, it's not like it's mission critical to fully guarantee successful writes
c
yeah, I think we can trust the atomic file write
just not the "concurrent" read+write operations we have in the critical section now
a
This lock is intended to avoid a change getting lost when a race happens between two writes. If that's a once in a while occurrence I don't believe it's a big deal.
I'd suggest our reaction to a "compromised" lock should be a warning and resetting of the lock.
Then we can go and see how and why exactly that might be happening, if it comes up more than we like
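a rough sketch of that suggestion (untested; assumes proper-lockfile's default `<path>.lock` lock directory naming and fs-extra, which the earlier snippet already uses for `ensureFile`):
```ts
import { remove } from "fs-extra"
import { lock } from "proper-lockfile"

// warn and reset instead of crashing when the lock is compromised
async function lockWithWarning(path: string) {
  return lock(path, {
    retries: 5,
    stale: 5000,
    onCompromised: async (err) => {
      console.warn(`Config store lock compromised, resetting it: ${err.message}`)
      await remove(`${path}.lock`).catch(() => {}) // best-effort; it may already be gone
    },
  })
}
```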
c
yeah, it should be fine since at least we don't corrupt the file, just overwrite with some content that might be incorrect
a
one issue here is that even resetting the lock will throw (in an un-catchable way) if it has been destroyed by something else
a
I'm sure you can work that out 🙂 I suspect just deleting the lock path does the job
This is surely solvable without an ugly crash for the user
Either that or we just use some more basic file locking library
If this one is more trouble than it's worth
Maybe it's overengineered for our super basic use case
a
yeah - this library hasn't been updated in 2 years, but it doesn't really look like there are alternatives that would be more maintained and in active use :/ https://www.npmjs.com/search?q=lock%20file
a
It is of course one of those things that shouldn't need constant maintenance, so that's not necessarily a concern