Is glob() a real normal function or a special magic thing in WDL 1.0? #680
Comments
Interesting...I think the intention has always been for I guess it does need to be clarified that
So my suggestion is:
How is the implementation for this envisioned to work? The spec for
Implementing 1 requires a bunch of capabilities from your container runner: either you need to be able to launch several processes from outside against the same filesystem, or you need to get an interactive connection to the running container. And if you do WDL file creation outside the container, you might need a way to mount more files dynamically, if you want to support something like

Implementing 2 is a lot of work; your WDL runner needs to do something like ship a static Linux binary or portable Python installation to mount into each container to evaluate the outputs in there, or transpile a lot of the outputs section to Bash.

And implementing 3 is hard to do efficiently. You don't want to be uploading

We're looking into getting Toil's WDL runner going using GA4GH TES as a backend, and specifically Microsoft's implementation which doesn't support running multiple "executors" in a row and can only run one command in one container, with input and output file and directory paths that must be known at submission time. AWS Batch also only supports running a single command in a single container. And while Kubernetes lets you run multiple containers that interact, I think getting a Toil container to chat live with a user container is going to involve some exciting socket-to-Bash jury-rigging or fancy agent process injection.

Right now in terms of implementations, it looks like Cromwell is taking a transpile-to-Bash approach, but they didn't implement that for full recursive expression trees and they require pre-computable arguments. MiniWDL has a
I would probably change the glob rules to:
So a valid and passably efficient runner implementation would be to dump the whole container work directory to S3, pick through the S3 file listing with a hand-coded glob matching implementation, and then delete anything not needed.
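As a rough illustration of the "pick through the file listing" step, here is a minimal Python sketch. It assumes the runner already has the uploaded object keys as paths relative to the task work directory; the function name `match_glob` and the sample listing are hypothetical and do not come from any existing runner. It matches one path segment at a time so that `*` never crosses a `/`, and it skips dotfiles unless the pattern segment itself starts with a dot, which approximates default Bash globbing. It is not a full Bash glob implementation (no brace expansion, no `**`).

```python
from fnmatch import fnmatchcase


def match_glob(pattern: str, relative_paths: list[str]) -> list[str]:
    """Return the listed paths that match `pattern`, segment by segment.

    Hypothetical helper: `relative_paths` stands in for an object-store
    listing whose keys are relative to the task work directory.
    """
    pattern_parts = pattern.split("/")
    matches = []
    for path in relative_paths:
        parts = path.split("/")
        # A pattern with N segments can only match a path with N segments,
        # because "*" is not allowed to match across "/".
        if len(parts) != len(pattern_parts):
            continue
        ok = True
        for part, pat in zip(parts, pattern_parts):
            # Approximate Bash: wildcards do not match a leading dot.
            if part.startswith(".") and not pat.startswith("."):
                ok = False
                break
            if not fnmatchcase(part, pat):
                ok = False
                break
        if ok:
            matches.append(path)
    # Plain lexicographic order; Bash sorts according to the locale.
    return sorted(matches)


listing = ["b.txt", "a.txt", "sub/c.txt", "notes.md", ".hidden.txt"]
txt_files = match_glob("*.txt", listing)
nested_txt_files = match_glob("*/*.txt", listing)
```

Anything in the listing that no output glob matches could then be deleted from the bucket, as suggested above.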
I don't think the spec is trying to imply that an engine developer should implement
Using Python's
So the runner is supposed to return different results for
Oh, right. I'm guessing this is a case where the spec reflects the Cromwell implementation rather than having considered the right way to do it. The current specified behavior for
I'll close out the issue then; it sounds like the answer to the original question is that
@adamnovak I agree, this looks like a bug in Cromwell. There's nothing that should stop the nesting working, other than an implementation detail in Cromwell (to do with creating a name for a list-of-glob-results output) that could be changed. Other than that, your intuition for how things should work is spot on |
In the WDL 1.0 spec, glob() doesn't appear in the builtin function list, but in the section that talks about it, it is described as a "function".

If it really is a normal function, I should be able to put it anywhere I want in an expression (within the outputs section), including inside itself, or in a context where its argument depends on reading files left by the task.

But if it is actually not a function but a special piece of syntax (or if it is a function with special rules about what its argument may depend on) it might not be able to nest or have an argument depend on task output.
I have this WDL workflow:
Cromwell can't run this; it fails because it "Cannot perform globing while evaluating files". Cromwell seems to implement glob() by evaluating the glob() argument before running the task command, in a context where further calls to glob() aren't allowed (and wouldn't work anyway). Then it turns the glob() call into some Bash to tack on to the command script, to put the files in a place it can collect them from, even if using a backend like TES where all output files and directories need to be known at task submission time.

It seems like maybe, at least at WDL 1.0, this sort of non-function implementation of glob() is possibly intended to be allowed? It certainly makes implementing WDL on TES easier if glob arguments are allowed to be evaluated before the command is run.

But in 1.1 glob() appears in the function list with all the other functions, restricted to tasks but without any other special notes about what you can pass it.

Does the spec, at any version, allow the runner to cut corners on its glob() implementation to guarantee that it can know the glob it needs to run before it starts the task command?
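To make the "evaluate the glob argument first, then tack Bash onto the command script" idea concrete, here is a minimal Python sketch of what such a runner-side step could look like. This is not Cromwell's actual code; the function name `glob_collection_script` and the destination directory are hypothetical, and it assumes the pattern is already a fully evaluated string with no spaces and that hard links work between the work directory and the destination.

```python
import shlex


def glob_collection_script(pattern: str, dest_dir: str) -> str:
    """Return Bash that hard-links files matching `pattern` into `dest_dir`.

    Hypothetical helper: the pattern must be known before the task
    command runs, which is exactly the restriction discussed above.
    """
    quoted_dest = shlex.quote(dest_dir)
    return "\n".join([
        f"mkdir -p {quoted_dest}",
        # The pattern is left unquoted on purpose so Bash expands it.
        f"for f in {pattern}; do",
        # When nothing matches, Bash leaves the pattern as literal text,
        # so check that the path exists before linking it.
        '  if [ -e "$f" ]; then',
        f'    ln "$f" {quoted_dest}/',
        "  fi",
        "done",
    ])


script = glob_collection_script("out/*.txt", "/tmp/glob results")
print(script)
```

Because the destination directory is fixed before submission, a backend such as TES can be told about it as a single output directory, which is why this approach cannot support a glob() whose argument depends on files the task writes.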