Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is glob() a real normal function or a special magic thing in WDL 1.0? #680

Closed
adamnovak opened this issue Jul 26, 2024 · 8 comments
Closed

Comments

@adamnovak
Copy link
Collaborator

adamnovak commented Jul 26, 2024

In the WDL 1.0 spec, glob() doesn't appear in the builtin function list, but in the section that talks about it, it is described as a "function".

If it really is a normal function, I should be able to put it anywhere I want in an expression (within the outputs section), including inside itself, or in a context where its argument depends on reading files left by the task.

But if it is actually not a function but a special piece of syntax (or if it is a function with special rules about what its argument may depend on) it might not be able to nest or have an argument depend on task output.

I have this WDL workflow:

version 1.0

workflow globRecursion {
    input {

    }
    call test_task {
    }
    output {
        Array[File] result = test_task.result
    }
}

task test_task {
    input {
    }

    command <<<
        echo "-is-*.txt" >globdata.txt
        echo "cooool" >math-is-cool.txt
        echo "uhoh" >math-is-hard.txt
        echo "yum" >lunch-is-cool.txt
        echo "lunch" >>globdata.txt
    >>>

    output {
        # Do some logic in the glob argument that requires globs to be able to depend on each other
        Array[File] result = glob(read_lines(glob("*data*")[0])[1] + read_lines(glob("glob*")[0])[0])
    }

    runtime {
        # Constrain Bash
        docker: "ubuntu:24.04"
    }
}

Cromwell can't run this; it fails because it Cannot perform globing while evaluating files. Cromwell seems to implement glob() by evaluating the glob() argument before running the task command, in a context where further calls to glob() aren't allowed (and wouldn't work anyway). Then it turns the glob() call into some Bash to tack on to the command script, to put the files in a place it can collect them from, even if using a backend like TES where all output files and directories need to be known at task submission time.

It seems like maybe, at least at WDL 1.0, this sort of non-function implementation of glob() is possibly intended to be allowed? It certainly makes implementing WDL on TES easier if glob arguments are allowed to be evaluated before the command is run.

But in 1.1 glob() appears in the function list with all the other functions, restricted to tasks but without any other special notes about what you can pass it.

Does the spec, at any version, allow the runner to cut corners on its glob() implementation to guarantee that it can know the glob it needs to run before it starts the task command?

@jdidion
Copy link
Collaborator

jdidion commented Aug 7, 2024

Interesting...I think the intention has always been for glob() to be a regular function, meaning it should be evaluated when it is called and not in advance.

I guess it does need to be clarified that glob will have different results depending on where it is called:

  • In the command section, it will evaluate all the files in the execution directory prior to the execution of the command. But, there should be no reason to use it instead of a bash glob.
  • In the output section, it will evaluate all the files in the execution directory following the successful execution of the command.
  • In an input or private declaration, we would like it to be that an expression containing a glob call is never evaluated until after all expressions that write files are evaluated. However, I'm not sure it's possible to guarantee this, so we might just want to disallow it.

So my suggestion is:

  1. Clarify that the behavior of glob is only defined in the output section, and that calling glob anywhere else might lead to undefined behavior and is deprecated and will be disallowed in WDL 2.0.
  2. Clarify that glob must be evaluated only after the task completes, and thus it is not allowed to implement glob by e.g. generating bash code that is appended to the end of the command.

@adamnovak
Copy link
Collaborator Author

How is the implementation for this envisioned to work?

The spec for glob() goes out of its way to specify that this does exactly what the container's bash binary does, which suggests that it is meant to be implemented by actually running the glob under that bash binary. If control flow is allowed to bounce between evaluating globs under the container's bash binary and other arbitrary WDL functions, I think you have three implementation options:

  1. Have the WDL runner implementation outside the container communicate interactively with the running container.
  2. Inject the whole expression evaluator part of the WDL runner into the container, so the container can run to completion non-interactively.
  3. Save the entire container filesystem so that you can run multiple containers against it.

Implementing 1 requires a bunch of capabilities from your container runner: either you need to be able to launch several processes from outside against the same filesystem, or you need to get an interactive connection to the running container. And if you do WDL file creation outside the container, you might need a way to mount more files dynamically, if you want to support something like glob(write_lines(glob("*.txt")) + "*").

Implementing 2 is a lot of work; your WDL runner needs to do something like ship a static Linux binary or portable Python installation to mount into each container to evaluate the outputs in there, or transpile a lot of the outputs section to Bash.

And implementing 3 is hard to do efficiently. You don't want to be uploading /tmp/ to persistent storage in e.g. Amazon S3, just because the workflow might evaluate glob("/tmp/" + read_lines("tempdir_name.txt")[0] + "/outputs/*.bam"). And even if glob arguments are not allowed to go above the task's working directory, a task might still dump large files in its working directory that are not outputs, and sending them to persistent storage and back, possibly several times if there are several levels of globbing to do, is wasteful.

We're looking into getting Toil's WDL runner going using GA4GH TES as a backend, and specifically Microsoft's implementation which doesn't support running multiple "executors" in a row and can only run one command in one container, with input and output file and directory paths that must be known at submission time. AWS Batch also only supports running a single command in a single container. And while Kubernetes lets you run multiple containers that interact, I think getting a Toil container to chat live with a user container is going to involve some exciting socket-to-Bash jury-rigging or fancy agent process injection.

Right now in terms of implementations, it looks like Cromwell is taking a transpile-to-Bash approach, but they didn't implement that for full recursive expression trees and they require pre-computable arguments. MiniWDL has a glob() that acts like an ordinary function, but it doesn't follow the rule about acting like the container's Bash; they evaluate the glob with Python's globbing implementation, and they restrict upwards path traversals to try to keep you confined to the task work directory mount point. Toil does actually invoke Bash for globbing, but doesn't (yet) do it in the user-supplied container with the right Bash, or handle/protect against upwards traversals above what was mounted.

@adamnovak
Copy link
Collaborator Author

I would probably change the glob rules to:

  1. Instead of echo <glob> as the definition, the glob pattern has defined semantics for *, ?, and [] wildcards, and for the sort order of results, targeting the lowest-common-denominator Bash/Python behavior. It shouldn't depend on the shipped bash binary. It can't be recursive.
  2. The glob is not allowed to match anything not physically in the container's work directory, accessible wherever it might be mounted. No .., no leading /, no traversing symlinks that point outside the work directory, and no traversing symlinks to absolute paths that depend on the work directory being at a particular absolute path. (Though leaf symlinks to input files themselves might be allowable.)
  3. Workflows SHOULD NOT (but still can) leave non-output files sitting around in the work directory.

So a valid and passably efficient runner implementation would be to dump the whole container work directory to S3, pick through the S3 file listing with a hand-coded glob matching implementation, and then delete anything not needed.

@jdidion
Copy link
Collaborator

jdidion commented Aug 7, 2024

I don't think the spec is trying to imply that an engine developer should implement glob(...) by replacing it with a call to Bash's glob, only that the results should be functionally equivalent. Can you use Python's glob module?

@adamnovak
Copy link
Collaborator Author

Using Python's glob() works fine in practice for all the workflows I've seen. But the spec at 1.1 does insist that you do exactly what bash in the container would do:

Non-standard Bash
The runtime container may use a non-standard Bash shell that supports more complex glob strings, such as allowing expansions that include a_inner.txt in the example above. To ensure that a WDL is portable when using glob, a container image should be provided and the WDL author should remember that glob results depend on coordination with the Bash implementation provided in that container.

So the runner is supposed to return different results for glob() depending on what the bash in the provided container image actually does. If you want to implement that on top of Python's glob module, you'd need a way to detect what kind of bash the container has and figure out what algorithm the task expects you to use to evaluate the glob. (I'm not sure where people are getting builds of bash that recursively evaluate * globs by default, but the spec seems to think they exist.)

@jdidion
Copy link
Collaborator

jdidion commented Aug 8, 2024

Oh, right. I'm guessing this is a case where the spec reflects the Cromwell implementation rather than having considered the right way to do it. The current specified behavior for glob() should be deprecated and replaced in WDL 2.0 with the requirement that the implementation returns results identical to GNU Bash. For now it seems that glob() will need to be implemented by shelling out to Bash and parsing the output.

@adamnovak
Copy link
Collaborator Author

I'll close out the issue then; it sounds like the answer to the original question is that glob() is indeed intended to act like any other function, and that nobody actually has a conformant implementation of it, so the spec might be adjusted for 2.0.

@cjllanwarne
Copy link
Contributor

@adamnovak I agree, this looks like a bug in Cromwell. There's nothing that should stop the nesting working, other than an implementation detail in Cromwell (to do with creating a name for a list-of-glob-results output) that could be changed. Other than that, your intuition for how things should work is spot on

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants