Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scalalib.ApeModule to produce APE (actually portable executable) assembly jar #4503

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

kiendang
Copy link

@kiendang kiendang commented Feb 7, 2025

Add an ApeModule to support producing actually portable executable assembly jars.

object foo extends JavaModule with scalalib.ApeModule
./mill show foo.assemblyApe

out.jar would run cross-platform (tested with arm Mac, x86 Linux and Windows (on cmd) with this example hello world project).

Copy link
Member

@lefou lefou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell what's the best name, but Ape feels a bit cryptic. It's IMHO not a well established term. Since we seem to use just one tool/technology jere, we might as well name the module after it. This is what we do in many other modules and it also leaves room for alternative implementations for APE. CosmoModule might be the better choice.

PathRef(Task.dest / "cosmocc" / "bin" / "cosmocc")
}

def launcherScript: T[Option[String]] = Task {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a prefix. Either ape or cosmo.

}
}

def assemblyApe: T[PathRef] = Task {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cosmoAssembly or apeAssembly. Depends on which name we choose above. Should be consistent though.

destJar: os.Path,
inputPaths: Agg[os.Path],
manifest: mill.api.JarManifest = mill.api.JarManifest.MillDefault,
prependShellScript: Option[String] = None,
prepend: Option[os.Source] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you need a byte array instead of a string. Woud be a Option[String|Array[Byte]] sufficient? Or do you expect large data here? A os.Source feels a bit too generic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option[String|Array[Byte]] would work as well. The data is around 300K in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lihaoyi What do you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.Source does feel a bit unusual, I think a 300kb byte array is probably fine for now

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basic question: What is the simplest way to pass a x: String | Array[Byte] to os.write? Looks like there is no implicit conversion for unions. os.write(..., x) does not work. x: os.Source, x.asInstanceOf[os.Source] doesn't work either. Is there any other way that's less clunky than x match { case x: String => os.write(..., x); case x: Array[Byte] => os.write(..., x) }?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the answer. Could be, this isn't fully implemented in upickle. But we could also either fallback to always use a byte array, which sounds a bit inconvenient for the other cases, or go with a Option[Either[Array[Byte],String]] for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just using byte array is fine. Strings can be converted with just .getBytes, so it's no big hardship. Definitely easier than wrapping in Eithers and Lefts and Rights

@kiendang
Copy link
Author

kiendang commented Feb 7, 2025

@lefou Thanks for the review. Admittedly this is just a POC PR I quickly drafted up to demonstrate the idea. I haven't thought too much about organizing things yet, which obviously needs fleshing out if this is to be developed further.

Just want to mention a few drawbacks to this approach

  • The prepended compiled binary is around 300KB, quite a bit larger than the current shell script.
  • As I have tested, the assembly out.jar runs on arm Mac, arm and x86 Linux. On x86 Windows it runs on CMD, but doesn't on PowerShell (it just doesn't show anything). I'm not a Windows user so not yet sure what went wrong or even where to look. I'm using the same hello world example project here.
  • Even binaries compiled with cosmocc have hiccups with a couple of environments (https://github.com/jart/cosmopolitan?tab=readme-ov-file#platform-notes)
  • My launcher script use execvp which evokes the execve syscall. According to this table. execve has some problem on Windows, but that might have been solved in a recent update.

@lihaoyi
Copy link
Member

lihaoyi commented Feb 8, 2025

In terms of naming, I think ActuallyPortableExecutableModule would be best. It's bit of an unusual/niche thing so having the name be verbose is probably better than conciseness.

300kb is fine; any Scala program gets 4-5mb from scala-library anyway, and most Java programs will have similar dependency trees. Mill's own executable assemblies bundle Coursier with Graal for a >60mb jar prepend and it works great

We need to see what's going on with Powershell to see if it can be resolved. Currently Powershell is the default shell for Windows Terminal, and the whole point of APEs is to work seamlessly on Windows environments where previously you would need to rename out.jar to out.bat. If it's technically impossible at the moment I don't mind closing out the bounty regardless, but we should dig in a bit to see why it doesn't work.

One end-to-end test that should be done is to build the Mill dist.executable jar with an APE prefix replacing the shell prefix, and verify that it can be run without the assembly needing to be renamed with a .bat extension.

@kiendang
Copy link
Author

kiendang commented Feb 8, 2025

@lihaoyi I figured out the problem with PowerShell. It looks like PowerShell strictly follows PATHEXT. It won't run any binary if the file extension is not included in PATHEXT. Changing the file name to out.jar.com works. This means that we would have this naming problem on PowerShell regardless of the file content. It's impossible to make ./out.jar run as is on PowerShell, be it APE or any other format, unless the user already set $Env:PATHEXT += ';.JAR'. Now I understand why in most of Cosmopolitan examples the files have the .com extension.

@kiendang
Copy link
Author

kiendang commented Feb 8, 2025

This section on PATHEXT is interesting

To ensure that scripts for another scripting language run in the current console session, add the file extension used by the scripting language. For example, to run Python scripts in the current console, add the .py extension to the environment variable. For Windows to support the .py extension as an executable file you must register the file extension using the ftype and assoc commands of the CMD command shell.

and I checked

cmd /c ftype jarfile

jarfile="C:\Program Files\Java\jdk-17\bin\javaw.exe" -jar "%1" %*

This means that if the user installs java using the Windows installer, which registers jarfile with ftype, and we could change the assembly zip to be able to run java -jar out.jar instead of java -cp out.jar <main class> then we could do ./out.jar on PowerShell. That's a big condition though. I suppose it wouldn't work if you get java via other methods, i.e. download the archive, through coursier, ...

@lefou
Copy link
Member

lefou commented Feb 8, 2025

In terms of naming, I think ActuallyPortableExecutableModule would be best

I don't like that or a shorter version ProtableExecModule. We should name it after the tool at work, CosmoModule. We also have ScalafmtModule, ProguardModule or JmhModule instead of FormatModule, ShrinkAndObfuscateModule or BenchmarkModule.

@lihaoyi
Copy link
Member

lihaoyi commented Feb 10, 2025

@kiendang if the extension matters, what if we name the output file .exe? That should be executable on all windows shells right? And since Linux/Mac doesn't look at extensions, it should work there as well?

@sake92
Copy link
Contributor

sake92 commented Feb 11, 2025

This is the default for PATHEXT: .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
So any of these should work: mill.com, mill.exe, mill.cmd.
Depends what we think makes most sense. Mill.cmd feels like middle ground to me.
Mill.com sounds like a website, mill.exe sounds windows specific.. might confuse people

@kiendang
Copy link
Author

@lihaoyi Yup that would work.

@sake92 I actually think .EXE is the more appropriate here. The APE file format is a binary file format. It even starts with MZ which is the executable format for .EXE files. .CMD would be shell script files to be run by the Windows CMD shell. I don't think using either extension would mean much on Linux/Unix, but on Windows using .CMD instead of .EXE extension for these binaries would not be fitting.

@ all, I have a bit of problem doing this

One end-to-end test that should be done is to build the Mill dist.executable jar with an APE prefix

One way of enabling dist.ape.executable with the least amount of code I can think of is having dist.ape "extends" dist via an export clause

dist/build.mill

package build.dist

object `package` extends RootModule with InstallModule {
  ...
  
  object ape
    extends RootModule
    with InstallModule
    with mill.scalalib.ApeModule {

    export `package`.{executableRaw as _, *}

    override def executableRaw = ...
  }
}

but looks like mill doesn't support export clauses? If I do ./mill show dist.ape.executable I got

1 tasks failed
generateScriptSources dist/package.mill:334:21 expected "}"
    export `package`.{executableRaw as _, *}
                    ^
dist/package.mill:334:21 expected "}"
    export `package`.{executableRaw as _, *}

also inside dist/build.mill I can't seem able to access the mill.scalalib.ApeModule trait I added. Even putting something simple like trait x extends mill.scalalib.ApeModule would give me

[build.mill-61/65] compile
[build.mill-61] [info] compiling 1 Scala source to /home/kien/mill/out/mill-build/compile.dest/classes ...
[build.mill-61] [error] /home/kien/mill/dist/package.mill:333:35: type ApeModule is not a member of package mill.scalalib
[build.mill-61] [error] did you mean BomModule, RunModule, or SbtModule?
[build.mill-61] [error]   trait ape extends mill.scalalib.ApeModule {
[build.mill-61] [error]                                   ^
[build.mill-61] [error] one error found

Not sure if I miss something.

@lihaoyi
Copy link
Member

lihaoyi commented Feb 11, 2025

@kiendang Let's go with @lefou's suggestion of CosmoModule?

I'm not sure about export, but your snippet looks wrong. You should not need to have nested RootModules, and you shouldn't need to use exports to share tasks between modules as you should use traits instead.

However, for now you can't use stuff defined in your PR in Mill's own {build,package}.mill scripts until we publish a new version and update .config/mill-version. So it should be fine to just exercise your code in unit/integration/example tests, and we can optionally integrate it into mill proper in a follow up

@kiendang
Copy link
Author

Added the tests to integration/feature/cosmopolitan.

I moved mill.scalalib.ApeModule to mill.scalalib.cosmopolitan.CosmopolitanModule. Almost everywhere in the Cosmopolitan official repo and documentation it's referred to as Cosmopolitan instead of Cosmo (which I think is just a informal shortened name) so I think it's a better fit. If you think this is too long I'll change it to CosmoModule (sorry for the bikeshedding).

Right now where should I cache the Cosmo toolchain? I see #3930 but looks like nothing is standardized yet?

@lihaoyi
Copy link
Member

lihaoyi commented Feb 12, 2025

Yes there's no standard global cache yet, so caching i the task.dest folder is fine for now

requests.get.stream(s"https://cosmo.zip/pub/cosmocc/${versionedCosmocc}.zip")
)
os.remove.all(Task.dest / versionedCosmocc)
os.call(("unzip", zip, "-d", Task.dest / versionedCosmocc))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use os.unzip instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try os.unzip but it doesn't preserve permissions. Would have to have an extra step to give cosmocc executable permission. cosmoc.zip also contains other files. I'm not sure if not having the original permissions changes anything (though unlikely).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with os.zip then and os.perms.set to fix anything necessary, and open an issue on OS-Lib about os.unzip not preserving permissions so we can fix that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this doesn't look like a simple fix. With os.unzip after giving executable perm to files inside cosmocc/bin compilation still fails with

x86_64-linux-cosmo-gcc: fatal error: cannot execute 'cc1': posix_spawnp: No such file or directory
compilation terminated.

Looks like os.unzip lose stuff beyond permissions, e.g. if I'm not wrong symlinks are not preserved but replaced with copied files, and I'm not sure what exactly fails the compilation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, using a subprocess is probably fine for now, but can you open an issie upstream

Copy link
Member

@lefou lefou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name you settled with is fine. But I'd like to have some more consistent naming. We now have the Cosmopolitan, cosmo and ape as prefix. Would be nice to get rid of ape in favor to cosmo.

@kiendang
Copy link
Author

One more thing. The current shell script approach uses the JAVA_OPTS env var: java $JAVA_OPTS -cp .... It's not possible to do the exact same thing with C's execvp. execvp("java", {"java", getenv("JAVA_OPTS", ...)}) just doesn't work if JAVA_OPTS contains multiple values. Parsing and splitting JAVA_OPTS ourselves is out of the question.

The cleanest approach is to use java argument files, i.e. write the values of JAVA_OPTS to a temp file then do java @<path/to/temp/file> .... Would have the exact same effect. However argument files require Java 9 minimum.

Another approach is to append JAVA_OPTS to JAVA_TOOL_OPTIONS or JDK_JAVA_OPTIONS which will be picked up by java. The former works with Java 8.

There is a very small edge case with appending to JAVA_TOOL_OPTIONS approach in which the behavior differs from the current shell script approach though. Command line arguments take precedence over JDK_JAVA_OPTIONS, which in turns takes precedence over JAVA_TOOL_OPTIONS, so in case the same argument is set in both JAVA_OPTS and JDK_JAVA_OPTIONS, the value in JDK_JAVA_OPTIONS will win whereas with the current shell script approach the value in JAVA_OPTS wins. This looks like a rare edge case though.

In summary there are 3 approaches

  • Argument files. Cleanest and ensures same behavior with the current shell script approach, but requires Java 9.
  • Append to JDK_JAVA_OPTIONS. Also requires Java 9, so should just use argument files instead.
  • Append to JAVA_TOOL_OPTIONS. Works with Java 8, but there's a small edge case where the behavior differs from the current shell script approach.

@kiendang kiendang marked this pull request as ready for review February 14, 2025 16:55
@sideeffffect
Copy link
Contributor

sideeffffect commented Feb 14, 2025

Parsing and splitting JAVA_OPTS ourselves is out of the question.

Maybe it's possible after all:

#include <stdio.h>
#include <stdlib.h>
#include <wordexp.h>

int main() {
    char *env_var = getenv("JAVA_OPTS");
    if (!env_var) {
        printf("JAVA_OPTS is not set.\n");
        return 1;
    }

    wordexp_t p;
    if (wordexp(env_var, &p, 0) != 0) {
        perror("wordexp failed");
        return 1;
    }

    // Print parsed options
    printf("Parsed JAVA_OPTS:\n");
    for (size_t i = 0; i < p.we_wordc; i++) {
        printf("  Arg[%zu]: %s\n", i, p.we_wordv[i]);
    }

    wordfree(&p); // Free allocated memory
    return 0;
}

Courtesy of ChatGPT
image

While wordexp doesn't seem to be available in Cosmpolitan, there are permissively-licensed implementations:

https://www.gnu.org/software/libc/manual/html_node/Wordexp-Example.html

@kiendang kiendang force-pushed the ape branch 2 times, most recently from ef7c23e to b37159b Compare February 15, 2025 07:10
@lihaoyi
Copy link
Member

lihaoyi commented Feb 19, 2025

Looks good to me overall. Just some nits and docs and I think we can merge it

@@ -0,0 +1,12 @@
package build

import mill._, javalib._, scalalib._
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put an alias javalib.cosmo so people don't need to import it from scalalib._ in Java projects

@kiendang kiendang force-pushed the ape branch 2 times, most recently from b985591 to f71da73 Compare February 20, 2025 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants