RFC: Compute only in int32/long/float/double for portable ops to save size #9635

swolchok · 2025-03-25T23:56:52Z

Concern: what if we're running on some sort of 16-bit microcontroller where this is a pessimization?

swolchok · 2025-03-25T23:56:53Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-03-25T23:56:56Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9635

❌ 5 New Failures, 11 Pending

As of commit ac64f9e with merge base 811352d:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner / linux-job (gh)
>>> Lint for kernels/portable/cpu/util/elementwise_util.h:
pull / unittest / linux / linux-job (gh)
[ FAILED ] OpRSubScalarOutTest.ShortTensors
pull / unittest / macos / macos-job (gh)
[ FAILED ] OpRSubScalarOutTest.ShortTensors
pull / unittest-editable / linux / linux-job (gh)
[ FAILED ] OpRSubScalarOutTest.ShortTensors
pull / unittest-editable / macos / macos-job (gh)
[ FAILED ] OpRSubScalarOutTest.ShortTensors

This comment was automatically generated by Dr. CI and updates every 15 minutes.

swolchok · 2025-03-25T23:58:26Z

> some sort of 16-bit microcontroller

wasn't sure if I was making things up, so: https://developer.arm.com/Processors/Ethos-U55 is a real example from the present day

swolchok · 2025-03-25T23:59:44Z

size impact: on my mac, test/build_size_test.sh reports that size_test_all_ops has size 1205136 before this PR and 1105856 after, a decrease of around 8%
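For reference, the percentage follows directly from the two sizes quoted above. A quick sketch of the arithmetic (the variable names are only illustrative, not anything from the build scripts):

```python
# Sizes of size_test_all_ops as reported in the comment above, in bytes.
size_before = 1205136
size_after = 1105856

# Absolute and relative savings from the change.
saved_bytes = size_before - size_after
saved_percent = 100 * saved_bytes / size_before

print(f"saved {saved_bytes} bytes ({saved_percent:.1f}%)")
```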

swolchok · 2025-03-26T00:02:14Z

This is known to break tests. I think that's because it breaks SupportedTensorDtypes::SAME_AS_COMMON for reasons outlined in #9613, hence the RFC status. The problem is fixable, but if we have directional concerns with this, I don't want to invest in fixing it.

swolchok · 2025-03-26T16:45:36Z

Per discussion with @manuelcandales, if we do this then we need to cast through the "actual" compute type before casting to the output type so that we match ATen. Example:

torch.ops.aten.mul(torch.tensor([100], dtype=torch.int8), torch.tensor([100], dtype=torch.int8), out=torch.zeros([1], dtype=torch.long))
tensor([16])

computing in int32 or int16 would cause this to yield 10000, not 16; casting through int8 would correct this.
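To make the wraparound concrete without needing torch installed, here is a minimal sketch that emulates the int8 truncation with the standard-library `ctypes` module. `wrap_to_int8` is a hypothetical helper written for this illustration, not an ExecuTorch or ATen function, and the sketch models only the arithmetic, not the real kernel code path.

```python
import ctypes


def wrap_to_int8(value: int) -> int:
    """Truncate an integer to a signed 8-bit value (two's-complement wraparound)."""
    return ctypes.c_int8(value).value


a, b = 100, 100  # both inputs fit in int8

# Computing in a wider type (e.g. int32) and storing straight to the long output
# keeps the full product.
wide_result = a * b

# Casting through the "actual" compute type (int8) before widening to the output
# keeps only the low 8 bits: 10000 mod 256 = 16.
aten_like_result = wrap_to_int8(a * b)

print(wide_result, aten_like_result)
```

The second value is what the ATen call above returns, which is why the extra cast would be needed to stay bit-for-bit compatible.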

swolchok · 2025-03-26T17:08:56Z

This is a bad idea because smaller compute dtypes benefit from additional SIMD lanes.
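A rough sketch of the lane arithmetic behind that point, assuming a 128-bit vector register (as on NEON or SSE). The register width is an assumption for illustration, and actual speedups depend on the target and on whether the compiler vectorizes the loop at all.

```python
# Assumption: a 128-bit SIMD register; wider registers scale the counts proportionally.
VECTOR_BITS = 128

# Bit widths of the integer compute types discussed in this PR.
element_bits = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}

# Number of elements processed per vector instruction for each compute type.
lanes = {name: VECTOR_BITS // bits for name, bits in element_bits.items()}

for name, count in lanes.items():
    print(f"{name}: {count} lanes per {VECTOR_BITS}-bit vector")
```

Under that assumption, promoting int8 compute to int32 cuts the elements handled per vector operation from 16 to 4.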

digantdesai · 2025-03-27T20:54:06Z

kernels/portable/cpu/util/elementwise_util.h

+  // Gate above optimization off if we appear to be on some kind of 8-bit or
+  // 16-bit CPU, which would invalidate our assumption about 32-bit
+  // math being just as fast.
+  constexpr bool cpu_appears_to_be_at_least_32_bit = sizeof(void*) >= 4 && sizeof(int) >= 4;


Update

6b0e11f

swolchok requested a review from manuelcandales as a code owner March 25, 2025 23:56

facebook-github-bot added the CLA Signed label Mar 25, 2025

This was referenced Mar 25, 2025

elementwise_util: s/common/compute/ almost everywhere and deprecate SAME_AS_COMPUTE #9613

Merged

Deprecate non-internal elementwise_util APIs #9621

Merged

Strip size_test binaries and report their sizes in build_size_test.sh #9633

Closed

Update

ac64f9e

swolchok closed this Mar 26, 2025

digantdesai reviewed Mar 27, 2025
