AMDGPU is missing simplify demanded bits optimizations of readfirstlane and similar operations #128390

Open
arsenm opened this issue Feb 23, 2025 · 4 comments
Labels: backend:AMDGPU, good first issue (https://github.com/llvm/llvm-project/contribute), missed-optimization

@arsenm (Contributor) commented Feb 23, 2025

We are missing known bits and demanded bits optimizations that look through readfirstlane, readlane, and DPP operations. To produce a legal type (i32 in the examples below), narrower values must be extended before they reach these operations, and those extensions insert instructions that set the high bits of the input value. These high bits are never needed, so starting from the use context of the trunc, we should be able to delete the extension.

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s

target triple = "amdgcn-amd-amdhsa"

declare i32 @llvm.amdgcn.readfirstlane.i32(i32)

; v_and_b32_e32 v0, 0xff, v0  ; Should be able to delete this
; v_readfirstlane_b32 s4, v0
; v_mov_b32_e32 v0, s4
define i8 @readfirstlane_demanded_i8_zext(i8 %src) {
  %zext = zext i8 %src to i32
  %readfirstlane = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %zext)
  %trunc = trunc i32 %readfirstlane to i8
  ret i8 %trunc
}

; v_bfe_i32 v0, v0, 0, 8      ; Should be able to delete this
; v_readfirstlane_b32 s4, v0
; v_mov_b32_e32 v0, s4
define i8 @readfirstlane_demanded_i8_sext(i8 %src) {
  %sext = sext i8 %src to i32
  %readfirstlane = call i32 @llvm.amdgcn.readfirstlane.i32(i32 %sext)
  %trunc = trunc i32 %readfirstlane to i8
  ret i8 %trunc
}


This should be accomplished by implementing SimplifyDemandedBitsForTargetNode in SITargetLowering. It should handle INTRINSIC_WO_CHAIN operations, with Intrinsic::amdgcn_readfirstlane as the base example.

For example, these patterns appear in the tests from #128388.

@arsenm added the backend:AMDGPU, good first issue, and missed-optimization labels on Feb 23, 2025
@llvmbot (Member) commented Feb 23, 2025

@llvm/issue-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)


@llvmbot (Member) commented Feb 23, 2025

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. Check that no other contributor has already been assigned to this issue. If you believe that no one is actually working on it despite an assignment, ping the person. After one week without a response, the assignee may be changed.
  2. In the comments of this issue, request that it be assigned to you, or just create a pull request after following the steps below. Mention this issue in the description of the pull request.
  3. Fix the issue locally.
  4. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  5. Create a Git commit.
  6. Run git clang-format HEAD~1 to format your changes.
  7. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation. Mention this issue in the description of the pull request.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot (Member) commented Feb 23, 2025

@llvm/issue-subscribers-good-first-issue


@ethan0150

I'd like to work on this. Could I get assigned?
