
InstSimplify: lookthru casts, binops in folding shuffles #92668

Closed
wants to merge 10 commits

Conversation

artagnon
Contributor

foldIdentityShuffles() currently suffers from the deficiency that it cannot look through casts or binary operators: it can only fold sequences of shuffles with no intervening instructions. Fix this deficiency by looking through chains of casts that preserve the number of elements in the vector, and through chains of binary operators with identical operands. We do this by saving the last instruction in the reduction chain as the RootVec, and rewriting the operands of the first instruction in the chain.
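
As a concrete illustration, taken from the @fold_lookthrough_cast test added in this patch, the reversing shuffles around an element-count-preserving zext now cancel: the zext's operand is rewritten from %shuf to %x, and the zext itself is returned as the simplified value.

  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %zext = zext <4 x i32> %shuf to <4 x i64>
  %revshuf = shufflevector <4 x i64> %zext, <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x i64> %revshuf
  ; %revshuf simplifies to %zext, which now reads: zext <4 x i32> %x to <4 x i64>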

This patch also requires increasing the MaxRecursionLimit; the AMDGPU and LoopVectorize test changes are a consequence of that increase.

Future work could simplify the case of binary operators with differing operands, as well as the multi-use case; an example that this patch deliberately leaves unsimplified is shown below.
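
For instance, the different-operand binop pattern below (from the new @fold_lookthrough_binop_different_operands test) is not folded, because the add's operands differ:

  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %add = add <4 x i32> %shuf, %y
  %revshuf = shufflevector <4 x i32> %add, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ; not folded: looking through a binop with unequal operands is not yet supported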

Alive2 proofs follow.

@fold_lookthrough_cast:
https://alive2.llvm.org/ce/z/YFKQGX
@fold_lookthrough_binop_same_operands:
https://alive2.llvm.org/ce/z/gf3WgB
@fold_lookthrough_cast_chain:
https://alive2.llvm.org/ce/z/MT9ApH
@fold_lookthrough_binop_chain:
https://alive2.llvm.org/ce/z/WV8HRn
@fold_lookthrough_cast_binop_chain:
https://alive2.llvm.org/ce/z/CXj-hh

-- 8<--
Based on #92407.

Add examples of patterns that can be simplified, but are currently not.
This patch serves as a pre-commit test for llvm#48960.
@llvmbot
Collaborator

llvmbot commented May 18, 2024

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-amdgpu

Author: Ramkumar Ramachandra (artagnon)

Patch is 34.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92668.diff

5 Files Affected:

  • (modified) llvm/lib/Analysis/InstructionSimplify.cpp (+84-25)
  • (modified) llvm/test/CodeGen/AMDGPU/select-constant-cttz.ll (+3-18)
  • (modified) llvm/test/Transforms/InstSimplify/shufflevector.ll (+225-6)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-inloop.ll (+27-30)
  • (modified) llvm/test/Transforms/LoopVectorize/scev-predicate-reasoning.ll (+17-18)
diff --git a/llvm/lib/Analysis/InstructionSimplify.cpp b/llvm/lib/Analysis/InstructionSimplify.cpp
index 37a7259a5cd02..5f90ce8a02631 100644
--- a/llvm/lib/Analysis/InstructionSimplify.cpp
+++ b/llvm/lib/Analysis/InstructionSimplify.cpp
@@ -48,7 +48,7 @@ using namespace llvm::PatternMatch;
 
 #define DEBUG_TYPE "instsimplify"
 
-enum { RecursionLimit = 3 };
+static unsigned RecursionLimit = 5;
 
 STATISTIC(NumExpand, "Number of expansions");
 STATISTIC(NumReassoc, "Number of reassociations");
@@ -5315,55 +5315,104 @@ Value *llvm::simplifyCastInst(unsigned CastOpc, Value *Op, Type *Ty,
   return ::simplifyCastInst(CastOpc, Op, Ty, Q, RecursionLimit);
 }
 
+using ReplacementTy = std::optional<std::pair<Value *, Value *>>;
+
 /// For the given destination element of a shuffle, peek through shuffles to
 /// match a root vector source operand that contains that element in the same
 /// vector lane (ie, the same mask index), so we can eliminate the shuffle(s).
-static Value *foldIdentityShuffles(int DestElt, Value *Op0, Value *Op1,
-                                   int MaskVal, Value *RootVec,
-                                   unsigned MaxRecurse) {
+static std::pair<Value *, ReplacementTy>
+foldIdentityShuffles(int DestElt, Value *Op0, Value *Op1, int MaskVal,
+                     Value *RootVec, unsigned MaxRecurse) {
   if (!MaxRecurse--)
-    return nullptr;
+    return {nullptr, std::nullopt};
 
   // Bail out if any mask value is undefined. That kind of shuffle may be
   // simplified further based on demanded bits or other folds.
   if (MaskVal == -1)
-    return nullptr;
+    return {nullptr, std::nullopt};
 
   // The mask value chooses which source operand we need to look at next.
-  int InVecNumElts = cast<FixedVectorType>(Op0->getType())->getNumElements();
+  unsigned InVecNumElts =
+      cast<FixedVectorType>(Op0->getType())->getNumElements();
   int RootElt = MaskVal;
   Value *SourceOp = Op0;
-  if (MaskVal >= InVecNumElts) {
+  if (MaskVal > -1 && static_cast<unsigned>(MaskVal) >= InVecNumElts) {
     RootElt = MaskVal - InVecNumElts;
     SourceOp = Op1;
   }
 
+  // The next RootVec preserves an existing RootVec for casts and binops, so that
+  // the final RootVec is the last cast or binop in the chain.
+  Value *NextRootVec = RootVec ? RootVec : SourceOp;
+
+  // Look through a cast instruction that preserves number of elements in the
+  // vector. Set the RootVec to the cast, the SourceOp to the operand and
+  // recurse. If, in a later stack frame, an appropriate ShuffleVector is
+  // matched, the example will reduce.
+  if (auto *SourceCast = dyn_cast<CastInst>(SourceOp))
+    if (auto *CastTy = dyn_cast<FixedVectorType>(SourceCast->getSrcTy()))
+      if (CastTy->getNumElements() == InVecNumElts)
+        return foldIdentityShuffles(DestElt, SourceCast->getOperand(0), Op1,
+                                    MaskVal, NextRootVec, MaxRecurse);
+
+  // Look through a binary operator, with two identical operands. Set the
+  // RootVec to the binop, the SourceOp to the operand, and recurse. If, in a
+  // later stack frame, an appropriate ShuffleVector is matched, the example
+  // will reduce.
+  Value *BinOpLHS = nullptr, *BinOpRHS = nullptr;
+  if (match(SourceOp, m_BinOp(m_Value(BinOpLHS), m_Value(BinOpRHS))) &&
+      BinOpLHS == BinOpRHS)
+    return foldIdentityShuffles(DestElt, BinOpLHS, Op1, MaskVal, NextRootVec,
+                                MaxRecurse);
+
   // If the source operand is a shuffle itself, look through it to find the
   // matching root vector.
   if (auto *SourceShuf = dyn_cast<ShuffleVectorInst>(SourceOp)) {
+    // Here, we use RootVec, because there is no requirement for finding the
+    // last shuffle in a chain. In fact, the zeroth operand of the first shuffle
+    // in the chain will be used as the RootVec for folding.
     return foldIdentityShuffles(
         DestElt, SourceShuf->getOperand(0), SourceShuf->getOperand(1),
         SourceShuf->getMaskValue(RootElt), RootVec, MaxRecurse);
   }
 
-  // TODO: Look through bitcasts? What if the bitcast changes the vector element
-  // size?
-
-  // The source operand is not a shuffle. Initialize the root vector value for
-  // this shuffle if that has not been done yet.
-  if (!RootVec)
-    RootVec = SourceOp;
-
-  // Give up as soon as a source operand does not match the existing root value.
-  if (RootVec != SourceOp)
-    return nullptr;
-
   // The element must be coming from the same lane in the source vector
   // (although it may have crossed lanes in intermediate shuffles).
   if (RootElt != DestElt)
-    return nullptr;
+    return {nullptr, std::nullopt};
+
+  // If NextRootVec is equal to SourceOp, no replacements are required. Just
+  // return NextRootVec as the leaf value of the recursion.
+  if (NextRootVec == SourceOp)
+    return {NextRootVec, std::nullopt};
+
+  // We again have to match the condition for a vector-num-element-preserving
+  // cast or binop with equal operands, as we are not assured of the recursion
+  // happening from the call after the previous match. The RootVec was set to
+  // the last cast or binop in the chain.
+  if ((isa<CastInst>(RootVec) &&
+       isa<FixedVectorType>(cast<CastInst>(RootVec)->getSrcTy()) &&
+       cast<FixedVectorType>(cast<CastInst>(RootVec)->getSrcTy())
+               ->getNumElements() == InVecNumElts) ||
+      (match(RootVec, m_BinOp(m_Value(BinOpLHS), m_Value(BinOpRHS))) &&
+       BinOpLHS == BinOpRHS))
+  // ReplacementSrc should be the User of the first cast or binop in the
+    // chain. SourceOp is the reduced value, which should replace
+    // ReplacementSrc.
+    if (auto *ReplacementSrc = SourceOp->getUniqueUndroppableUser()) {
+      // If ReplacementSrc equals RootVec, it means that we didn't recurse after
+      // matching a cast or binop, and we should terminate the recursion and
+      // return the leaf value here.
+      if (ReplacementSrc == RootVec)
+        return {RootVec, std::nullopt};
+      // The User of ReplacementSrc is the first cast or binop in the chain.
+      // There could be other Users, but we constrain it to a unique User, since
+      // we perform RAUW later.
+      if (ReplacementSrc->hasOneUser())
+        return {RootVec, std::make_pair(ReplacementSrc, SourceOp)};
+    }
 
-  return RootVec;
+  return {nullptr, std::nullopt};
 }
 
 static Value *simplifyShuffleVectorInst(Value *Op0, Value *Op1,
@@ -5468,16 +5517,26 @@ static Value *simplifyShuffleVectorInst(Value *Op0, Value *Op1,
   // shuffle. This handles simple identity shuffles as well as chains of
   // shuffles that may widen/narrow and/or move elements across lanes and back.
   Value *RootVec = nullptr;
-  for (unsigned i = 0; i != MaskNumElts; ++i) {
+  ReplacementTy ReplaceInRootVec;
+  for (unsigned Idx = 0; Idx != MaskNumElts; ++Idx) {
     // Note that recursion is limited for each vector element, so if any element
     // exceeds the limit, this will fail to simplify.
-    RootVec =
-        foldIdentityShuffles(i, Op0, Op1, Indices[i], RootVec, MaxRecurse);
+    std::pair<Value *, ReplacementTy> Res =
+        foldIdentityShuffles(Idx, Op0, Op1, Indices[Idx], RootVec, MaxRecurse);
+    RootVec = Res.first;
+    ReplaceInRootVec = Res.second;
 
     // We can't replace a widening/narrowing shuffle with one of its operands.
     if (!RootVec || RootVec->getType() != RetTy)
       return nullptr;
   }
+
+  // Apply any replacements in RootVec that are applicable in case we're looking
+  // through a cast or binop.
+  if (ReplaceInRootVec) {
+    auto [ReplacementSrc, ReplacementDst] = *ReplaceInRootVec;
+    ReplacementSrc->replaceAllUsesWith(ReplacementDst);
+  }
   return RootVec;
 }
 
diff --git a/llvm/test/CodeGen/AMDGPU/select-constant-cttz.ll b/llvm/test/CodeGen/AMDGPU/select-constant-cttz.ll
index d807c3909e656..5f36531badbf1 100644
--- a/llvm/test/CodeGen/AMDGPU/select-constant-cttz.ll
+++ b/llvm/test/CodeGen/AMDGPU/select-constant-cttz.ll
@@ -6,26 +6,11 @@ declare i32 @llvm.amdgcn.sffbh.i32(i32) nounwind readnone speculatable
 define amdgpu_kernel void @select_constant_cttz(ptr addrspace(1) noalias %out, ptr addrspace(1) nocapture readonly %arrayidx) nounwind {
 ; GCN-LABEL: select_constant_cttz:
 ; GCN:       ; %bb.0:
-; GCN-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x9
-; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    s_load_dword s2, s[2:3], 0x0
+; GCN-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x9
 ; GCN-NEXT:    s_mov_b32 s3, 0xf000
-; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    s_lshr_b32 s4, 1, s2
-; GCN-NEXT:    s_cmp_lg_u32 s2, 0
-; GCN-NEXT:    s_ff1_i32_b32 s2, s4
-; GCN-NEXT:    s_cselect_b64 s[4:5], -1, 0
-; GCN-NEXT:    s_and_b64 s[6:7], s[4:5], exec
-; GCN-NEXT:    s_cselect_b32 s2, -1, s2
-; GCN-NEXT:    s_flbit_i32 s6, s2
-; GCN-NEXT:    s_sub_i32 s8, 31, s6
-; GCN-NEXT:    s_cmp_eq_u32 s2, 0
-; GCN-NEXT:    s_cselect_b64 s[6:7], -1, 0
-; GCN-NEXT:    s_or_b64 s[4:5], s[4:5], s[6:7]
-; GCN-NEXT:    s_and_b64 s[4:5], s[4:5], exec
-; GCN-NEXT:    s_cselect_b32 s4, -1, s8
 ; GCN-NEXT:    s_mov_b32 s2, -1
-; GCN-NEXT:    v_mov_b32_e32 v0, s4
+; GCN-NEXT:    v_mov_b32_e32 v0, -1
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 ; GCN-NEXT:    buffer_store_dword v0, off, s[0:3], 0
 ; GCN-NEXT:    s_endpgm
   %v    = load i32, ptr addrspace(1) %arrayidx, align 4
diff --git a/llvm/test/Transforms/InstSimplify/shufflevector.ll b/llvm/test/Transforms/InstSimplify/shufflevector.ll
index 460e90aa31d91..e17e69a617651 100644
--- a/llvm/test/Transforms/InstSimplify/shufflevector.ll
+++ b/llvm/test/Transforms/InstSimplify/shufflevector.ll
@@ -249,13 +249,13 @@ define <8 x i64> @PR30630(<8 x i64> %x) {
 ;         ret <2 x float> zeroinitializer
 define <2 x float> @PR32872(<2 x float> %x) {
 ; CHECK-LABEL: @PR32872(
-; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <2 x float> [[X:%.*]], <2 x float> zeroinitializer, <4 x i32> <i32 2, i32 2, i32 0, i32 1>
-; CHECK-NEXT:    [[TMP4:%.*]] = shufflevector <4 x float> zeroinitializer, <4 x float> [[TMP1]], <2 x i32> <i32 4, i32 5>
-; CHECK-NEXT:    ret <2 x float> [[TMP4]]
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <2 x float> [[X:%.*]], <2 x float> zeroinitializer, <4 x i32> <i32 2, i32 2, i32 0, i32 1>
+; CHECK-NEXT:    [[SHUF2:%.*]] = shufflevector <4 x float> zeroinitializer, <4 x float> [[SHUF]], <2 x i32> <i32 4, i32 5>
+; CHECK-NEXT:    ret <2 x float> [[SHUF2]]
 ;
-  %tmp1 = shufflevector <2 x float> %x, <2 x float> zeroinitializer, <4 x i32> <i32 2, i32 2, i32 0, i32 1>
-  %tmp4 = shufflevector <4 x float> zeroinitializer, <4 x float> %tmp1, <2 x i32> <i32 4, i32 5>
-  ret <2 x float> %tmp4
+  %shuf = shufflevector <2 x float> %x, <2 x float> zeroinitializer, <4 x i32> <i32 2, i32 2, i32 0, i32 1>
+  %shuf2 = shufflevector <4 x float> zeroinitializer, <4 x float> %shuf, <2 x i32> <i32 4, i32 5>
+  ret <2 x float> %shuf2
 }
 
 define <5 x i8> @splat_inserted_constant(<4 x i8> %x) {
@@ -284,3 +284,222 @@ define <2 x i8> @splat_inserted_constant_not_canonical(<3 x i8> %x, <3 x i8> %y)
   %splat2 = shufflevector <3 x i8> %y, <3 x i8> %ins2, <2 x i32> <i32 undef, i32 5>
   ret <2 x i8> %splat2
 }
+
+define <4 x i32> @fold_identity(<4 x i32> %x) {
+; CHECK-LABEL: @fold_identity(
+; CHECK-NEXT:    ret <4 x i32> [[X:%.*]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %revshuf = shufflevector <4 x i32> %shuf, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i32> @fold_identity2(<4 x i32> %x) {
+; CHECK-LABEL: @fold_identity2(
+; CHECK-NEXT:    [[SHL:%.*]] = shl <4 x i32> [[X:%.*]], <i32 1, i32 1, i32 1, i32 1>
+; CHECK-NEXT:    ret <4 x i32> [[SHL]]
+;
+  %shl = shl <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
+  %shuf = shufflevector <4 x i32> %shl, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %revshuf = shufflevector <4 x i32> %shuf, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i32> @fold_identity3(<4 x i32> %x) {
+; CHECK-LABEL: @fold_identity3(
+; CHECK-NEXT:    [[SHL:%.*]] = shl <4 x i32> [[X:%.*]], [[X]]
+; CHECK-NEXT:    ret <4 x i32> [[SHL]]
+;
+  %shl = shl <4 x i32> %x, %x
+  %shuf = shufflevector <4 x i32> %shl, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %revshuf = shufflevector <4 x i32> %shuf, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i32> @not_fold_identity(<4 x i32> %x) {
+; CHECK-LABEL: @not_fold_identity(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
+; CHECK-NEXT:    [[REVSHUF:%.*]] = shufflevector <4 x i32> [[SHUF]], <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
+; CHECK-NEXT:    ret <4 x i32> [[REVSHUF]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
+  %revshuf = shufflevector <4 x i32> %shuf, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i32> @not_fold_identity2(<4 x i32> %x) {
+; CHECK-LABEL: @not_fold_identity2(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 2, i32 3, i32 1, i32 0>
+; CHECK-NEXT:    [[REVSHUF:%.*]] = shufflevector <4 x i32> [[SHUF]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    ret <4 x i32> [[REVSHUF]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 2, i32 3, i32 1, i32 0>
+  %revshuf = shufflevector <4 x i32> %shuf, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i64> @fold_lookthrough_cast(<4 x i32> %x) {
+; CHECK-LABEL: @fold_lookthrough_cast(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ZEXT:%.*]] = zext <4 x i32> [[X]] to <4 x i64>
+; CHECK-NEXT:    ret <4 x i64> [[ZEXT]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %zext = zext <4 x i32> %shuf to <4 x i64>
+  %revshuf = shufflevector <4 x i64> %zext, <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i64> %revshuf
+}
+
+define <4 x i64> @not_fold_lookthrough_cast(<4 x i32> %x) {
+; CHECK-LABEL: @not_fold_lookthrough_cast(
+; CHECK-NEXT:    [[ZEXT:%.*]] = zext <4 x i32> [[X:%.*]] to <4 x i64>
+; CHECK-NEXT:    [[REVSHUF:%.*]] = shufflevector <4 x i64> [[ZEXT]], <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    ret <4 x i64> [[REVSHUF]]
+;
+  %zext = zext <4 x i32> %x to <4 x i64>
+  %revshuf = shufflevector <4 x i64> %zext, <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i64> %revshuf
+}
+
+define <4 x i64> @not_fold_lookthrough_cast2(<4 x i32> %x) {
+; CHECK-LABEL: @not_fold_lookthrough_cast2(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ZEXT:%.*]] = zext <4 x i32> [[SHUF]] to <4 x i64>
+; CHECK-NEXT:    ret <4 x i64> [[ZEXT]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %zext = zext <4 x i32> %shuf to <4 x i64>
+  ret <4 x i64> %zext
+}
+
+define i32 @not_fold_lookthrough_bitcast(<4 x i8> %x) {
+; CHECK-LABEL: @not_fold_lookthrough_bitcast(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i8> [[X:%.*]], <4 x i8> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[BITCAST:%.*]] = bitcast <4 x i8> [[SHUF]] to i32
+; CHECK-NEXT:    ret i32 [[BITCAST]]
+;
+  %shuf = shufflevector <4 x i8> %x, <4 x i8> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %bitcast = bitcast <4 x i8> %shuf to i32
+  ret i32 %bitcast
+}
+
+define <8 x i16> @not_fold_lookthrough_bitcast2(<4 x i32> %x, <8 x i16> %y) {
+; CHECK-LABEL: @not_fold_lookthrough_bitcast2(
+; CHECK-NEXT:    [[CAST:%.*]] = bitcast <4 x i32> [[X:%.*]] to <8 x i16>
+; CHECK-NEXT:    [[OUT:%.*]] = shufflevector <8 x i16> [[Y:%.*]], <8 x i16> [[CAST]], <8 x i32> <i32 8, i32 1, i32 10, i32 3, i32 12, i32 5, i32 14, i32 7>
+; CHECK-NEXT:    ret <8 x i16> [[OUT]]
+;
+  %cast = bitcast <4 x i32> %x to <8 x i16>
+  %out = shufflevector <8 x i16> %y, <8 x i16> %cast, <8 x i32> <i32 8, i32 1, i32 10, i32 3, i32 12, i32 5, i32 14, i32 7>
+  ret <8 x i16> %out
+}
+
+define <4 x i32> @fold_lookthrough_binop_same_operands(<4 x i32> %x) {
+; CHECK-LABEL: @fold_lookthrough_binop_same_operands(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ADD:%.*]] = add <4 x i32> [[X]], [[X]]
+; CHECK-NEXT:    ret <4 x i32> [[ADD]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %add = add <4 x i32> %shuf, %shuf
+  %revshuf = shufflevector <4 x i32> %add, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i32> @fold_lookthrough_binop_different_operands(<4 x i32> %x, <4 x i32> %y) {
+; CHECK-LABEL: @fold_lookthrough_binop_different_operands(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ADD:%.*]] = add <4 x i32> [[SHUF]], [[Y:%.*]]
+; CHECK-NEXT:    [[REVSHUF:%.*]] = shufflevector <4 x i32> [[ADD]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    ret <4 x i32> [[REVSHUF]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %add = add <4 x i32> %shuf, %y
+  %revshuf = shufflevector <4 x i32> %add, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i32> @fold_lookthrough_binop_multiuse(<4 x i32> %x) {
+; CHECK-LABEL: @fold_lookthrough_binop_multiuse(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ADD:%.*]] = add <4 x i32> [[SHUF]], [[SHUF]]
+; CHECK-NEXT:    [[REVSHUF:%.*]] = shufflevector <4 x i32> [[ADD]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ADD2:%.*]] = add <4 x i32> [[SHUF]], [[REVSHUF]]
+; CHECK-NEXT:    ret <4 x i32> [[ADD2]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %add = add <4 x i32> %shuf, %shuf
+  %revshuf = shufflevector <4 x i32> %add, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %add2 = add <4 x i32> %shuf, %revshuf
+  ret <4 x i32> %add2
+}
+
+define <4 x i64> @fold_lookthrough_cast_chain(<4 x i16> %x) {
+; CHECK-LABEL: @fold_lookthrough_cast_chain(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i16> [[X:%.*]], <4 x i16> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ZEXT:%.*]] = zext <4 x i16> [[X]] to <4 x i32>
+; CHECK-NEXT:    [[SEXT:%.*]] = sext <4 x i32> [[ZEXT]] to <4 x i64>
+; CHECK-NEXT:    ret <4 x i64> [[SEXT]]
+;
+  %shuf = shufflevector <4 x i16> %x, <4 x i16> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %zext = zext <4 x i16> %shuf to <4 x i32>
+  %sext = sext <4 x i32> %zext to <4 x i64>
+  %revshuf = shufflevector <4 x i64> %sext, <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i64> %revshuf
+}
+
+define <4 x i32> @fold_lookthrough_binop_chain(<4 x i32> %x) {
+; CHECK-LABEL: @fold_lookthrough_binop_chain(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ADD:%.*]] = add <4 x i32> [[X]], [[X]]
+; CHECK-NEXT:    [[ADD2:%.*]] = add <4 x i32> [[ADD]], [[ADD]]
+; CHECK-NEXT:    ret <4 x i32> [[ADD2]]
+;
+  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  %add = add <4 x i32> %shuf, %shuf
+  %add2 = add <4 x i32> %add, %add
+  %revshuf = shufflevector <4 x i32> %add2, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x i32> %revshuf
+}
+
+define <4 x i64> @fold_lookthrough_cast_binop_chain(<4 x i32> %x) {
+; CHECK-LABEL: @fold_lookthrough_cast_binop_chain(
+; CHECK-NEXT:    [[SHUF:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    [[ZEXT:%.*]] = zext <4 x i32> [[X]] to <4 x i64>
+; CHECK-NEXT:    [[ADD:%.*]] = add <4 x i64> [[ZE...
[truncated]

@llvmbot
Collaborator

llvmbot commented May 18, 2024

@llvm/pr-subscribers-llvm-transforms


-; GCN-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x9
-; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    s_load_dword s2, s[2:3], 0x0
+; GCN-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x9
Contributor

This looks like the point of the test was defeated, so we should do something to keep the codegen similar.

Contributor Author

Not sure what we can do about this. Want to take a stab at making sure the test doesn't get reduced? Sorry, I'm not familiar with AMDGPU.

Contributor

The AMDGPU part doesn't matter, just make sure whatever simplify now happens, still doesn't happen. It's super frustrating that InstSimplify ends up breaking codegen tests like this.

Is this just from the recursive depth increase?

Contributor Author

Yeah, it's due to the increase in recursion depth, but we can limit it to the shuffle, as @dtcxzyw suggested.

@artagnon
Contributor Author

Why is the CI failing? All tests pass, so is it some timing issue?

@davemgreen (Collaborator) left a comment

Hi - I recently added something very similar in VectorCombine under #88693. I deliberately added it to vector combine as I thought trying the combine many times through the pipeline was best avoided. I didn't think we had things like that in instsimplify.

I have some patches to extend the current functionality, and it should be able to handle this case if I understand it correctly. I have been holding off on additions until the comments from a4d1026 can be worked out. I can try to put up the zext/sext/trunc patch I have; it should already handle intrinsics and binops/other unary ops.

Why is the CI failing? All tests pass, so is it some timing issue?

That looks like ccache failed - it sometimes does that.

@artagnon
Contributor Author

I had a look at #88693, and the approach seems a lot more complicated than the one in this patch. I think there's a clear separation of concerns though: InstCombine (and by extension, InstSimplify) is equipped to handle only simple cases, performing trivial removal and rewriting, where the transform is obviously profitable, while VectorCombine is geared toward more complicated transforms possibly involving intrinsics, rewriting the entire tree of instructions, and using a CostModel.

For instance, the case of binary operators with unequal operands, and of intrinsics, can be handled in VectorCombine, while the simple case of just looking through some intermediate instructions between canceling shuffles, and eliminating the shuffles, can be done in InstSimplify.

Besides, there was already code to fold canceling shuffles in InstSimplify, and there are several existing tests in InstSimplify/InstCombine that exercise this. I merely extended it a little bit.

What do you think?

@@ -48,7 +48,7 @@ using namespace llvm::PatternMatch;

#define DEBUG_TYPE "instsimplify"

-enum { RecursionLimit = 3 };
+static unsigned RecursionLimit = 5;
Member

Suggested change:
-static unsigned RecursionLimit = 5;
+static constexpr unsigned RecursionLimit = 5;

@nikic Can you test the compile-time impact?

TBH I don't like increasing the recursion limit for all root instructions.
If necessary, it would be better to only increase the limit for shufflevector:

/// Given operands for a ShuffleVectorInst, fold the result or return null.
Value *llvm::simplifyShuffleVectorInst(Value *Op0, Value *Op1,
                                       ArrayRef<int> Mask, Type *RetTy,
                                       const SimplifyQuery &Q) {
  static constexpr unsigned ShuffleRecursionLimit = 5;
  return ::simplifyShuffleVectorInst(Op0, Op1, Mask, RetTy, Q, ShuffleRecursionLimit);
}
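
(With a shuffle-only limit along these lines, the generic entry points would keep the old budget of 3, and only shufflevector simplification would get the deeper recursion; presumably that would also leave unrelated folds, such as the AMDGPU codegen test above, undisturbed.)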

Contributor

I'm surprised there's a threshold here that isn't 6

Contributor

// The User of ReplacementSrc is the first cast or binop in the chain.
// There could be other Users, but we constrain it to a unique User, since
// we perform RAUW later.
if (ReplacementSrc->hasOneUser())
Contributor

Checking for users in InstSimplify feels weird, can you check for hasOneUse on the input instead?
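
For context, the pattern this check guards against is the new @fold_lookthrough_binop_multiuse test: there, %shuf (the would-be ReplacementSrc) has two users, %add and %add2, so hasOneUser() fails and the fold bails out rather than rewriting a value that is still needed elsewhere.

  %shuf = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %add = add <4 x i32> %shuf, %shuf
  %revshuf = shufflevector <4 x i32> %add, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %add2 = add <4 x i32> %shuf, %revshuf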


@nikic (Contributor) left a comment

You cannot RAUW inside InstSimplify. InstSimplify is an analysis, it cannot change IR in any way.

@artagnon
Contributor Author

You cannot RAUW inside InstSimplify. InstSimplify is an analysis, it cannot change IR in any way.

Oh no. Looks like I have to clone the entire chain of instructions, make the replacement in the new head instruction, and return the new tail? Is InstSimplify poorly-suited for this kind of change, and should I just leave it up to VectorCombine?

@arsenm
Contributor

arsenm commented May 20, 2024

You cannot RAUW inside InstSimplify. InstSimplify is an analysis, it cannot change IR in any way.

Oh no. Looks like I have to clone the entire chain of instructions, make the replacement in the new head instruction, and return the new tail? Is InstSimplify poorly-suited for this kind of change, and should I just leave it up to VectorCombine?

You shouldn't be creating new instructions in instsimplify. It's for returning operands or existing instructions

@artagnon
Contributor Author

You shouldn't be creating new instructions in instsimplify. It's for returning operands or existing instructions

Okay, abandoning this PR, as the task is impossible to do within instsimplify. I'll move the tests to VectorCombine, and work with @davemgreen on his code.

@artagnon artagnon closed this May 20, 2024
@artagnon artagnon deleted the is-shuffle-lookthru branch May 20, 2024 13:42
@nikic
Contributor

nikic commented May 20, 2024

@artagnon See https://llvm.org/docs/InstCombineContributorGuide.html#pick-the-correct-optimization-pass for basic guidance on which pass to put things. Putting this in InstCombine may be fine -- but if VectorCombine support is already on the way, no point in doing that.
