
[Inductor] log fusion failure due to index mismatch #124986

Closed
wants to merge 12 commits

Conversation

Contributor

@shunting314 shunting314 commented Apr 26, 2024

Stack from ghstack (oldest at bottom):

The scheduler searches for fusion opportunities by looking for common memory accesses. Two memory accesses are considered common not only when the buffer names match; it also requires that:

  • the index formulas match
  • the var_ranges match

In this PR, I want to log all the fusion failures due to mismatched index formulas or var_ranges. I also want to further categorize the failures. So far I have found the following failure categories:

  • rand_seed: the index for the rand seed access is an integer, and different accesses use different integer offsets
  • different numel: this happens for cat operations
  • broadcast: e.g. kernel A writes a buffer that is broadcast and then read by kernel B
  • different loop orders: the major category we want Inductor to be able to fuse
  • different offset: happens when a concatenated linear layer is used to project Q/K/V and the result is then split; each split points to the same buffer with a different offset (see the sketch after this list)
  • unknown
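For the "different offset" category, here is a minimal PyTorch sketch of the pattern that triggers it; the sizes and names below are made up for illustration and are not taken from any of the tested models:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, only for illustration.
batch, seq, hidden = 2, 128, 64
x = torch.randn(batch, seq, hidden)

# A single concatenated projection for Q/K/V ...
qkv_proj = nn.Linear(hidden, 3 * hidden)
qkv = qkv_proj(x)  # one buffer of shape [batch, seq, 3 * hidden]

# ... followed by a split. q, k and v all read the same underlying buffer,
# just at different constant offsets into the last dimension (0, hidden,
# 2 * hidden). Downstream kernels consuming q, k and v therefore access the
# qkv buffer with index formulas that differ only by that constant offset,
# which is what the "different offset" category records.
q, k, v = qkv.split(hidden, dim=-1)
```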

My hope is that, for the models I tested, no fusion failure falls into the unknown category, so all the failures are well understood and categorized. This is currently true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2 ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 ).

For BertForMaskedLM, we found 82 instances of fusion failures, and the majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatches come from in real models.
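To make the matching criterion above concrete, here is a minimal sympy sketch, not actual Inductor code, of two accesses to the same buffer that land in the "different loop orders" bucket: they touch exactly the same addresses, but the index formulas and var_ranges differ, so the common-memory-access check fails. The sizes and variable names are made up.

```python
import sympy

i0, i1 = sympy.symbols("i0 i1")

# Kernel A walks the buffer with the outer loop over i0 (range 128) and the
# inner loop over i1 (range 64): element offset = i0 * 64 + i1.
index_a = i0 * 64 + i1
var_ranges_a = {i0: 128, i1: 64}

# Kernel B touches exactly the same elements but with the loops swapped:
# outer loop over i0 (range 64), inner loop over i1 (range 128),
# so element offset = i0 + i1 * 64.
index_b = i0 + i1 * 64
var_ranges_b = {i0: 64, i1: 128}

# The index formula and var_ranges differ, so the two accesses do not count
# as a common memory access and the nodes are not fused.
assert (index_a, var_ranges_a) != (index_b, var_ranges_b)
```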

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @jansel @Chillee @eellison



pytorch-bot bot commented Apr 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124986

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit ec3f90a with merge base 68a1f78:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 added a commit that referenced this pull request Apr 26, 2024
Comment on lines 2079 to 2087
for buf_name in common_buf_names:
    lhs_dep = node1_name2dep[buf_name]
    rhs_dep = node2_name2dep[buf_name]

    if lhs_dep.get_numel() != rhs_dep.get_numel():
        reasons[buf_name] = "different numel"
        continue

    # Add more rules here
Contributor

Maybe print out the deps themselves for common bufs ?

Contributor Author

SchedulerNode.debug_str() already includes that.

BTW, this is not ready for review yet since I need to add more rules to make it more useful. lol. But thanks for the stamp :)

Contributor Author

(Sent out for review by mistake..)

Contributor Author

Maybe print out the deps themselves for common bufs ?

I found that sometimes it's more convenient to show the dep in the CSV directly, so people don't need to open the text file with SchedulerNode.debug_str() to look it up. I'll do this where appropriate.

shunting314 added a commit that referenced this pull request Apr 27, 2024
shunting314 added a commit that referenced this pull request Apr 29, 2024
@shunting314 shunting314 added the topic: not user facing label Apr 29, 2024
shunting314 added a commit that referenced this pull request Apr 30, 2024
shunting314 added a commit that referenced this pull request May 1, 2024
@shunting314 shunting314 changed the title from "WIP: [Inductor] log fusion failure due to loop orders" to "WIP: [Inductor] log fusion failure due to index mismatch" May 1, 2024
shunting314 added a commit that referenced this pull request May 3, 2024
@shunting314 shunting314 changed the title from "WIP: [Inductor] log fusion failure due to index mismatch" to "[Inductor] log fusion failure due to index mismatch" May 3, 2024
@shunting314 shunting314 requested review from jansel and Chillee May 3, 2024 20:53
@shunting314
Contributor Author

I'm adding tests, but otherwise the PR is ready to review. cc @jansel , @Chillee . Also cc @eellison if you want to take another look, since I added more functionality to the PR after your last review.

shunting314 added a commit that referenced this pull request May 3, 2024
shunting314 added a commit that referenced this pull request May 3, 2024
shunting314 added a commit that referenced this pull request May 3, 2024
"""
return sympy_subs(self.index, {v: 0 for v in self.var_names})

def normalize_with_stride_order(self, prefix="t"):
Contributor

Does this logic not exist somewhere else?

Contributor Author

I don't think so. ir.ComputedBuffer.simplify_and_reorder is similar, but the major difference is that simplify_and_reorder needs to consider all the indices used in a ComputedBuffer. Also, it does one pass for the iteration dimensions and another pass for the reduction dimensions.

The function here only needs to handle a single index formula, and we do a single pass (over all dimensions) to reorder dimensions and combine suitable adjacent pairs of dimensions.

Contributor Author

Another similar function is dependencies._RecordLoadStoreInner.canonicalization. But that method cannot reorder dimensions, so it's still quite different.

Contributor

I think intuitively, the weird thing to me is that this function is trying to check whether we're unable to fuse because of X. But I presume that during fusion we also check if we're unable to fuse because of X. So why isn't that the same logic?

Contributor Author

The simple answer is the_first_X != the_second_X.

During fusion, we don't care what causes the fusion failure; it can be any of the reasons I listed in the summary. But in this function, we want to check specifically whether the fusion failed due to a loop order mismatch.
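For illustration, here is a minimal sketch in plain sympy of the single-pass idea described above: sort the loop variables of one index formula by the stride they carry, rename them in that order, and two accesses that differ only by loop order become identical. This is not the actual normalize_with_stride_order implementation (in particular it does not merge suitable adjacent dimensions), and the helper name and example formulas are made up.

```python
import sympy


def normalize_by_stride_order(index, var_ranges, prefix="t"):
    # Single pass over all dimensions: order the loop variables by the
    # coefficient (stride) they carry in the index formula, largest first,
    # then rename them prefix0, prefix1, ... in that order.
    strides = {v: index.coeff(v) for v in var_ranges}
    order = sorted(var_ranges, key=lambda v: strides[v], reverse=True)
    renames = {v: sympy.Symbol(f"{prefix}{i}") for i, v in enumerate(order)}
    return index.subs(renames), tuple(var_ranges[v] for v in order)


i0, i1 = sympy.symbols("i0 i1")

# Two accesses to the same buffer that differ only by loop order.
lhs = normalize_by_stride_order(i0 * 64 + i1, {i0: 128, i1: 64})
rhs = normalize_by_stride_order(i0 + i1 * 64, {i0: 64, i1: 128})

# After normalization both the index formulas and the ranges match; a
# mismatch that survives this step is not a pure loop-order difference.
assert lhs == rhs
```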

shunting314 added a commit that referenced this pull request May 6, 2024
shunting314 added a commit that referenced this pull request May 6, 2024
@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label May 6, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
