Add a checking mechanism for the need of Renormalize pass in Parse pipeline

DONE
RFC
Created on 2020-04-07 14:40

name: RFC
about: New enhancement: Add a checking mechanism for the need of Renormalize pass in Parse pipeline
labels: kind/enhancement

Background

  • During the Parse pipeline, some passes and/or groups of passes are run multiple times; the Renormalize pass is one of them. After the 1st Renormalize pass the whole computation graph is inferred, but later optimization passes may still change it. Depending on the actual input graph, however, many passes make no actual changes to the graph, and others do modify it but the newly generated nodes have their type inferred/inherited from the original nodes. In those cases the Renormalize pass does not need to be called again, which saves processing time.
  • This RFC discusses a mechanism for checking whether a Renormalize pass is needed after a group of sub-passes has run.

Introduction

During the Parse pipeline, many functional/optimization passes are run. They are grouped into so-called "group passes". For example, there are 4 groups named opt_a, opt_b, opt_control and opt_prepare (please see pass.cc). Each group contains sub-groups, sub-groups contain sub-passes, and so on. Each sub-pass checks the actual graph parsed from the input code and may modify it. In order to ensure that the graph is fully inferred (all nodes have a correctly deduced type), the Renormalize pass is called multiple times during parsing (a minimum of 4 times; test_bert_train calls it 6 times, as seen in the following picture).
(image)
We have observed that, depending on the actual input graph, many passes make no actual changes to the graph, and others do modify it but the newly generated nodes have their type inferred/inherited from the original nodes. In those cases the Renormalize pass does not need to be called again, which saves processing time.
The idea of this solution is to add a mechanism that collects all newly created (or modified) nodes from the sub-passes; when a Renormalize pass is called, it is skipped if the collection is empty. Because of the organization of group passes in the Parse pipeline, we plan to pilot this mechanism at the level of group passes.
Further investigation shows that:

  • In a group of passes, all the sub-passes are called with the same Optimizer object.
  • The two groups opt_b and opt_control (at this moment) consist of type-preserving sub-passes, i.e. if the input graph was fully inferred, then the newly created nodes have their types correctly inferred too. So we start testing the design with those two groups of passes.
  • Group opt_a is not included in the test yet because opt_a is called 1st, when the input graph is not yet inferred, so the 1st full Renormalize is always needed. opt_a also contains the grad sub-pass, which may totally change the graph, so the whole new graph would need to be inferred.

Our design

  • Enable a flag for each group of passes to switch the checking option on/off.
  • Create a shared collection of untyped nodes among the passes.
  • If the checking is on, each sub-pass in the group updates the shared collection with the newly generated nodes from its run.
  • When the Renormalize pass of the group is called, it first checks the shared collection; if the collection is empty, this call of the Renormalize pass is skipped (a minimal sketch follows after this list).
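Below is a minimal, self-contained C++ sketch of this flow. The names (UntypedNodeCollector, OptPassGroup, SubPass) are hypothetical and do not correspond to the actual MindSpore classes; the sketch only illustrates how a per-group flag plus a shared collection lets a Renormalize call be skipped.

```cpp
// Hypothetical sketch of the per-group checking mechanism (not MindSpore code).
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

using NodeId = std::string;  // stand-in for a node pointer

// Shared collection of nodes whose type is not (yet) inferred.
struct UntypedNodeCollector {
  std::unordered_set<NodeId> untyped_nodes;
  void Report(const NodeId &node) { untyped_nodes.insert(node); }
  bool Empty() const { return untyped_nodes.empty(); }
  void Clear() { untyped_nodes.clear(); }
};

// Each sub-pass receives the shared collector and reports any node it created
// without an inferred type.
using SubPass = std::function<void(UntypedNodeCollector *)>;

struct OptPassGroup {
  std::string name;
  bool renorm_check_enabled = false;  // the per-group switch described above
  std::vector<SubPass> sub_passes;

  void Run(UntypedNodeCollector *collector) {
    collector->Clear();
    for (auto &pass : sub_passes) {
      pass(collector);
    }
    // Renormalize is skipped when checking is on and nothing untyped was reported.
    if (renorm_check_enabled && collector->Empty()) {
      std::cout << name << ": Renormalize skipped\n";
    } else {
      std::cout << name << ": Renormalize executed\n";
    }
  }
};

int main() {
  UntypedNodeCollector collector;
  OptPassGroup opt_b{"opt_b", true,
                     {[](UntypedNodeCollector *) { /* type-preserving: nothing to report */ }}};
  OptPassGroup opt_x{"opt_x", true,
                     {[](UntypedNodeCollector *c) { c->Report("new_node_1"); }}};
  opt_b.Run(&collector);  // prints "opt_b: Renormalize skipped"
  opt_x.Run(&collector);  // prints "opt_x: Renormalize executed"
  return 0;
}
```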

Our implementation

  • We have tested the idea with the opt_b and opt_control groups. A flag is set with the group definition as:

(image)

  • We observed that these two groups consist of the "cse" sub-pass (which is type-preserving) and other sub-passes based on the MakeSubstitution method, so we added the reporting mechanism to that method, which covers all the sub-passes we need. The flag setting from the previous step ensures that the performance of the other groups of passes is not affected (a sketch of this reporting point follows below).
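The following is a hedged sketch of the reporting idea hooked into a MakeSubstitution-style wrapper. Node, Optimizer, Transform and MakeSubstitutionSketch are simplified stand-ins rather than the real MindSpore types; only the single reporting point for newly created untyped nodes is illustrated.

```cpp
// Hypothetical sketch of reporting from a MakeSubstitution-style wrapper.
#include <functional>
#include <iostream>
#include <memory>
#include <unordered_set>

struct Node {
  bool has_abstract = false;  // whether the type/abstract is already inferred
};
using NodePtr = std::shared_ptr<Node>;

struct Optimizer {
  bool watch_renorm = false;                      // per-group checking flag
  std::unordered_set<NodePtr> untyped_new_nodes;  // shared collection
};

using Transform = std::function<NodePtr(Optimizer *, const NodePtr &)>;

// Wrap a transform so that every replacement node still lacking an inferred
// type is reported to the optimizer in one single place, for all wrapped passes.
Transform MakeSubstitutionSketch(Transform transform) {
  return [transform](Optimizer *opt, const NodePtr &node) -> NodePtr {
    NodePtr result = transform(opt, node);
    if (result != nullptr && opt->watch_renorm && !result->has_abstract) {
      opt->untyped_new_nodes.insert(result);  // reported here, not in each pass
    }
    return result;
  };
}

int main() {
  Optimizer opt;
  opt.watch_renorm = true;
  // A toy substitution that always produces a new (still untyped) node.
  Transform sub = MakeSubstitutionSketch(
      [](Optimizer *, const NodePtr &) { return std::make_shared<Node>(); });
  sub(&opt, std::make_shared<Node>());
  std::cout << "untyped new nodes: " << opt.untyped_new_nodes.size() << "\n";  // 1
  return 0;
}
```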

Test results

  • The implementation was tested with the network tests available in the mindspore repo at this moment. All the tests passed. As the measure for comparison, we show below the percentage of total (parsing) time taken by opt_b and opt_control. The tests are:

    1. BERT: located at ./tests/perf_test/bert/test_bert_train.py
    2. RESNET train: located at ./tests/perf_test/test_resnet_train.py
    3. RESNET infer: located at ./tests/perf_test/test_resnet_infer.py
    4. LeNet: located at ./tests/ut/python/model/test_lenet.py
  • For BERT: Originally, opt_b.renormalize takes about 2.95% of the total (parsing) time and opt_control.renormalize takes 1.42%, for a total of 4.37%. With the changes in this RFC, the total time drops significantly to almost 0%.
    (image)

  • For the RESNET train test: Originally, the opt_b and opt_control passes use 5.36% of the total time; with this RFC, the total time drops to 1.98%.
    (image)

  • For the RESNET infer test: Originally, the time for opt_b (there is no opt_control in this case) was 0.66%. With this RFC, the time drops to almost 0%.
    (image)

  • For the LeNet network: Originally, the time for the opt_b and opt_control passes was 3.44%. With this RFC, the time again drops to almost 0%.
    (image)

Notes for further development:

  • We plan to implement a so-called "Partial Renormalize" pass, in which only the new nodes without types are inferred; the shared collection of nodes will be used to provide these nodes (see the sketch after this list). If only a full Renormalize is available, then a simple flag signalling that the shared collection is not empty would be enough to activate or skip this pass.
  • The opt_a group of passes is worth considering. The organization of this group is currently as shown below:
    (image)
    It contains a very specific "grad" pass, which would totally change the graph. Also, at the 1st run the input graph is not yet inferred. But after the 1st run the graph is inferred, and the mechanism may be applied effectively. (The opt_a group is also called multiple times; as shown in the picture above, 3 times for test_bert_train.)
  • It would be great to have all the passes sharing the same mechanism of reporting, which in turn would help the Optimizer or the Parser to better decide when to call the Renormalize.
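As a rough illustration of the "Partial Renormalize" idea from the first note above, the sketch below re-infers only the nodes collected as untyped (and their untyped users) instead of the whole graph. Node, InferOne and PartialRenormalize are illustrative assumptions, not existing MindSpore APIs.

```cpp
// Hypothetical sketch of a "Partial Renormalize" driven by the shared collection.
#include <iostream>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

struct Node {
  std::string name;
  bool has_abstract = false;
  std::vector<std::shared_ptr<Node>> users;  // nodes consuming this node
};
using NodePtr = std::shared_ptr<Node>;

// Stand-in for per-node type inference.
void InferOne(const NodePtr &node) { node->has_abstract = true; }

// Infer only the reported untyped nodes, propagating to their untyped users.
void PartialRenormalize(const std::unordered_set<NodePtr> &untyped_nodes) {
  if (untyped_nodes.empty()) {
    return;  // same effect as skipping the full Renormalize pass
  }
  std::vector<NodePtr> worklist(untyped_nodes.begin(), untyped_nodes.end());
  while (!worklist.empty()) {
    NodePtr node = worklist.back();
    worklist.pop_back();
    if (node->has_abstract) continue;
    InferOne(node);
    for (const auto &user : node->users) {
      if (!user->has_abstract) worklist.push_back(user);
    }
  }
}

int main() {
  auto a = std::make_shared<Node>(Node{"a", false, {}});
  auto b = std::make_shared<Node>(Node{"b", false, {}});
  a->users.push_back(b);
  PartialRenormalize({a});
  std::cout << std::boolalpha << a->has_abstract << " " << b->has_abstract << "\n";  // true true
  return 0;
}
```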

Trail

| No. | Task Description | Related Issue(URL) |
| --- | --- | --- |
| 1 | | |
| 2 | | |

Comments (15)




Please give your questions or opinions. @kungsen @weiwei-w @xychow

The related PR from Linh: !121:Add a checking mechanism for the need of Renormalize pass in Parse pipeline

Hi, Linh. Thank you for your RFC & PR. It is a good step to do type-preserving opt to reduce the cost of renorm.

I have two questions.
Question 1: Can the renorm of opt_a also be checked? As in the last iteration of opt_a, the renorm may be redundant.

Question 2: I think it may be insufficient: is there a case where the newly substituted result node is typed while some of its inputs are constructed untyped? If there is no such case in our current opt_b passes, we need to add this as a new rule to all passes: if there are some untyped nodes among the substituted nodes, the root of these nodes must be untyped.
There are three choices we can take to make sure all the sub-passes are "shallow".
Choice 1 is to obey the rule manually. Every new subtree used for replacement has to obey the rule "if ever a subnode is untyped then the root must be untyped".
Choice 2 is to use "local type inference". Before any substitution happens, type information is complete, so all the types of the parameters are there; every subnode in the subtree that is a parameter can be easily typed. In that way an easy bottom-up, linear-time inference is possible. But if there is a global-level renorm behind the current pass, the local-level type inference is redundant.
Choice 3 is to refactor our IR interface to have a single "standardized" method of graph modification in our infrastructure, like LLVM's replaceAllUsesWith, so that we can make sure all the new nodes are typed or untyped.
I think we can combine Choice 1 and Choice 3 to add a check when we replace the nodes. By using nodes() or all_nodes() from the graph manager, we can traverse all the new nodes from the root node easily.
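For illustration, here is a small sketch of the rule from Choice 1, checked at a single replacement point as Choice 3 suggests: if any node in the newly substituted subtree is untyped, the root must also be untyped. Node, CollectSubtree and ObeysUntypedRootRule are hypothetical names, not the real graph-manager API.

```cpp
// Hypothetical check of the "untyped subnode implies untyped root" rule.
#include <cassert>
#include <memory>
#include <vector>

struct Node {
  bool has_abstract = false;
  std::vector<std::shared_ptr<Node>> inputs;
};
using NodePtr = std::shared_ptr<Node>;

// Depth-first collection of the subtree rooted at `root`.
void CollectSubtree(const NodePtr &root, std::vector<NodePtr> *out) {
  out->push_back(root);
  for (const auto &input : root->inputs) CollectSubtree(input, out);
}

// Returns true iff the subtree obeys "an untyped subnode implies an untyped root".
bool ObeysUntypedRootRule(const NodePtr &root) {
  std::vector<NodePtr> nodes;
  CollectSubtree(root, &nodes);
  bool any_untyped = false;
  for (const auto &node : nodes) {
    if (!node->has_abstract) any_untyped = true;
  }
  return !any_untyped || !root->has_abstract;
}

int main() {
  auto leaf = std::make_shared<Node>();  // untyped leaf
  auto root = std::make_shared<Node>();
  root->inputs.push_back(leaf);
  root->has_abstract = false;            // untyped root over untyped leaf: rule holds
  assert(ObeysUntypedRootRule(root));
  root->has_abstract = true;             // typed root over untyped leaf: rule broken
  assert(!ObeysUntypedRootRule(root));
  return 0;
}
```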

@kungsen : Thank you very much for sharing. Yes, we can have different possible solutions here. For long-term purposes, in my opinion Choice 3 is the best. I have just 2 concerns. One is: once the new "standardized" method is implemented, will all the passes be re-written or can they stay as-is? The other is: would the new "standardized" method need the "local type inference"? I think it would still need it, since it would be much cleaner if each pass takes care of its newly generated nodes; at the graph level the total number of nodes can be very high in some cases (BERT may have ~30k nodes), so a whole-tree walk would be costly.

@kungsen I actually don't quite understand your concern about "redundant computation" in Choice 2, because in my opinion Choice 1 and Choice 2 have the same linear complexity (a traversal of the subtree). The only reason I think Choice 1 should be preferred is its simplicity of implementation; but Choice 2 makes more sense because its logic is more direct than Choice 1 (we are facing problems of untyped nodes).
I think the "redundancy" only appears in the sense of code duplication.

@kungsen , @thlinh and @kaitingwang talked about the idea of amortizing your Choice 1 by introducing a template class. We have come up with an RFC #I1DXI0: "Typed"-Tag Indicating well-typedness of a whole subtree for Type Preserving, which should be helpful for this patch. Please take a look. Thanks :)

@thlinh As you modified the checking action, could you please update the results for the different networks mentioned in this issue? I would also prefer the total time to be added to the results; you can get it easily from the log.

@kungsen : Related to your question 1 (Can the renorm of opt_a also be checked? As in the last iteration of opt_a, the renorm may be redundant): the answer is yes, it's possible. I'll later extend the solution with signalling between the sub-passes of a group. In the 1st version of this RFC, we only check for the whole group of sub-passes.
More on your question 2: I think we can also add a property flag to each sub-pass to define whether the given pass is "shallow" or not, i.e. whether for that pass we just need to check the "result" node or the whole tree rooted at the "result" node. For the case where we need to check the whole tree rooted at "result", please see https://gitee.com/mindspore/dashboard?issue_id=I1DXI0 for the idea of @Ende.Jin and @kaitingwang that we can save checking time by knowing which subtrees are already fully typed.

@thlinh I think a small adaptation of the "standardized" method for all passes is acceptable. Before the "standardized" method is ready, we can obey the rule manually to get the gain, as "if ever a subnode is untyped then the root must be untyped" is easy to obey manually. The "local type inference" is still needed.

@Ende.Jin The "redundancy" means that if there is a global-level renorm later, doing "local type infer" in every pass is redundant. It would be preferred if we could collect all untyped nodes and only do the "local type infer" on those.

@biffex : I have re-run the tests with the MS master repo (before the PR) and with the changes in the PR. Here are the latest results (run on my blue laptop). In general the checking takes some amount of time, but the total performance of the renormalize passes is still improved by about 70%.

  • For BERT: Originally, opt_b.renormalize takes about 2.93% of the total (parsing) time and opt_control.renormalize takes 1.47%, for a total of 4.5%. With the changes in this RFC, the total time drops to about 1.64%.
    (image)
  • For the RESNET train test: Originally, the opt_b and opt_control passes use 5.4% of the total time; with this RFC, the total time drops to 1.99%.
    (image)
  • For the RESNET infer test: Originally, the time for opt_b (there is no opt_control in this case) was 0.67%. With this RFC, the time drops to almost 0%.
    (image)
  • For the LeNet network: Originally, the time for the opt_b and opt_control passes was 3.64%. With this RFC, the time drops to 1.25%.
    (image)

@kungsen : There are cases (for example in test_graph_return_const_param.py) where the "return" was a single node of type AbstractTensor (no other nodes were generated) but a full Renormalize is still needed for it to be well-typed. And there are obviously (as can be seen in validator.cc) types that are not supported later (at the back-end), which will cause the validating agent to raise an exception there. So for those types, I still call Renormalize to: 1. have the node further processed; 2. give a part of Renormalize (like Specialize) a chance to deal with those nodes with unsupported types so that they would pass validation later. By adding more cases to "2", we sacrifice a little performance in the hope of supporting more cases, but it is safer for the parser.

@thlinh I see the changes in your PR. Thank you for your deep analysis. In my opinion, we prefer a simple rule to guide the need for renorm instead of a complex condition tested by CI. Otherwise, if others add a new pass, they have no idea whether the new node will be renormalized. I have two suggestions for this PR before partial renorm is ready.
Suggestion 1: Find a simple rule to follow; we can change some pass implementations to follow this rule.
Suggestion 2: Manually mark nodes that need further renorm in each pass, as the developer of each pass should know whether the replaced node needs renorm.

@kungsen : Thank you very much for your suggestions. Since the main idea of this RFC is at the level of groups of sub-passes, I'll go with your Suggestion 2 first. I will add two optional flags to the Substitution: "is_renorm_in_watch" and "is_validate_abstract". The first flag will force a Renormalize call if that pass returns new nodes (if the pass produced nothing, Renormalize will depend on the other passes). The second flag will force a check of whether the nodes have acceptable types (otherwise they would later be rejected by the validator anyway). I set these flags for the two passes "make_ref_eliminate" and "replace_refkey_by_param" as examples. The tests went well for me.
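A hedged sketch of how the two optional flags could drive the decision is shown below. Only the flag names "is_renorm_in_watch" and "is_validate_abstract" come from the comment above; the Substitution and Node structures and the surrounding logic are simplified assumptions, not the real MindSpore implementation.

```cpp
// Hypothetical sketch of the two optional Substitution flags.
#include <functional>
#include <memory>
#include <string>

struct Node {
  bool has_abstract = false;
  bool abstract_supported = true;  // whether the back-end accepts this type
};
using NodePtr = std::shared_ptr<Node>;

struct Substitution {
  std::string name;
  std::function<NodePtr(const NodePtr &)> transform;
  bool is_renorm_in_watch = false;    // force Renormalize when this pass creates nodes
  bool is_validate_abstract = false;  // also check the abstract against validator rules
};

// Apply one substitution and decide whether its result requires Renormalize.
bool NeedsRenormalize(const Substitution &sub, const NodePtr &matched) {
  NodePtr result = sub.transform(matched);
  if (result == nullptr) {
    return false;  // nothing replaced: Renormalize depends on the other passes
  }
  if (sub.is_renorm_in_watch) {
    return true;   // new nodes from a watched pass always trigger Renormalize
  }
  if (sub.is_validate_abstract && (!result->has_abstract || !result->abstract_supported)) {
    return true;   // give Renormalize/Specialize a chance before validation
  }
  return false;
}

int main() {
  Substitution make_ref_eliminate{"make_ref_eliminate",
                                  [](const NodePtr &) { return std::make_shared<Node>(); },
                                  true, false};
  return NeedsRenormalize(make_ref_eliminate, std::make_shared<Node>()) ? 0 : 1;
}
```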

@gongchen-0812 Can this issue be closed?

Yes, the related MR has been merged.

